Proteins are the molecules that make biology work. They dictate cellular structure and activity, provide the mechanisms for signaling between cells and tissues, and catalyze chemical reactions that support metabolism. The protein structure dictates the function (or dysfunction). Proteins can be the root cause of diseases (such as Alzheimer’s or Huntington’s disease), and they can be used to cure them (e.g., antibodies are used as therapeutics against viral and bacterial infections). It comes as no surprise then that Sanger and Tuppy (2) and Edman et al. (3) analyzed the amino acid (AA) sequences of proteins first, early in the 1950s. Later, teams led by Holley (4) working on transfer RNA and Sanger (5) working on ribosomal RNA performed the first RNA sequencing. DNA sequencing followed using a variety of methods, both additive and degradative (6). Exploiting the development of polymerase chain reaction and other enzymatic methods, DNA sequencing became the focus, with relentless improvement in yield, throughput, and cost exceeding “Moore’s law” (7), which has been used to gauge improvements in semiconductor device performance over the years. RNA has benefited similarly because of the use of reverse transcriptase to make complementary DNA (cDNA) from RNA, which is then sequenced with DNA sequencing methods. Sequencing DNA and RNA has produced an enormous amount of data. One measure is the size of the Sequence Read Archive (8), a public repository of research sequencing data at the National Center for Biotechnology Information, which currently hosts ~33 P. However, despite the early start, sequencing proteins has lagged behind.
Because they are affordable (the price to sequence a whole genome fell to about $1000 in 2016) and so easy to use, genomic (DNA) and transcriptomic (RNA) sequencing have also been applied to characterize the primary structure of a protein indirectly, but they do not capture the full spectrum of protein-coding genes. For example, paradoxically, an analysis of a “well-characterized” human transcriptome has revealed 116,156 novel transcripts not present in existing databases (9). Genome assembly does not perfectly capture protein coding de novo, as many assemblies retain an error rate of 0.1%, which, in a 5-Mb genome like that of Escherichia coli, corresponds to about 5000 errors. The insertion/deletion (indel) of a single base, the predominant error in long-read sequencing technologies, can cause a titanic change in the inferred primary structure of the protein by frameshifting. Long-read human genome assemblies retain indels in as many as 580 assembled transcripts (1.5%) (10), which makes it difficult to distinguish mutations from artifacts. On the other hand, frameshifts can be detected easily by looking directly at the AA sequence.
Moreover, measurement of RNA transcription offers only a deceptive link to proteins in a cell or tissue; it does not provide a quantitative measure of the protein level (11). There are many gene-specific effects in translational efficiency, such as posttranscriptional regulation including RNA modifications (12) and even the lengths of the polyA tails added to RNA, that can change the lifetime and the rate of protein production from these mRNAs (13), which necessitates unambiguous detection of the protein. Thus, although it is relatively inexpensive to do so, reading the genome or transcriptome does not buy everything. The prevalence of heterogeneity in mRNA translation (1), posttranslational modifications (PTMs), and posttranslational structural processing are revealed only by direct protein-level analysis (14), and there is a pressing need for it.
However, sequencing a whole protein is a tall order. The primary structure consists of a linear sequence, drawn from 20 proteinogenic AAs with an average volume of about 0.1 nm³, linked by peptide bonds separated by only 0.38 nm in equilibrium. Human proteins have about 375 AA residues (15). So, several hundred AA calls that discriminate between sub-cubic-nanometer volumes with subnanometer resolution are required to sequence one. Beyond just the 20 proteinogenic AAs, the challenge confronting direct protein sequencing is compounded by isoforms.
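The effect of a single-base indel on the inferred primary structure, discussed in the text, can be illustrated with a minimal Python sketch. It is not taken from the source: the function names `translate` and `delete_base` and the 21-base toy sequence are hypothetical, and the codon table is the standard genetic code written in the conventional T, C, A, G ordering.

```python
# Toy illustration of a frameshift: deleting one base changes every
# downstream codon, and therefore most of the inferred AA sequence.
# All names and the example sequence are hypothetical.

BASES = "TCAG"
# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence in frame 0, ignoring trailing bases."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def delete_base(dna: str, index: int) -> str:
    """Return the sequence with the base at `index` removed (a 1-base deletion)."""
    return dna[:index] + dna[index + 1:]


reference = "ATGGCTAAAGGTGAACTGTTC"      # ATG GCT AAA GGT GAA CTG TTC
mutant = delete_base(reference, 4)       # drop the 'C' of the second codon

ref_protein = translate(reference)       # "MAKGELF"
mut_protein = translate(mutant)          # "MVKVNC"

# Count residues that differ over the shared length of the two translations.
mismatches = sum(a != b for a, b in zip(ref_protein, mut_protein))

print(ref_protein, mut_protein, mismatches)
```

In this toy case, one deleted base alters four of the six residues that can still be translated. A sequencing error in the read and a real mutation would produce the same altered translation, which is why they are hard to tell apart from nucleic acid data alone.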