A late development of genetics occurring after the inception of protein-catalyzed peptide bond synthesis requires that information in primordial peptides and proteins that preceded mature forms of dipeptidases and ligases be coopted by the emerging genetic system. Specifically, the primordial structural cores of the four FFs that appeared prior to the catalytic domains of aaRSs must be encoded by the emerging aaRS-based genetic system in interaction with primordial tRNA or cofactors.
In other words, useful links between amino acid composition and structure that benefit molecular functions must be coordinately registered as nucleic acid cofactor-linked specificities by the emerging dipeptidases and ligases, making those same structures accessible to the cells. However, if early peptides were quasi-statistical ensembles curbed by strong biases in amino acid and dipeptide makeup, as previous experimental evidence suggests, then these biases would be carried on and coopted by the emerging nucleic acid-protein recognition mechanisms that were being embedded in the unfolding translation system.
This would have ensured that the emergent coding rules preserved the primordial structural biases of the initial four FFs that existed at that time. In fact, a recent study of the effect of mistranslation on the codons of amino acids that bind or do not bind to small molecular ligands supports the robustness to genotypic change of functions linked to active sites in proteins (phenotypes).
These findings provide further support to a long line of evidence suggesting that natural selection for error mitigation can affect the robustness of more elementary units of molecular structure or prior forms at the translation level e. Here, hypothesis h is a conjunction ensemble that includes phylogenomic trees, timelines of aaRS domains and tRNA structures, the minimum number of additional steps required to force hypotheses of monophyly, and statistical links between amino acid and dipeptide compositions associated with protein secondary structure and historical statements, all of which explain the actual patterns and processes of origin.
Evidence e includes matrices of FF structural homologies and tRNA substructural features from information in genomes and molecules, biases in amino acid and dipeptide compositions of FFs, genetic code complementarity and degeneracy and mappings to amino acid physicochemical properties, and functional and structural evidence from biochemistry and structural biology that supports amino acid charging specificities. A number of assumptions relate to e that we want to make explicit: (i) Structural homologies at FF level and tRNA substructure level rely on the ability of hidden Markov models (HMMs) to encapsulate the structural design of molecules even in the presence of continuities in the space of protein and RNA sequence.
This requires establishing optimal E-value cutoffs capable of maximizing consistency and information content of thousands of phylogenetic characters, optimizing methods of topological correspondence, and increasing the machine learning power of HMMs of structural recognition. This necessitates confidence in the statistical approaches and biological rationale we use to link history, structure and amino acid composition, especially because our understanding of the operational code and archaic CDPSs and aaRSs is limited and mostly based on interactions with tRNA or RNA analogs, and not with other cofactors.
Background knowledge b includes support for character transformation series defined by the evolutionary model of growth of molecular structures and hypotheses of character polarization and rooting (see Materials and Methods), the existence of a molecular clock of domain structures that links phylogenomic evidence with evidence from fossils and geochemical, biochemical, and biomarker data, minimization of risks of confusion in the recognition of complementary anticodon-codon pairs, reduction of genetic code ambiguity, and compliance with evidence from biochemistry, molecular biology and genomics that we will not make explicit here.
Increases in molecular abundance with time ensure innovations are not easily lost, and structural canalization of conformers enables durable molecular functions. Important predictions can be made from h, some of which we have already discussed, including the effect of compositional biases on mutational robustness. In particular, the history of dipeptides and other elementary units of molecular structure is expected to manifest in fast-evolving genomic sequences responsible for de novo creation of genes as well as in the protracted history of mature protein-encoding genes.
Phylogenetic analysis should be able to make that history evident. In this study we mapped the evolution of aaRS domains in a published evolutionary timeline of domain appearance at the fold family (FF) level of structural abstraction. This timeline was selected for a number of reasons: FFs generally provide structures with unambiguous assignments of molecular functions, the timeline is well annotated, and results can be benchmarked to a description of the rise of early structures and functions.
The timeline was derived from a phylogenomic tree of 2, FF structures out of 3, defined by the Structural Classification of Proteins (SCOP) 1. The timeline was for all practical purposes congruent with a timeline derived from a phylogenomic tree of 3, FFs out of 3, defined by SCOP 1.
We note that the phylogenomic approach based on structure summarized in the flow diagram of Figure 1A is impervious to a number of limitations that plague sequence analysis, such as problems of alignment, character independence, inapplicable characters, saturation and taxon sampling, and is even robust against uneven sampling of genomes across the three superkingdoms. Despite robust evolutionary trends across phylogenies, the exact order of closely positioned FFs can be debatable in phylogenetic reconstructions of trees with thousands of leaves.
For this reason, we sub-selected domains that were part of aaRSs and generated rooted trees describing the evolution of only the FFs associated with these enzymes (Figure S1, panel A).
Tree reconstructions were carried out using maximum parsimony as the optimality criterion and a combined parsimony ratchet as previously described. The trees were rooted by the Lundberg method, which does not impose a requirement of outgroup taxa.
Phylogenetic reliability was evaluated by the nonparametric bootstrap method with 1, replicates, with resampling size equal to the number of genomes sampled, tree bisection-reconnection (TBR) branch swapping, and maxtrees unrestricted. The structure of phylogenetic signal in the data was tested by the skewness (g1) of the length distribution of 1, random trees. Recovered trees were well resolved and had basal topologies that matched those of homologous subtrees in the published trees of FFs.
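The g1 statistic used above can be illustrated with a short sketch. It assumes that the lengths of the random trees have already been exported from a phylogenetics program as plain numbers; the function name `g1_skewness` and the sample lengths are hypothetical and are not taken from this study.

```python
def g1_skewness(lengths):
    """Sample skewness g1 = m3 / m2**1.5 of a list of tree lengths."""
    n = len(lengths)
    if n < 3:
        raise ValueError("need at least three tree lengths")
    mean = sum(lengths) / n
    # Second and third central moments of the length distribution.
    m2 = sum((x - mean) ** 2 for x in lengths) / n
    m3 = sum((x - mean) ** 3 for x in lengths) / n
    if m2 == 0:
        raise ValueError("all tree lengths are identical")
    return m3 / m2 ** 1.5


# Hypothetical lengths (in steps) of random trees; illustration only.
random_tree_lengths = [100, 118, 119, 120, 120, 121, 122]
g1 = g1_skewness(random_tree_lengths)
print(f"g1 = {g1:.3f}")
```

A strongly negative g1 (a left-skewed distribution, with few trees much shorter than the rest) is conventionally read as evidence of phylogenetic signal, whereas values near zero suggest a distribution close to that expected from random data.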
Bootstrap support (BS) values for basal branches were and ranged 56–83 in more derived branches; the very derived regions were variable within the 9 most parsimonious trees that were retained. This indicates that the topologies provide strong support for phylogenetic statements, with support increasing towards the base of the tree. In a recent study, we also reconstructed trees of aaRS domains. These trees were derived from a census of protein structure in 1, genomes that included organisms in the three superkingdoms and viruses. Again, topologies were remarkably consistent.
Given these results, we used domain ages obtained from the global published phylogenies to place aaRSs in the timeline along with other domains linked to the ribosome and non-ribosomal protein synthetases (NRPS) that we used as reference (Figure 2).
For simplicity, domains are identified here with concise classification strings (ccs). The relative age of protein structures (nd) was calculated directly from the rooted trees using a script that counts the number of nodes from the root (base of the tree) to each leaf and provides it in a relative zero-to-one scale. These nd values take advantage of the highly imbalanced nature of the trees of domain structures, as recently discussed.
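The node-counting step behind nd can be sketched as follows. This is not the script used in the study; it is a minimal illustration that assumes a rooted tree encoded as nested tuples with leaf names as strings, and it assumes that the root itself is not counted so that the most basal leaf receives nd = 0. The names `node_counts`, `relative_ages` and the toy tree are hypothetical.

```python
def node_counts(tree, depth=0, counts=None):
    """Map each leaf to the number of internal nodes between the root and that leaf.

    The root is not counted (an assumption made here so that the most
    basal leaf gets a count of zero).
    """
    if counts is None:
        counts = {}
    if isinstance(tree, str):            # a leaf
        counts[tree] = max(depth - 1, 0)
    else:                                # an internal node: visit its children
        for child in tree:
            node_counts(child, depth + 1, counts)
    return counts


def relative_ages(tree):
    """Rescale node counts to a zero-to-one scale (0 = most basal, 1 = most derived)."""
    counts = node_counts(tree)
    deepest = max(counts.values())
    if deepest == 0:
        return {leaf: 0.0 for leaf in counts}
    return {leaf: c / deepest for leaf, c in counts.items()}


# Hypothetical, highly imbalanced (comb-like) rooted tree of four domains.
toy_tree = ("ff1", ("ff2", ("ff3", "ff4")))
nd = relative_ages(toy_tree)
print(nd)
```

In a comb-like tree such as this one, each successive branching adds one node to the path, which is why imbalanced topologies yield a well-spread ordering of leaves along the zero-to-one scale.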
Tree imbalance in these trees is a natural consequence of a heritable trait, the accumulation of domain structures in proteins and proteomes, which naturally poises speciation. Moreover, we find that trees do not follow random or Yule models of speciation, which can be considered to drive the evolution of species. The nd values are also a good proxy for geological time. We note that extending the clock to FFs showed that domain age continued to be proportional to time but with larger dispersion at high nd FF values.
Clocks were calibrated with geological ages derived from the study of fossils and geochemical, biochemical, and biomarker data, which are affected by the validity of the assumptions used in each and every one of the supporting studies. We also note that the molecular clock derived from trees of Fs and FSFs is necessarily dependent on the rates of domain discovery and accumulation that could be deviant for some domain structures.
These factors could cause departures from a clock, with overdispersion sometimes resulting from changes in foldability and structural stability of domains. The ages of tRNA molecules used here were derived from published timelines of amino acid charging and encoding generated from trees of tRNA structure.
The method extracts phylogenetic signatures from structural topology in RNA. These signatures are drawn from links between secondary structure and conformation, dynamics and adaptation. Geometrical and statistical features of RNA substructures are scored in thousands of molecules and this information is analyzed with modern phylogenetic methods to produce trees of molecules and trees of substructures that portray the history of the system (molecules) or its component parts (substructures), respectively (Figure 1C). The phylogenetic model automatically roots the trees by assuming conformational stability increases in evolution as structures become canalized.
The validity of polarization and rooting depends on the axiomatic component of character transformation, which is supported by considerable evidence and is also falsifiable. Phylogenetic constraint analysis restricts the search for optimal trees of tRNAs to pre-specified topologies and can provide important insights from trees of molecules.
Here, the minimum number of additional steps (S) required to force groups of tRNA taxa in trees (non-mutually exclusive hypotheses) defined measures of ancestrality of individual groups and was used to build evolutionary timelines of amino acid charging and amino acid encoding. Hypotheses with smaller S were considered less affected by recruitment and represented processes that were more ancient. Using this approach, chronologies of amino acid charging and codon discovery were directly derived for isoacceptor (S_aac) and anticodon-specific (S_cod) tRNAs, respectively.
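The ranking of constraint hypotheses by S can be sketched as below. The sketch assumes that S is obtained as the length of the shortest tree found under a constraint minus the length of the unconstrained most parsimonious tree, and that those lengths have already been produced by a tree search. The function names, group labels and lengths are hypothetical and do not reproduce any value from this study.

```python
def additional_steps(unconstrained_length, constrained_lengths):
    """Return S for each hypothesis: constrained length minus unconstrained length."""
    s_values = {}
    for name, length in constrained_lengths.items():
        if length < unconstrained_length:
            # A constrained search cannot beat the unconstrained optimum.
            raise ValueError(f"{name}: constrained tree shorter than optimal tree")
        s_values[name] = length - unconstrained_length
    return s_values


def rank_by_ancestrality(s_values):
    """Order hypotheses from smallest S (inferred most ancient) to largest S."""
    return sorted(s_values, key=lambda name: (s_values[name], name))


# Hypothetical tree lengths (in steps) for three tRNA groups; illustration only.
optimal_length = 100
constrained = {"group_a": 100, "group_b": 107, "group_c": 103}

s = additional_steps(optimal_length, constrained)
chronology = rank_by_ancestrality(s)
print(s)
print(chronology)
```

Ties in S are broken alphabetically here purely to make the ordering deterministic; how ties should be handled in a real chronology is a separate decision.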
The validity of character argumentation and the assumption that groups requiring a lower number of steps are more ancient were derived from the rooted trees and the model of character polarization. Regression lines unfolded evolutionary timelines of archaic editing functions and anticodon-binding specificities, once editing and other accessory domains were identified.
Regression timelines were also compared to a conservative idealized timeline that spans domain age and underweights tRNA evolution relative rates of structural change in protein domains can be 4. Putative imprints in the primordial complementarity proposed to exist in the genetic code were borrowed directly from Rodin and Rodin.
Identity elements in tRNA were identified by in vitro and in vivo approaches. We analyzed amino acid frequencies in secondary structures for a non-culled and a culled set of protein entries from the Protein Data Bank (PDB) (Figure S1). Secondary structures were assigned using the DSSP program.
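Tallying amino acid frequencies by secondary structure, as described above, might look like the following sketch. It assumes that each protein has already been reduced to two equal-length strings, the amino acid sequence and the per-residue DSSP assignment (for example H for alpha helix, E for extended strand, T for turn), and it does not parse DSSP output files. The function name and the toy strings are hypothetical.

```python
from collections import Counter, defaultdict


def frequencies_by_structure(sequence, dssp):
    """Return, for each DSSP code, the relative frequency of each amino acid."""
    if len(sequence) != len(dssp):
        raise ValueError("sequence and DSSP string must have the same length")
    counts = defaultdict(Counter)
    for residue, code in zip(sequence, dssp):
        counts[code][residue] += 1
    # Normalize counts within each secondary structure class.
    return {
        code: {aa: n / sum(tally.values()) for aa, n in tally.items()}
        for code, tally in counts.items()
    }


# Toy example: two helical residues followed by two turn residues.
freqs = frequencies_by_structure("AAGS", "HHTT")
print(freqs)
```

Frequencies are normalized within each structural class, so they describe the composition of helices, strands or turns rather than the share of each class in the protein.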
The non-culled set included , domain sequences 51,, amino acids downloaded from the PDB June 20, The culled set included 6, sequences 1,, amino acids. All PDB sequences with two or more FFs or unassigned ranges longer than 30 amino acids were eliminated from the study. We also examined whether each of the 20 amino acids, or each of the possible dipeptides (including those with the 2 ambiguous amino acids, Z and X), was enriched in FFs that were more ancient than the oldest anticodon-binding domain of the timeline sequences appearing earlier than c.
For each of the two sequence sets, we counted the number of occurrences of amino acids or dipeptides and then calculated the probability of enrichment of every amino acid or dipeptide that was present in the ancient sequence set using the hypergeometric distribution, with the quantities defined as follows.
Observed values M and k indicate the numbers of multiple occurrences of examined amino acids or dipeptides in the 2,sequence and the sequence set, respectively.
The values N and n are the numbers of multiple occurrences of all amino acids or dipeptides in the two sequence sets, respectively. GO terms define a vocabulary of molecular functions, biological processes, and cellular components. We examined GO terms linked to molecular functions of translation.
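The enrichment probability defined by M, k, N and n can be sketched as an upper-tail hypergeometric sum. The exact form used in the study is not reproduced here; the sketch assumes the standard expression P(X >= k) = sum over i from k to min(M, n) of C(M, i) C(N - M, n - i) / C(N, n), with M and N taken from the full sequence set and k and n from the ancient set (that mapping is an assumption). The function name is hypothetical.

```python
from math import comb


def enrichment_probability(k, n, M, N):
    """Upper-tail hypergeometric probability P(X >= k).

    k: occurrences of the examined amino acid or dipeptide in the ancient set
    n: occurrences of all amino acids or dipeptides in the ancient set
    M: occurrences of the examined amino acid or dipeptide in the full set
    N: occurrences of all amino acids or dipeptides in the full set
    """
    if not (0 <= k <= n <= N and k <= M <= N):
        raise ValueError("counts are inconsistent")
    total = comb(N, n)
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, min(M, n) + 1))
    return tail / total


# Toy counts, small enough for exact integer arithmetic; illustration only.
p = enrichment_probability(k=4, n=5, M=5, N=10)
print(p)
```

Exact binomial coefficients are practical only for small counts; for data sets with millions of residues a numerical library routine for the hypergeometric distribution would be more appropriate.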
Annotations were mapped onto the domain timeline. The DALI server performs pairwise comparisons to PDB90 based on a systematic branch-and-bound search that returns non-overlapping solutions in decreasing order of alignment Z-scores. Selected subsets of structural neighbors were visualized in multiple 3D superpositions to show structural and sequence conservation.

Evolutionary accretion of domains in aaRS enzymes. One of nine most parsimonious phylogenomic tree reconstructions describing the history of the aaRS protein domains analyzed in this study.
Terminal leaves are colored according to aaRS class (class I, blue leaves; class II, coral red leaves) and indexed with aaRS domains labeled with concise classification strings (ccs). The tree matches the corresponding subtree in the global tree of FFs described in the next panel. Terminal leaves are not labeled in the tree since they would not be legible. The Venn diagram shows occurrence of FFs in the three superkingdoms.

Evolutionary timeline of domain innovation. Domain ages (arrowheads) are mapped along a timeline of FF domain appearance derived from the global phylogenetic tree of FFs.
For reference, the timeline is indexed with landmarks derived from domain history. Dashed black lines indicate aaRS history prior to the appearance of the first accessory domain in the structure. A molecular clock of domain structures places the relative timeline in a geological time scale in billions of years (Gy). The SerRS, LysRS, and MetRS enzymes lack distinct editing domains and probably hydrolyze misactivated amino acids in the active site of the catalytic domain.
TyrRS lacks internal editing functions but contains a short connecting segment that is homologous to the CP1 editing domain of LeuRS, which harbors species-specific acceptor helix recognition properties. Val mimics the hydrophobic qualities of cognate Ile and fits in the catalytic pocket of the IleRS enzyme. Similarly, Thr is misactivated by ValRS since Thr and Val have similar physical and chemical properties and the same barrel structure provides hydrolytic activities.
In general, amino acid sieving functions of editing domains appear before recognition of identity elements in the anticodon arm of tRNA by anticodon-binding domains. Accretion encompasses over 2.

The origin and evolution of the standard genetic code. Distribution of age groups of domains with editing (1, 2 and 3) and anticodon-binding (A, B and C) functions, groups in exchange graphs, and active site participation in Venn diagrams of amino acids describing their physicochemical properties.
Venn diagrams show that the origin of amino acid charging in Group 1 specificities was associated with a polar, turn-inducing and active-site promoting amino acid (Ser) and hydrophobic aromatic (Tyr) and aliphatic (Leu) counterparts. In turn, the start of genetic encoding was associated with small turn-inducing amino acids. We note, however, that ancient Groups 1 and 2 domains charge amino acids with low active site-participation frequencies, the only exception being Ser, while Group 3 exhibits the opposite trend. The origin of the standard genetic code derived from expansion groups A, B and C (Figure S2) was associated with small and hydrophobic amino acids, supporting early protein links to membrane environments.