Characterization of natural selection on gene duplicates
Specific Aims
Gene duplication is an important source of evolutionary novelty. Various mechanisms have been proposed to explain how duplicates are retained in the genome, and numerous empirical examples exist for each. Recently, researchers developed a phylogenetic approach that classifies duplicates based on their expression profiles, making it possible to distinguish among retention mechanisms at the genome level. However, little is known about the role of natural selection on gene duplicates, particularly the younger copies. We propose to address this problem by characterizing natural selection acting on young duplicates in different species. First, we will use the data set on retention mechanisms of young Drosophila duplicates and ask whether different types of selection act on young duplicates retained through different mechanisms. Next, we will apply the same classification method to plant duplicates in the Poaceae lineage. We will examine how duplicates are retained in plant genomes and how selection drives their retention.
Aim 1. To investigate the types, targets and timing of natural selection that act on young Drosophila duplicates.
We will first estimate the protein sequence evolutionary rate by computing Ka/Ks ratios of young Drosophila duplicates. We will implement the HKA test to determine the type of selection acting on young Drosophila duplicates retained through different mechanisms. We hypothesize that positive selection drives neofunctionalization, negative selection maintains conservation, and a combined action of positive selection and relaxed constraint underlies specialization. We will apply the HKA test to different genic regions and expect similar distributions of HKA statistics across all regions. Next, we will apply the tree-based method RELAX to evaluate the timing of selection. We expect that selection will be strongest on branches soon after duplication. Lastly, we will examine the functions that are enriched in young duplicates that underwent a recent selective sweep.
Aim 2. To classify plant duplicates into different retention mechanisms and compare their sequence evolutionary rates.
We will first identify duplicate pairs as those for which two copies are present in one species and one copy is present in each of the two sister species as well as in two outgroup species. For each pair, the copy that is orthologous to the ancestral single-copy gene will be designated the old duplicate, and the other copy the young duplicate. We will then use CDROM to classify duplicate retention mechanisms in these plant species. We will determine whether a single retention mechanism is prevalent across these species or whether the predominant mechanism differs from one species to another. We hypothesize that most plant duplicates are preserved through conservation. Moreover, we will compare the Ka values among duplicates retained through different mechanisms. We expect to see elevated Ka values in duplicates acquiring new functions compared to duplicates preserving ancestral functions.
Aim 3. To assess the association between selection and its functional result in plant duplicates.
We will compare tissue specificity in young plant duplicates to further confirm our classification. We expect that duplicates with new functions will have higher tissue specificity than those with old functions. We will then use TopGO to find the enriched functions in young duplicates. We expect most of them to be related to reproduction. Next, we will identify duplicates with signals of a recent selective sweep using nSL. We will examine the functions of the duplicates that underwent recent positive selection and ask whether they are the same as the enriched functions. We hypothesize that functions that are enriched in young duplicates are also the functions that have been selected for.
Significance
Gene duplication provides raw material for natural selection to act upon and thus plays a key role in evolutionary novelty [1]. In most scenarios, due to the redundancy created by duplication, one copy will undergo relaxed selective constraint [1], resulting in an accumulation of deleterious mutations and eventual pseudogenization within a few million years [2]. However, numerous functional duplicates are present in eukaryotic genomes, many for hundreds of millions of years [3-9]. Four mechanisms have been proposed to explain the long-term retention of gene duplicates (Figure 1). The first is conservation, in which both copies maintain the ancestral function after duplication, usually due to the benefits of increased dosage [1, 9]. The second is subfunctionalization, in which the ancestral function is divided between the two copies [10, 11]. The third is neofunctionalization, in which one copy preserves the ancestral function while the other acquires a novel function [1]. The fourth is specialization, in which rapid subfunctionalization is followed by neofunctionalization, resulting in two copies with distinct functions [12, 13].
Although these mechanisms may explain how duplicates are retained in the genome, their relative abundance, as well as the role of natural selection, remains unclear. To determine the functional role of these mechanisms, one key step is to identify and compare the duplicates retained through different mechanisms, especially at the genome scale. Previously, researchers developed a phylogenetic approach that compares distances between expression profiles of duplicates in one species and their ancestral single-copy genes in a sister species, and classifies the duplicates into different mechanisms [14]. The underlying assumption is that changes in spatial expression profiles represent changes in gene function. Application of this method to Drosophila duplicates showed that most duplicate genes were retained by neofunctionalization, and that neofunctionalization almost always occurs in the young duplicates [14]. In this study, we seek to characterize the selective forces acting on those young Drosophila duplicates. In particular, we will investigate the type, target and timing of selection. We will also assess the association between selection and its functional result.
Plants harbor a large number of duplicates and are thus good model species for addressing questions about the functional evolution of gene duplicates. We choose to study Poaceae, the grass family, which provides us with food, feed, fuel and other benefits [15]. Poaceae belongs to the angiosperm (flowering plant) lineage, in which up to four whole-genome duplications (WGDs) have been inferred since its rise 150-200 million years ago (mya) [16-19]. One WGD that occurred ~70 mya profoundly influenced Poaceae [20-23]. As a result, species in Poaceae, including Brachypodium distachyon, rice (Oryza sativa) and sorghum (Sorghum bicolor), share many homologs. However, most duplicates were already highly differentiated from one another when sorghum and rice diverged ~50 mya [24]. For those that are retained, the mechanisms behind their retention and the roles selection plays during the retention process are of great interest. Thus, we will apply the phylogenetic approach developed by Assis and Bachtrog [14] to duplicates in the three plant species and ask which retention mechanism is most prevalent. Moreover, we will examine how selection leads to functional novelty in plant duplicates.
Figure 1. Functional evolution after gene duplication. Gene duplication results in two copies of the same gene. In most cases, one copy will accumulate deleterious mutations and become a pseudogene (nonfunctionalization). When two copies are retained in the genome, both may preserve the ancestral function (conservation), each may retain part of the ancestral function (subfunctionalization), one copy may preserve the old function while the other acquires a new one (neofunctionalization), or both copies may acquire specialized functions (specialization).
Innovation
The method developed by Assis and Bachtrog [14] performs genome-wide classification of the mechanisms retaining gene duplicates. It uses expression profiles as proxies for gene function and classifies duplicates based on expression divergence between duplicates and single-copy genes in a closely related sister species. Thus, the dataset generated by this method will provide novel insights into the functional evolution of gene duplicates. In the first part of the proposed project, we will use the data on retention mechanisms of young Drosophila duplicates generated by Assis and Bachtrog [14]. We will perform sequence-based as well as polymorphism-based analyses (Aim 1) to interrogate the role of natural selection in the functional evolution of gene duplicates. Likewise, we will apply the same classification method to duplicates in three closely related plant species (Aim 2). The classification results will reveal the prevalent mechanisms that retain plant duplicates and will serve as a new dataset. We can then implement similar population-genetic approaches to examine the selective forces that underlie the retention of plant duplicates (Aim 3).
Approach
Aim 1. To investigate the types, targets and timing of natural selection that act on young Drosophila duplicates.
Aim 1.1. Elucidate the type and genic targets of natural selection on gene duplicates.
Aim 1.2. Determine whether selection is relaxed after gene duplication.
Aim 1.3. Examine whether selection is associated with the emergence of new function in gene duplicates.
Strategy, Aim 1.1. Elucidate the type and genic targets of natural selection on gene duplicates. We first estimated the nonsynonymous protein sequence evolutionary rate by calculating Ka/Ks ratios between orthologs in D. melanogaster and D. simulans. We found that Ka/Ks ratios of the young duplicates are significantly higher than those of the single-copy genes. This result suggests that, regardless of the retention mechanism, young duplicates experience an increased evolutionary rate compared with single-copy genes [25]. However, we encountered two problems with the Ka/Ks ratio test. First, because of our small sample size, we have little power to distinguish protein sequence evolutionary rates among duplicates retained through different mechanisms. Second, the Ka/Ks ratio can only be calculated in protein-coding regions.
To address these issues, we will use a modified Hudson-Kreitman-Aguadé (HKA) test [26]. The original test uses a χ² statistic to compare expected and observed numbers of substitutions and polymorphisms between designated neutral and non-neutral regions. We will modify the test in two ways. First, we will use a sliding-window approach, in which we slide over the D. melanogaster genome and compare the substitution-to-polymorphism ratio in each window with that in the rest of the genome. Second, following the approach of Huber et al. [27], we will polarize χ² to indicate the type of selection. Specifically, when there is an excess of substitutions, which indicates positive selection, we will assign a positive χ² score; when there is an excess of polymorphisms, which indicates negative selection, we will assign a negative χ² score [28]. Our preliminary data, using a window size of 10,000 nucleotides and a step size of 1 nucleotide, are shown in Figure 2 [25]. In coding regions, the distribution of signed χ² scores of the single-copy genes is negatively biased [25]. The distribution of signed χ² scores of conserved duplicate genes is also negatively biased, and more negative than that of the single-copy genes [25]. The distribution of signed χ² scores of specialized duplicates is negatively biased as well, but less negative than those of the first two classes [25]. The distribution of signed χ² scores of neofunctionalized duplicates, however, is positively biased and significantly higher than those of all three other classes [25]. Thus, the results in coding regions support our hypothesis that negative selection maintains conservation, positive selection drives neofunctionalization, and a combined action of positive and negative selection underlies specialization [25]. The pattern of the distribution of signed χ² scores is similar in 3’ UTRs and 5’ UTRs [25]. In introns, however, the pattern is different [25].
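The polarized score described above can be illustrated with a minimal Python sketch. It assumes a plain 2x2 contingency table (focal window vs. rest of the genome, substitutions vs. polymorphisms) in place of the full HKA expectations, so it is a simplification rather than the procedure of [25] or [27]. The function name `signed_chi2` and the counts in the usage line are hypothetical.

```python
def signed_chi2(div_window, poly_window, div_rest, poly_rest):
    """Signed chi-square for a 2x2 table of substitutions vs. polymorphisms.

    Rows are the focal window and the rest of the genome; columns are
    substitutions and polymorphisms. The sign is positive when the window
    has at least as many substitutions as expected, negative otherwise.
    """
    observed = [[div_window, poly_window], [div_rest, poly_rest]]
    row_totals = [sum(row) for row in observed]
    col_totals = [div_window + div_rest, poly_window + poly_rest]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one count")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Polarize: compare observed and expected substitutions in the window.
    expected_div_window = row_totals[0] * col_totals[0] / total
    sign = 1.0 if div_window >= expected_div_window else -1.0
    return sign * chi2


# Hypothetical counts: a window with an excess of substitutions.
print(signed_chi2(30, 10, 1000, 1000))
```

In a sliding-window setting, the same function would be called once per window, with the remaining genome-wide counts supplied as the second row.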
Figure 2. Distributions of signed χ² scores of single-copy genes and of conserved, specialized, and neofunctionalized young D. melanogaster duplicates. (A) Coding regions, (B) introns, (C) 5’ UTRs, and (D) 3’ UTRs. Two-sample permutation tests were used to assess significant differences between each pair of distributions. * P<0.05, ** P<0.01, *** P<0.001. The dashed line indicates χ²=0. (Taken from Jiang and Assis 2017 [25])
One potential limitation of the HKA test is that it cannot differentiate synonymous from nonsynonymous mutations. To take the effect of the different types of mutation into account, we propose to apply the McDonald-Kreitman (MK) test to our dataset [29]. The MK test compares the ratio of nonsynonymous to synonymous polymorphism with the ratio of nonsynonymous to synonymous divergence. As with the HKA test, an excess of divergence would indicate positive selection, while an excess of polymorphism would indicate negative selection [29]. Following the approach of Andolfatto [30], we will extend the MK test to non-coding regions and determine whether the distribution of MK test results is consistent with that of the HKA test.
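The comparison at the heart of the MK test can be sketched in a few lines of Python. The sketch below computes the ratio of the two ratios named above, a quantity often called the neutrality index, and reports which side is in excess. The function name `mk_summary` and the example counts are hypothetical, and no significance test is included.

```python
def mk_summary(dn, ds, pn, ps):
    """Summarize an MK table.

    dn, ds: nonsynonymous and synonymous fixed differences (divergence).
    pn, ps: nonsynonymous and synonymous polymorphisms.
    Returns (index, direction), where index = (pn / ps) / (dn / ds).
    """
    if dn <= 0 or ds <= 0 or ps <= 0:
        raise ValueError("dn, ds and ps must be positive to form the ratios")
    index = (pn / ps) / (dn / ds)
    if index < 1:
        direction = "excess of divergence"
    elif index > 1:
        direction = "excess of polymorphism"
    else:
        direction = "no excess"
    return index, direction


# Hypothetical counts for one gene.
print(mk_summary(dn=20, ds=40, pn=5, ps=40))
```

For non-coding regions, the nonsynonymous counts would be replaced by counts from the region of interest and the synonymous counts by a putatively neutral reference class; which reference class to use is a design decision not fixed by this sketch.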
Strategy, Aim 1.2. Determine whether selection is relaxed after gene duplication. Next, we ask when selection took place after gene duplication in Drosophila melanogaster. We will apply the tree-based method RELAX, which compares the distribution of Ka/Ks ratios between focal branches and reference branches. RELAX computes an intensity parameter K to indicate the strength of selection [31]. Under the null model, K=1, and there is no difference between focal and reference branches. Alternatively, K<1 would suggest that focal branches underwent relaxed selection, while K>1 would suggest that focal branches underwent intensified selection [31]. RELAX then uses a likelihood ratio test to choose between the models and reports the P-value associated with each K value. We will apply RELAX to duplicates that arose after the divergence of D. pseudoobscura and D. melanogaster, yet before the divergence of D. ananassae and D. melanogaster. Duplicates of this particular age group allow us to track changes in the strength of selection after gene duplication (Figure 3) [25]. For each RELAX run, we set one branch as the focal branch and the rest of the tree as reference branches. Our initial results showed that on branches 1 and 2 most duplicates have K>1, suggesting that natural selection acts quickly after duplication [25]. On branch 3, K values vary but are generally smaller than those on branches 1 and 2 [25]. For most duplicates on branch 4, K<1 [25]. The decreasing trend of K suggests that selection is strongest soon after duplication and attenuates over time [25].
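How K and its P-value are read together can be illustrated with a short Python sketch. RELAX reports these values itself, so nothing here reimplements the method. The sketch assumes the likelihood ratio statistic is compared against a chi-square distribution with one degree of freedom; the function names and log-likelihood values are hypothetical.

```python
import math


def lrt_p_value(lnl_null, lnl_alt):
    """P-value of a likelihood ratio test with one degree of freedom."""
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of chi-square with 1 d.f.: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


def interpret_k(k, p_value, alpha=0.05):
    """Label a focal branch from its K estimate and P-value."""
    if p_value >= alpha or k == 1:
        return "no significant difference"
    return "relaxed" if k < 1 else "intensified"


# Hypothetical log-likelihoods for the null (K = 1) and alternative models.
p = lrt_p_value(lnl_null=-1520.4, lnl_alt=-1515.1)
print(interpret_k(k=1.8, p_value=p))
```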
Though informative about the strength of selection, RELAX does not distinguish between positive and negative selection [31]. To detect directional selection on particular branches, we propose to use PAML as a complementary approach [32]. The branch models of PAML allow the Ka/Ks ratio (ω) to vary among branch groups [32]. Thus, we can have different ω values on the test branch and the reference branches. Comparisons of ω values between the test branch and the reference branches will indicate what type of selection is acting on the branches soon after duplication.
Figure 3. Assessment of the strength of natural selection at four post-duplication time points. (A) Phylogenetic tree of 12 sequenced Drosophila species, with red stars indicating test branches used for each of the four RELAX runs. (B) Intensity of selection K for RELAX analyses with test branches from (A). The lines connect the same pair of duplicates. The horizontal dashed line represents neutrality (K=1). Red circles indicate P<0.05 via the likelihood ratio test. (Taken from Jiang and Assis 2017 [25])
Strategy, Aim 1.3. Examine whether selection is associated with the emergence of new function in gene duplicates. Last, we are interested in whether natural selection is linked to its functional results. We will apply nSL to the phased haplotype data from D. melanogaster [33]. We will rank normalized nSL scores on each chromosome arm and classify them into the top 5% and the rest. Those with top 5% scores represent putative regions that underwent a recent selective sweep [33]. Of the 108 pairs of duplicates, 25 young duplicates are within the top 5%, while 83 are not; 33 old duplicates are within the top 5%, while 75 are not [25]. Next, we will input those lists into GOrilla and look for functional enrichment [34, 35]. Our preliminary results show that only young duplicates with top 5% nSL scores have functional enrichment [25]. All the GO terms are associated with reproduction, and over half of the genes encode seminal fluid proteins [25]. The results suggest that selection is associated with testis-related functions, but only in young duplicates [25].
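The ranking step can be sketched as follows. The Python example assumes normalized nSL scores have already been computed and summarized per gene, and it only shows the per-arm ranking and the 5% cut. The function name, the gene identifiers, and the option to rank on absolute values are assumptions for illustration.

```python
def flag_top_scores(scores_by_arm, fraction=0.05, use_absolute=False):
    """Return, for each chromosome arm, the ids whose scores fall in the top fraction.

    scores_by_arm maps an arm name to a dict of {gene_id: normalized_score}.
    The cut is the rounded product of the gene count and the fraction, so
    arms with very few genes may contribute none.
    """
    flagged = {}
    for arm, scores in scores_by_arm.items():
        if use_absolute:
            ranked = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)
        else:
            ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        n_top = round(len(ranked) * fraction)
        flagged[arm] = {gene_id for gene_id, _ in ranked[:n_top]}
    return flagged


# Hypothetical scores for 20 genes on one arm.
example = {"2L": {f"gene{i}": float(i) for i in range(20)}}
print(flag_top_scores(example))
```

The flagged and unflagged gene sets can then be written out as the target and background lists for an enrichment tool.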
We propose to use two more tests from the Extended Haplotype Homozygosity (EHH) family of statistics [36]. The first is the integrated Haplotype Score (iHS), which has the most power when allele frequency is moderately high (50-80%) [37]. The second is Cross Population Extended Haplotype Homozygosity (XP-EHH), which has the most power when allele frequency is high (>80%) [38]. We will process the test results in a similar manner and look for functional enrichment in the duplicates with signatures of recent positive selection.
Aim 2. To classify plant duplicates into different retention mechanisms and compare their sequence evolutionary rates.
Aim 2.1. Identify duplicates in three species of the grass family.
Aim 2.2. Classify duplicate retention mechanisms based on their expression profile.
Aim 2.3. Compare nonsynonymous protein sequence evolutionary rates among duplicates retained through different mechanisms.
Strategy, Aim 2.1. Identify duplicates in three plant species of the Poaceae family. We will use the gene family finder in the PLAZA 3.0 database [39] to obtain duplicates in Brachypodium distachyon, Oryza sativa and Sorghum bicolor. For each pair, we will require the presence of exactly two copies in the species of interest and one copy in each of the other two species, as well as in two outgroup species (Musa acuminata and Arabidopsis thaliana). Genomes of species that diverged after the duplication event contain an “old” copy that is orthologous to the ancestral single-copy gene and a “young” copy that is the product of the duplication event; we will therefore assign old and young duplicates based on orthology to the ancestral single-copy gene. We will use the orthologous groups (OGs) predicted by OrthoMCL [40], TribeMCL [41] and collinearity [42] from the PLAZA 3.0 database [39] to make the assignment. When all three methods support orthology between one copy and the ancestral single-copy gene, we will assign that copy as the old copy and the other as the young copy. We will apply a majority voting scheme when three methods are available yet conflicting. When only two methods are available and they conflict, our order of priority is OrthoMCL, collinearity and TribeMCL. Assignment of orthologs is usually based on conserved synteny. However, the collinearity method can be problematic because of whole-genome duplications in the evolutionary history of plants. Thus, we prefer OrthoMCL because it has the highest coverage on this dataset and a good tradeoff between false positives and false negatives [43, 44]. With the methods described above, we obtained 72 pairs of duplicates in Brachypodium distachyon, 97 pairs in Oryza sativa, and 120 pairs in Sorghum bicolor. We will use this dataset for all preliminary analyses.
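The assignment rules above (agreement, majority vote, then the priority order for two-way conflicts) can be written as a small Python function. This is a sketch under stated assumptions: each method's call is represented as the label of the copy it supports as orthologous to the ancestral gene, or `None` when the method is unavailable, and a pair covered by a single method simply takes that method's call, which the text does not specify. The function name and labels are hypothetical.

```python
from collections import Counter

# Priority order used to break two-way conflicts, as stated in the text.
PRIORITY = ("OrthoMCL", "collinearity", "TribeMCL")


def assign_old_copy(calls):
    """Pick the old copy from per-method orthology calls.

    calls maps a method name to the supported copy label, or None if the
    method has no call for this pair. Returns the chosen label, or None
    when no method is available.
    """
    available = {m: c for m, c in calls.items() if c is not None}
    if not available:
        return None
    counts = Counter(available.values()).most_common()
    # Unanimous calls, or a strict majority among the available methods.
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return counts[0][0]
    # Tie between two methods: follow the priority order.
    for method in PRIORITY:
        if method in available:
            return available[method]
    return None


print(assign_old_copy({"OrthoMCL": "copy1", "collinearity": "copy2", "TribeMCL": "copy2"}))
```

The copy not returned by the function would be treated as the young duplicate.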
Strategy, Aim 2.2. Classify duplicate retention mechanisms based on their expression profile. We will use the classification method developed by Assis and Bachtrog [14], which assumes that changes in spatial gene expression profiles represent changes in gene function. We choose to use expression profiles as proxies for function for three reasons. First, RNA-seq data are available for the same nine tissues in Brachypodium distachyon, Oryza sativa and Sorghum bicolor [45], enabling direct application of the method. A major advantage of this dataset is that all experiments were performed under identical conditions, which is ideal when comparing data from different species. Second, expression profiles and differences between them are easily quantifiable and interpretable. Third, expression profiles correlate with other measures of gene function [46-50]. We will implement the method with CDROM [51], using the duplicate pairs, single-copy gene list and expression profiles as input. In Brachypodium distachyon, among the 72 pairs of duplicates, 39 are conserved, 22 are neofunctionalized, 8 are specialized and 2 are subfunctionalized. In Oryza sativa, among the 97 pairs of duplicates, 50 are conserved, 12 are neofunctionalized and 3 are specialized. In Sorghum bicolor, among the 120 pairs of duplicates, 79 are conserved, 30 are neofunctionalized, 5 are specialized and 1 is subfunctionalized. Thus, in all three species, conservation is the most common retention mechanism. This is consistent with previous findings that plant gene families are highly conserved over long evolutionary timescales [52].
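To make the distance-based classification concrete, the Python sketch below follows our reading of the decision rules in [14]: Euclidean distances between relative expression profiles of the parent copy, the child copy, and their combined expression versus the ancestral single-copy gene are each compared against one divergence cutoff. It is not the CDROM code, and the cutoff value, function names and profiles are hypothetical. In practice the cutoff would be derived from distances between single-copy orthologs.

```python
import math


def to_relative(profile):
    """Scale a profile so its values sum to 1 (all-zero profiles are returned unchanged)."""
    total = sum(profile)
    return [x / total for x in profile] if total > 0 else list(profile)


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_pair(parent, child, ancestral, cutoff):
    """Classify one duplicate pair from expression across the same tissues."""
    anc = to_relative(ancestral)
    e_parent = euclidean(to_relative(parent), anc)
    e_child = euclidean(to_relative(child), anc)
    combined = [p + c for p, c in zip(parent, child)]
    e_combined = euclidean(to_relative(combined), anc)

    if e_parent <= cutoff and e_child <= cutoff:
        return "conservation"
    if e_parent > cutoff and e_child <= cutoff:
        return "neofunctionalization (parent)"
    if e_parent <= cutoff and e_child > cutoff:
        return "neofunctionalization (child)"
    # Both copies diverged; the combined profile separates the last two classes.
    if e_combined <= cutoff:
        return "subfunctionalization"
    return "specialization"


# Hypothetical three-tissue profiles.
print(classify_pair(parent=[2, 0, 0], child=[0, 2, 0], ancestral=[1, 1, 0], cutoff=0.1))
```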
Strategy, Aim 2.3. Compare nonsynonymous protein sequence evolutionary rates among duplicates retained through different mechanisms. To establish a baseline for the comparison of protein sequence evolutionary rates, we calculated the pairwise Ks values between all orthologs in Brachypodium distachyon and Sorghum bicolor, as well as between all orthologs in Oryza sativa and Sorghum bicolor. Sorghum bicolor diverged from the common ancestor of Brachypodium distachyon and Oryza sativa 45-60 mya [45]. Under the assumption that the number of silent substitutions per site increases approximately linearly with time [53], the median Ks between orthologs in Brachypodium distachyon and Sorghum bicolor should be similar to that between orthologs in Oryza sativa and Sorghum bicolor. As expected, the median Ks between Brachypodium distachyon and Sorghum bicolor (0.89) is close to that between Oryza sativa and Sorghum bicolor (0.91). Thus, the ortholog table and the sequence data in PLAZA 3.0 should be dependable.
Next, we will use PAML [32] to estimate Ka as well as Ka/Ks ratios between duplicates and their ancestral single-copy genes. We further divided the classified duplicates into two categories: those that preserve the ancestral function and those that acquire a novel function. Specifically, for old duplicates, conserved duplicates and those classified as neofunctionalized child are considered to preserve the old function, while specialized duplicates and those classified as neofunctionalized parent are considered to acquire a new function. To obtain preliminary results, we first calculated the Ka values. We found that Brachypodium distachyon duplicates with novel function have significantly higher Ka values than those with old function when ancestral single-copy genes in Sorghum bicolor are used. In the other cases, median Ka values are higher in copies acquiring novel functions than in copies maintaining old functions, yet the differences are not significant. As a next step, we will also compare the Ka/Ks ratios among duplicates retained through different mechanisms. Our method is unorthodox given that the ancestral single-copy gene is not orthologous to the young duplicate. An alternative approach is to estimate the branch length of the young duplicate using another maximum-likelihood-based method, PhyML [54]. We will then be able to compare the distribution of branch lengths across different classes.
Aim 3. To assess the association between selection and its functional result in plant duplicates.
Aim 3.1. Compare functions of duplicates in different classes.
Aim 3.2. Discover tissues that are associated with the emergence of new functions.
Aim 3.3. Determine whether positive selection is linked with novel functions.
Strategy, Aim 3.1. Compare functions of duplicates in different classes. We will examine three more metrics of gene function to determine whether the patterns are consistent with our classifications. In particular, we will compare tissue specificity, number of protein-protein interactions and relative expression levels across tissues among different retention mechanisms [55, 56]. To obtain preliminary results, we compared tissue specificity using τ values (Figure 4) [58]. In Brachypodium distachyon, young duplicates that underwent neofunctionalization or specialization show significantly higher τ values than those that underwent conservation. In Oryza sativa, young duplicates classified as neofunctionalized show significantly higher τ values than those classified as conserved. In Sorghum bicolor, neofunctionalized and specialized young duplicates have significantly larger τ values than conserved young duplicates. Overall, young duplicates classified as acquiring new functions have higher tissue specificity than young duplicates classified as preserving old functions.
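For reference, a tissue specificity index of this kind can be computed in a few lines of Python. The sketch assumes the widely used definition τ = Σ(1 − x_i / x_max) / (n − 1) over n tissues, which we take to be the one intended by [58]. The function name and expression values are hypothetical, and returning 0 for a gene with no expression in any tissue is a convention chosen here.

```python
def tau(expression):
    """Tissue specificity index: 0 for uniform expression, 1 for a single tissue."""
    n = len(expression)
    if n < 2:
        raise ValueError("tau needs expression values from at least two tissues")
    x_max = max(expression)
    if x_max <= 0:
        return 0.0  # convention for genes not expressed in any tissue
    return sum(1 - x / x_max for x in expression) / (n - 1)


# Hypothetical expression values across three tissues.
print(tau([1, 2, 4]))
```

Values near 1 indicate expression concentrated in one tissue, and values near 0 indicate broad expression.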
Figure 4. Tissue specificity τ of conserved, neofunctionalized child, neofunctionalized parent and specialized young duplicates. (A) Brachypodium distachyon, (B) Oryza sativa, (C) Sorghum bicolor. The dashed line represents the median τ value of single-copy genes in the depicted species. * P<0.05, ** P<0.01, *** P<0.001.
Strategy, Aim 3.2. Discover tissues that are associated with the emergence of new functions. Next, we want to investigate what the new functions are and which tissues are associated with their emergence. We will rank the duplicates by tissue specificity and keep the top five candidates in the neofunctionalized and specialized categories. For each candidate, we will look at its function and identify the tissue with the highest absolute expression level. To further assess functional enrichment, we will apply TopGO to the duplicates in the neofunctionalized and specialized categories [57]. A major advantage of TopGO is that it allows custom functional annotation [57]. Thus, we can input the ontology files we obtained from PLAZA 3.0 [39] and use TopGO merely as a statistical tool. We will compare the enriched functions from TopGO with the functions of genes with high tissue specificity and determine whether the two correspond.
Strategy, Aim 3.3. Determine whether positive selection is linked with novel functions. To directly link natural selection and its functional result, we will apply nSL to phased polymorphism data [33] and look for regions that underwent recent positive selection. We will first obtain polymorphism data for all three species from EnsemblPlants [60]. We choose nSL for several reasons. First, it has higher power than other haplotype-based methods under most selection scenarios and parameter values. Second, it is robust to recombination estimation errors because it uses the number of segregating sites as a proxy for distance. Third, it is also robust to variation in demographic factors [33]. We will rank normalized nSL scores for each chromosome and examine the functions of genes with the top 5% nSL scores. We will then compare the functions that have been selected for with the functions that are enriched in young duplicates. Based on our preliminary data from Drosophila duplicates, we hypothesize that young duplicates with evidence of recent positive selection will be enriched in reproduction-related functions. However, the conclusion may differ from one species to another, which would enable comparisons among the three closely related plant species.
References
[1] S. Ohno, Evolution by gene duplication. Springer Science & Business Media, 1970.
[2] M. Lynch and J. S. Conery, “The evolutionary fate and consequences of duplicate genes,” Science, vol. 290, no. 5494, pp. 1151-1155, 2000.
[3] S. D. Ferris and G. S. Whitt, “Evolution of the differential regulation of duplicate genes after polyploidization,” Journal of Molecular Evolution, vol. 12, no. 4, pp. 267-317, 1979.
[4] L. G. Lundin, “Evolution of the vertebrate genome as reflected in paralogous chromosomal regions in man and the house mouse,” Genomics, vol. 16, no. 1, pp. 1-19, 1993.
[5] A. Sidow, “Gen(om)e duplications in the evolution of early vertebrates,” Current Opinion in Genetics & Development, vol. 6, no. 6, pp. 715-722, 1996.
[6] J. F. Brookfield, “Genetic redundancy,” Advances in Genetics, vol. 36, no. C, pp. 137-155, 1997.
[7] J. H. Nadeau and D. Sankoff, “Comparable rates of gene loss and functional divergence after genome duplications early in vertebrate evolution,” Genetics, vol. 147, no. 3, pp. 1259-1266, 1997.
[8] J. H. Postlethwait et al., “Vertebrate genome evolution and the zebrafish gene map,” Nature Genetics, vol. 18, no. 4, pp. 345-349, 1998.
[9] J. Zhang, “Evolution by gene duplication: an update,” Trends in Ecology & Evolution, vol. 18, no. 6, pp. 292-298, 2003.
[10] A. Force, M. Lynch, F. B. Pickett, A. Amores, Y.-l. Yan, and J. Postlethwait, “Preservation of duplicate genes by complementary, degenerative mutations,” Genetics, vol. 151, no. 4, pp. 1531-1545, 1999.
[11] A. Stoltzfus, “On the possibility of constructive neutral evolution,” Journal of Molecular Evolution, vol. 49, no. 2, pp. 169-181, 1999.
[12] X. He and J. Zhang, “Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution,” Genetics, vol. 169, no. 2, pp. 1157-1164, 2005.
[13] S. Rastogi and D. A. Liberles, “Subfunctionalization of duplicated genes as a transition state to neofunctionalization,” BMC Evolutionary Biology, vol. 5, no. 1, p. 28, 2005.
[14] R. Assis and D. Bachtrog, “Neofunctionalization of young duplicate genes in Drosophila,” Proceedings of the National Academy of Sciences, vol. 110, no. 43, pp. 17409-17414, 2013.
[15] J. E. Bowers, B. A. Chapman, J. Rong, and A. H. Paterson, “Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events,” Nature, vol. 422, no. 6930, p. 433, 2003.
[16] D. E. Soltis, C. D. Bell, S. Kim, and P. S. Soltis, “Origin and early evolution of angiosperms,” Annals of the New York Academy of Sciences, vol. 1133, no. 1, pp. 3-25, 2008.
[17] D. E. Soltis et al., “Polyploidy and angiosperm diversification,” American Journal of Botany, vol. 96, no. 1, pp. 336-348, 2009.
[18] H. Tang, J. E. Bowers, X. Wang, R. Ming, M. Alam, and A. H. Paterson, “Synteny and collinearity in plant genomes,” Science, vol. 320, no. 5875, pp. 486-488, 2008.
[19] O. Jaillon et al., “The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla,” Nature, vol. 449, no. 7161, p. 463, 2007.
[20] X. Wang, X. Shi, B. Hao, S. Ge, and J. Luo, “Duplication and DNA segmental loss in the rice genome: implications for diploidization,” New Phytologist, vol. 165, no. 3, pp. 937-946, 2005.
[21] J. Yu et al., “The genomes of Oryza sativa: a history of duplications,” PLoS Biology, vol. 3, no. 2, p. e38, 2005.
[22] A. H. Paterson, J. E. Bowers, D. G. Peterson, J. C. Estill, and B. A. Chapman, “Structure and evolution of cereal genomes,” Current Opinion in Genetics & Development, vol. 13, no. 6, pp. 644-650, 2003.
[23] A. Paterson, J. Bowers, and B. Chapman, “Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 26, pp. 9903-9908, 2004.
[24] A. Paterson and J. Bowers, “The Sorghum bicolor genome and the diversification of grasses,” Nature, vol. 457, no. 7229, pp. 551-556, 2009.
[25] X. Jiang and R. Assis, “Natural selection drives rapid functional evolution of young Drosophila duplicate genes,” Molecular Biology and Evolution, 2017.
[26] R. R. Hudson, M. Kreitman, and M. Aguadé, “A test of neutral molecular evolution based on nucleotide data,” Genetics, vol. 116, no. 1, pp. 153-159, 1987.
[27] C. D. Huber, M. DeGiorgio, I. Hellmann, and R. Nielsen, “Detecting recent selective sweeps while controlling for mutation rate and background selection,” Molecular Ecology, vol. 25, no. 1, pp. 142-156, 2016.
[28] B. Charlesworth and D. Charlesworth, Elements of Evolutionary Genetics. Greenwood Village: Roberts and Company Publishers, 2010.
[29] J. H. McDonald and M. Kreitman, “Adaptive protein evolution at the Adh locus in Drosophila,” Nature, vol. 351, no. 6328, p. 652, 1991.
[30] P. Andolfatto, “Adaptive evolution of non-coding DNA in Drosophila,” Nature, vol. 437, no. 7062, pp. 1149-1152, 2005.
[31] J. O. Wertheim, B. Murrell, M. D. Smith, S. L. K. Pond, and K. Scheffler, “RELAX: detecting relaxed selection in a phylogenetic framework,” Molecular Biology and Evolution, vol. 32, pp. 820-832, 2014.
[32] Z. Yang, “PAML 4: a program package for phylogenetic analysis by maximum likelihood,” Molecular Biology and Evolution, vol. 24, pp. 1568-1591, 2007.
[33] A. Ferrer-Admetlla, M. Liang, T. Korneliussen, and R. Nielsen, “On detecting incomplete soft or hard selective sweeps using haplotype structure,” Molecular Biology and Evolution, vol. 31, no. 5, pp. 1275-1291, 2014.
[34] E. Eden, R. Navon, I. Steinfeld, D. Lipson, and Z. Yakhini, “GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists,” BMC Bioinformatics, vol. 10, no. 1, p. 48, 2009.
[35] E. Eden, D. Lipson, S. Yogev, and Z. Yakhini, “Discovering motifs in ranked lists of DNA sequences,” PLoS Computational Biology, vol. 3, no. 3, p. e39, 2007.
[36] P. C. Sabeti et al., “Detecting recent positive selection in the human genome from haplotype structure,” Nature, vol. 419, no. 6909, pp. 832-837, 2002.
[37] B. F. Voight, S. Kudaravalli, X. Wen, and J. K. Pritchard, “A map of recent positive selection in the human genome,” PLoS Biology, vol. 4, no. 3, p. e72, 2006.
[38] P. C. Sabeti et al., “Genome-wide detection and characterization of positive selection in human populations,” Nature, vol. 449, no. 7164, pp. 913-918, 2007.
[39] S. Proost et al., “PLAZA 3.0: an access point for plant comparative genomics,” Nucleic Acids Research, vol. 43, no. D1, pp. D974-D981, 2014.
[40] L. Li, C. J. Stoeckert, and D. S. Roos, “OrthoMCL: identification of ortholog groups for eukaryotic genomes,” Genome Research, vol. 13, no. 9, pp. 2178-2189, 2003.
[41] A. J. Enright, S. Van Dongen, and C. A. Ouzounis, “An efficient algorithm for large-scale detection of protein families,” Nucleic Acids Research, vol. 30, no. 7, pp. 1575-1584, 2002.
[42] J. Fostier et al., “A greedy, graph-based algorithm for the alignment of multiple homologous gene lists,” Bioinformatics, vol. 27, no. 6, pp. 749-756, 2011.
[43] F. Chen, A. J. Mackey, J. K. Vermunt, and D. S. Roos, “Assessing performance of orthology detection strategies applied to eukaryotic genomes,” PLoS ONE, vol. 2, no. 4, p. e383, 2007.
[44] M. Van Bel et al., “Dissecting plant genomes with the PLAZA comparative genomics platform,” Plant Physiology, article pp.111.189514, 2011.
[45] R. M. Davidson et al., “Comparative transcriptomics of three Poaceae species reveals patterns of gene expression evolution,” The Plant Journal, vol. 71, no. 3, pp. 492-502, 2012.
[46] H. Ge, Z. Liu, G. M. Church, and M. Vidal, “Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae,” Nature Genetics, vol. 29, no. 4, p. 482, 2001.
[47] X. Zhou, M.-C. J. Kao, and W. H. Wong, “Transitive functional annotation by shortest-path analysis of gene expression data,” Proceedings of the National Academy of Sciences, vol. 99, no. 20, pp. 12783-12788, 2002.
[48] N. Bhardwaj and H. Lu, “Correlation between gene expression profiles and protein–protein interactions within and across genomes,” Bioinformatics, vol. 21, no. 11, pp. 2730-2738, 2005.
[49] L. French and P. Pavlidis, “Relationships between gene expression and brain wiring in the adult rodent brain,” PLoS Computational Biology, vol. 7, no. 1, p. e1001049, 2011.
[50] R. Assis and A. S. Kondrashov, “Conserved proteins are fragile,” Molecular Biology and Evolution, vol. 31, no. 2, pp. 419-424, 2014.
[51] B. R. Perry and R. Assis, “CDROM: Classification of duplicate gene retention mechanisms,” BMC Evolutionary Biology, vol. 16, no. 1, p. 82, 2016.
[52] S. A. Rensing et al., “The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants,” Science, vol. 319, no. 5859, pp. 64-69, 2008.
[53] G. Blanc and K. H. Wolfe, “Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes,” The Plant Cell, vol. 16, no. 7, pp. 1667-1678, 2004.
[54] S. Guindon, J.-F. Dufayard, V. Lefort, M. Anisimova, W. Hordijk, and O. Gascuel, “New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0,” Systematic Biology, vol. 59, no. 3, pp. 307-321, 2010.
[55] I. Yanai et al., “Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification,” Bioinformatics, vol. 21, no. 5, pp. 650-659, 2004.
[56] D. Szklarczyk et al., “STRING v10: protein–protein interaction networks, integrated over the tree of life,” Nucleic Acids Research, vol. 43, no. D1, pp. D447-D452, 2014.
[57] A. Alexa and J. Rahnenfuhrer, “topGO: Enrichment Analysis for Gene Ontology,” ed. R package version 2.28.0,