Phylogenetic relationships among seed plants: Persistent questions and the limits of molecular data†
The author thanks S. Renner and one anonymous reviewer for suggestions for improvements of this manuscript.
Abstract
Trees inferred from DNA sequence data provide only limited insight into the phylogeny of seed plants because the living lineages (cycads, Ginkgo, conifers, gnetophytes, and angiosperms) represent fewer than half of the major lineages that have been detected in the fossil record. Nevertheless, phylogenetic trees of living seed plants inferred from sequence data can provide a test of relationships inferred in analyses that include fossils. So far, however, significant uncertainty persists because nucleotide data support several conflicting hypotheses. It is likely that improved sampling of gymnosperm diversity in nucleotide data sets will help alleviate some of the analytical issues encountered in the estimation of seed plant phylogeny, providing a more definitive test of morphological trees. Still, rigorous morphological analyses will be required to answer certain fundamental questions, such as the identity of the angiosperm sister group and the rooting of crown seed plants. Moreover, it will be important to identify approaches for incorporating insights from data that may be accurate but less likely than sequence data to generate results supported by high bootstrap values. How best to weigh evidence and distinguish among hypotheses when some types of data give high support values and others do not remains an important problem.
Living seed plants comprise the cycads, Ginkgo, conifers, gnetophytes (together, extant gymnosperms), and angiosperms. Extinct gymnosperms that cannot be assigned to living groups include hydraspermans, medullosans, peltasperms, glossopterids, Caytonia, Pentoxylon, Callistophyton, corystosperms (all these are often referred to as pteridosperms or seed ferns), Bennettitales (sometimes referred to as cycadeoids), Erdtmanithecales, Cordaitales, Paleozoic and Mesozoic conifers, and ginkgophytes. Seed plant diversity is great enough, and the surviving lines divergent enough, that there have been those who hesitated or were unwilling to include them in a single lineage (e.g., 17; 3). 3 (p. 3) took fellow botanists to task for being “completely satisfied to group together quite unrelated plants” based on the character of the seed alone. He and others placed seed plants in at least three groups that were thought to be linked with different groups of free-sporing plants: angiosperms, cycadophytes (“seed ferns,” cycads, Bennettitales) and coniferophytes (Cordaitales, ginkgos, conifers, with or without gnetophytes). 17 included gnetophytes in coniferophytes, while 3 placed them in a separate group, the Chlamydospermophytes. Discussions of seed plant origins shifted focus after the startling discovery of a connection between Archaeopteris (fragments of fern-like fronds from the Devonian) and Callixylon (permineralized twigs, branches, and trunks with wood that linked them with gymnosperms), leading to the recognition of progymnosperms (e.g., 5, 6, 7). Beck hypothesized a diphyletic origin of seed plants from progymnosperms, arguing that cycadophytes and coniferophytes likely arose from different progymnosperms in the order Aneurophytales (6, 7; 88; see also 9). 80 argued for a monophyletic origin from an aneurophytalean ancestor, with both coniferophytes and cycadophytes being derived from within hydrasperman seed plants.
The question of whether seed plants are monophyletic remains open to this day. It can only be partially tested with sequence data, despite statements by molecular systematists who claim that seed plant monophyly has been clearly confirmed by molecular phylogenetic studies that include both seed and free-sporing plants (e.g., 74). Sequence data could refute monophyly by placing seed plants with different groups of living free-sporing plants, but they are powerless to distinguish between the hypotheses proposed by Beck (6, 7) and 80. To do so requires a matrix of morphological data that includes all possible representatives of the closest relatives of seed plants (progymnosperms), representatives of all seed plant lineages, living and extinct, as well as an ample diversity of lycophytes and ferns. A maximum of three progymnosperms have been included in previous phylogenetic analyses, one of which (Cecropsis) can be scored only for the anatomy and organization of the fertile shoot system (e.g., 81; 52). In one of these studies (81), lycophytes, trimerophytes, equisetalean and filicalean ferns were included in a preliminary analysis from which was inferred a hypothetical ancestor, which was then included to root the seed plant phylogeny. In the other study (52), lycophytes and ferns were not included; a progymnosperm (Tetraxylopteris) was designated as the outgroup. No criticism is intended in these observations. It is difficult to obtain the needed data because fossils are fragmentary or remain uncharacterized and because it is challenging to assess homology of morphological characters in both living and extinct taxa across seed and free-sporing plants.
Relationships within seed plants also remain ambiguous. Morphological analyses have not supported the cycadophyte concept (21; 31; 68; 81; 27, 29, 23; 52). These studies found “seed ferns” to be polyphyletic, consistent with their extreme heterogeneity and the wide range of sophistication in their reproductive structures, and failed to unite cycads and Bennettitales. Coniferophytes also receive little support in results from morphological analyses, although several of the inferred phylogenetic trees include a clade that unites fossil and living conifers with Cordaitales (e.g., 21; 68; 81; 52). DNA sequence data can provide only limited insight into the question. The living lines are almost certainly more closely related to various extinct groups than to each other, particularly in the cases of cycads and angiosperms (e.g., Fig. 1). Nevertheless, trees from sequence data can refute relationships inferred in analyses that include fossils. For example, trimming fossils from the optimal trees inferred in recent morphological analyses (30; 52) would leave the living taxa united as depicted in Fig. 2, with angiosperms nested in gymnosperms, united with gnetophytes, and with cycads sister to all other seed plants. This hypothesis apparently is refuted by analyses of sequence data. Instead, molecular trees differ in a way that highlights three persistent and long-debated phylogenetic questions: What is the sister group of the angiosperms? What is the position of the gnetophytes? What is the rooting of the crown seed plants (spermatophytes sensu 15)?
THE SISTER GROUP OF THE ANGIOSPERMS
In a 1960 speech on the origin of angiosperms, T. M. Harris asked his listeners “to look back, not on a proud record of the success of famous men, but on an unbroken record of failure” (8, p. 1). Writing 16 years later, Beck offered a considerably more optimistic analysis of progress toward understanding angiosperms. Nonetheless, he was writing at a time when the timing of their origin was more controversial than it is today, when the identities of the earliest diverging members were obscure, when the place and habitat of origin were more controversial, when angiosperm monophyly remained to be tested in phylogenetic analyses, and when not all agreed that the angiosperm sister group was to be found among the gymnosperms. Significant advances have been achieved on all of these fronts (21; 31; 62, 10; 71; 75; 44; 38, 12; 60), due in large part to the advent of molecular systematics and the development of computational approaches and resources. The question that persists concerns the relationship of angiosperms to other seed plants.
The tree in Fig. 2 is compatible with the anthophyte concept as articulated by 32 for a clade of taxa with aggregations of sporophylls that were interpreted as flower-like. The clade included angiosperms, gnetophytes, Bennettitales, and Pentoxylon (e.g., 21; 31, 33; 68; 81), or in an expanded version, it also included glossopterids and Caytonia in a clade referred to as glossophytes (27, 23; 52). Nearly all analyses of DNA sequence data contradict the concept of anthophytes or glossophytes by failing to resolve gnetophytes either as paraphyletic or as sister to the angiosperms. The exceptions are maximum parsimony (MP) or neighbor-joining (NJ) trees inferred from nuclear ribosomal DNA (rDNA; 86; 83; but see 19, and 12, fig. 2) or RNA (rRNA; 48), and in one case, from rbcL (82). These exceptional trees unite gnetophytes and angiosperms, but without even moderate bootstrap support. Rather, a highly supported topology from analyses of sequence data (10; 18; 67; 46; 85; 12) is shown in Fig. 3A. Not only are the gnetophytes nested within conifers (discussed next), but angiosperms and extant gymnosperms are each resolved as monophyletic, suggesting that angiosperms have no close relatives among living gymnosperms.
THE POSITION OF GNETOPHYTES
“The Gnetales, like Minerva, seem to have sprung, full armed, from the head of Jove.” (17, p. 433)
Given such a viewpoint, perhaps Chamberlain would not have been surprised when the results from analyses of sequence data suggested that gnetophytes had sprung from conifers (Fig. 3A; 10; 18; 67; 46). However, amid a community that had largely embraced anthophytes, the results were surprising (e.g., 70). Even botanists who were more familiar with characters that suggested a link with conifers or who argued that putative synapomorphies for angiosperms and gnetophytes were homoplasies (e.g., 57) greeted the idea that gnetophytes had sprung from within conifers with caution (e.g., 25). Conifer monophyly is apparently supported by a number of synapomorphies, including resin canals, tiered proembryos, single copy condition of the plastid inverted repeat, and the ovulate cone scale (17; 21; 49; 78; 25). Nevertheless, trees from sequence data have consistently united gnetophytes with Pinaceae in a highly supported “gnepine” clade and placed gnepines as sister to a clade of the other conifer families (Cupressophyta sensu 15). There are notable, well-supported, exceptions, and in this sense, the results from sequence analyses extend rather than resolve the puzzle surrounding the position of the gnetophytes that has persisted through the years (1, 2; 91; 90; 17; 4; 36; 68; 27). One of these is depicted in Fig. 3B, which resolves gnetophytes as sister to all other seed plants. This topology is well supported in certain analyses, mostly of concatenated data sets. However, the topology is rarely supported in maximum likelihood (ML) analyses or in parsimony analyses that exclude faster-evolving sites (e.g., 83; 12; 47; for exceptions, see 13, and 77), and it may possibly result from error in reconstruction (84; 14). While the gnepine hypothesis remains controversial, a link between conifers, gnetophytes, and Ginkgo was implicit in Chamberlain's (1935) placement of gnetophytes in coniferophytes (although not without reservation [17, p. 433]). 
Conifers and gnetophytes share linear leaves, reduced sporophylls, and circular bordered pits with tori in the protoxylem, and together with Ginkgo, they uniquely share metaxylem that lacks scalariform pitting (4; 9; 16; 27). Thus, a clade in which monophyletic conifers are sister to monophyletic gnetophytes (referred to as a “gnetifer” clade) apparently would be consistent with other lines of evidence. However, gnetifer trees have rarely been inferred in molecular analyses (exceptions are in 19; 82; 47; 13).
THE ROOTING OF THE CROWN SEED PLANTS
“A position of the root between the cycad and Ginkgo nodes might be very difficult to detect, because this branch is so short compared to the long branches to angiosperms and Gnetales.” (25, p. R108)
Both angiosperms and gnetophytes are nested well within trees that include living and fossil taxa (Fig. 1), whereas the best-supported rootings of molecular trees are along the branches to angiosperms (Fig. 3A) or gnetophytes (Fig. 3B). These are two of the longest (if not the longest) branches in most molecular trees (see 43, pp. 216–227 in this issue); conversely, the branch between the cycad and Ginkgo nodes is very short in trees that do unite these branches in a clade. The concern voiced by Donoghue and Doyle in the opening quote is that a long branch from the outgroup may be unlikely to attach to such a short branch. Consistent with this, there is evidence that the rooting along the gnetophyte branch may result from long-branch attraction (84; 14). Both trees imply that the first dichotomy in the seed plant phylogeny splits angiosperms (or gnetophytes) from all other extant seed plants, which is inconsistent with currently available stratigraphic evidence (28).
ISSUES WITH DNA SEQUENCE DATA
It would be an oversimplification to say that these questions remain unresolved as a result of conflict between molecular and morphological data; there is ambiguity in both types of data. On the one hand, 30 found that morphological trees placing gnetophytes within conifers (although not with Pinaceae) are just one step longer than the most parsimonious trees, which are anthophyte trees, but neither of these results is robust. On the other hand, a single clear signal has not emerged from molecular studies. Although there have been several efforts to sample multiple loci and/or concatenate data from previously published seed plant studies to increase the number of characters and loci analyzed (e.g., 10; 18; 67; 46; 83; 85; 76; 12; 47), consensus remains elusive. Exploration of some of these data sets has identified several factors that may result in erroneous trees, including high taxonomic sampling error (due to extinctions), saturation at nucleotide sites (due to the age of divergence among major clades), high rate variation across sites and across clades, conflicting signal within and among genetic loci that are used as phylogenetic markers (e.g., 18; 84; 59; 83; 85; 12, 13; 47), and error and bias in phylogenetic reconstruction (84; 14). One effective approach for reducing conflicting signal in single and concatenated data sets is to bin sites based on estimated rates of evolution and to experiment with removing different rate classes (12; see also 79). For example, 12 found that removal of fast-evolving positions from a 13-locus concatenated seed plant data set resulted in convergence of both MP and ML on a gnepine tree, an apparent resolution of the conflict between results from parsimony analyses of all sites, which favored gnetophytes as sister to all seed plants, and likelihood analyses of the same, which favored gnepine trees. 
However, this does not mean that the gnepine tree is correct, only that one signal is enhanced and the other is dampened when rapidly evolving sites are removed. Both signals cannot be correct, but both may be erroneous. Intuitively, removing noisy sites that may hinder resolution of the question of interest makes sense, but because there is evidence of bias in both slowly and rapidly evolving sites (14), reducing noise does not necessarily reduce error. An additional, potentially confounding factor is heterotachy, or shifts in site-specific rates of evolution across time. Heterotachous sites are likely to exist in seed plant data sets and their presence and effects should be explored.
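The rate-class experiment described above can be illustrated with a minimal sketch. It assumes a toy alignment and uses the number of distinct bases in a column as a crude stand-in for a model-based rate estimate; the function names, the toy sequences, and this variability score are hypothetical and are not the procedure used in the cited studies.

```python
from typing import Dict, List


def site_variability(alignment: Dict[str, str]) -> List[int]:
    """Score each column as (number of distinct unambiguous bases - 1).

    This is only a rough proxy for evolutionary rate; published analyses
    estimate site rates under an explicit substitution model.
    """
    seqs = list(alignment.values())
    scores = []
    for i in range(len(seqs[0])):
        states = {s[i] for s in seqs if s[i] in "ACGT"}
        scores.append(max(len(states) - 1, 0))
    return scores


def exclude_fast_sites(alignment: Dict[str, str], max_class: int) -> Dict[str, str]:
    """Keep only the columns whose variability class is <= max_class."""
    scores = site_variability(alignment)
    keep = [i for i, score in enumerate(scores) if score <= max_class]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


# Hypothetical toy alignment (not real sequence data).
toy = {
    "Cycas":  "ACGTACGA",
    "Ginkgo": "ACGTACCT",
    "Pinus":  "ACGAATGC",
    "Gnetum": "ACTCATTG",
}

# Successively drop the fastest class, as in the experiments described above.
for max_class in (3, 2, 1, 0):
    reduced = exclude_fast_sites(toy, max_class)
    print(max_class, len(reduced["Cycas"]), reduced)
```

Each reduced alignment would then be passed to a separate tree search, so that shifts in topology and support can be tracked as rate classes are removed. As the text cautions, convergence on one topology under this filtering shows which signal is enhanced, not which tree is correct.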
TAXONOMIC SAMPLING
The best analytical approaches yield limited insight when too few taxa are sampled. Analyses of sequence data from seed plants have included very few extant gymnosperms, fewer than half of the genera and 6% of the species. Most of the highly cited seed plant studies have included 10, 11, 19, or 21 of ∼1100 gymnosperms in 85 genera (10; 18; 46; 83; 85; 76; 12). The negative effects of the factors just outlined on phylogenetic accuracy are likely to be exacerbated when taxonomic sampling is so limited, even when using appropriate models of nucleotide evolution, removing certain classes of sites, and using analytical approaches that are more robust to error. Increasing taxa can increase accuracy (e.g., 50, 28; 45; 87; 72) and the efficiency with which a method converges on an accurate tree (e.g., 54). Just one significant effort to increase taxonomic sampling has been made in a study that included 69 gymnosperms (83). The fact that Bayesian or ML analysis of their data yields a highly supported gnetifer tree is intriguing (13; S. Mathews, unpublished data). However, it is unclear whether this might result from increased taxonomic sampling, from the choice of loci (13), or both. The result may be misleading, or it may be that the set of loci analyzed by 83 serendipitously captured the signal of the species phylogeny.
Analyses of morphological data also have included relatively few taxa. Because the fossil record suggests that there are many distinctive lineages that cannot be assigned to modern groups, the pattern of seed plant evolution cannot be determined without analyses of morphological evidence. However, the detailed morphological investigations of living taxa that are required to properly interpret fossil material are often lacking (22). A further challenge to interpreting the fossils is the difficulty and slow pace of reconstructing entire fossil plants from dispersed fossil organs. Thus, while whole-plant reconstructions are the standard for which we should strive, it also will be important to experiment with the inclusion of incomplete fossils because these may increase phylogenetic accuracy (92, 72).
CHARACTER SAMPLING
The increasing ease with which nucleotide characters can be accumulated means that it is particularly important to grapple with the question of how best to accumulate them and with the question of how best to analyze concatenated data sets. Although adding characters may increase phylogenetic accuracy (e.g., 45), both theoretical and empirical studies have shown that it does not always do so and that, in fact, adding characters in some cases increases support for an erroneous tree (e.g., 40; 55; 87; 72; 64; 79). In at least some cases, gene trees will not match the species tree, and for some combinations of branch lengths in the species trees, incongruent trees may actually be more likely than congruent gene trees (24; 56). In these cases, the most frequently observed gene tree in combined data will be an incorrect estimate of the species tree (24). Thus, when data are concatenated from many loci, it is important to explore the different methods available for analyzing these data sets, particularly those appropriate for highly heterogeneous data sets (e.g., 69; 11; 37; 58).
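To give a sense of how often gene trees can disagree with the species tree, the following sketch evaluates the standard three-taxon result from coalescent theory, in which the gene tree matching the species tree has probability 1 - (2/3)e^(-T) and each of the two discordant trees has probability (1/3)e^(-T), where T is the internal branch length in coalescent units. The function name and the example values of T are illustrative assumptions. With only three taxa the congruent tree is never the least probable; the situation in which an incongruent gene tree becomes the most probable one requires four or more taxa and is not reproduced here.

```python
import math
from typing import Dict


def three_taxon_gene_tree_probs(t: float) -> Dict[str, float]:
    """Gene tree probabilities for a rooted three-taxon species tree.

    t is the length of the single internal branch in coalescent units.
    """
    discordant_each = math.exp(-t) / 3.0
    return {
        "congruent": 1.0 - 2.0 * discordant_each,
        "discordant_each": discordant_each,
    }


# Illustrative branch lengths: short internal branches give frequent discordance.
for t in (0.0, 0.1, 0.5, 1.0, 3.0):
    probs = three_taxon_gene_tree_probs(t)
    print(f"T={t:.1f}  congruent={probs['congruent']:.3f}  "
          f"each discordant={probs['discordant_each']:.3f}")
```

As T approaches zero, all three gene tree topologies approach equal frequency, which is one reason short internal branches are difficult to resolve even with many loci.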
A contrasting problem exists with respect to morphological characters. Relatively few structural characters have been identified that can be scored for morphological analyses. Here both effort and new techniques (e.g., 41) are needed. One concern surrounding the paucity of morphological characters that can be included in a phylogenetic matrix is that if added to a matrix of nucleotide characters, their signal would be swamped. With this in mind, it would be interesting to test the results of combining morphological characters with subsets of a nucleotide matrix. For example, in the case of seed plant analyses, where the faster evolving sites are likely to be saturated and may have little information regarding deeper divergences in the tree, one might combine just the slowest evolving sites with the morphological characters.
SOME RECENT STUDIES
One of the largest character sets to date has been assembled by Rai and Graham (77) to address both conifer and higher order seed plant relationships. Their study uses a strategy of sampling 17 noncontiguous and functionally diverse regions of the plastid genome, in total comprising approximately 14.1 kb unaligned, about one ninth of the genome. Two trees have been inferred from these data, sampled from 38 species (28 of which are gymnosperms). The parsimony tree is identical to the tree in Fig. 3B, with gnetophytes sister to all other seed plants, but the topology of the ML tree is novel: gnetophytes are sister to all other seed plants, but conifers are sister to a clade in which Ginkgo is sister to cycads + angiosperms. If the rooting of this tree is wrong and if it were to be rerooted between Ginkgo and cycads, it would give a coniferophyte clade (sensu 17) on the one hand and a clade of cycads and angiosperms on the other. Substantially larger plastid data sets were analyzed by 94 and 65, sampling 56 and 57 plastid genes, respectively. However, each study included only four gymnosperm genera (Cycas, Ginkgo, Pinus, and Gnetum in 94; Cycas, Ginkgo, Pinus, and Welwitschia in 65) and so cannot be used to test the relationships of conifers and gnetophytes. As in previously published studies, either Gnetum or Welwitschia and Pinus are sister taxa (e.g., Fig. 3A; all trees in 94; ML and Bayesian trees in 65), or Gnetum or Welwitschia are sister to all other seed plants (e.g., Fig. 3B; MP and NJ trees in 65). An alternative approach for assembling a large character set is to sample EST databases, which has the added value of sampling nuclear genes. A recent analysis of seed plant EST data from Cycas, Ginkgo, Pinus, and Gnetum (23) placed Gnetum and Pinus in a well-supported clade.
The utility of ESTs may be best exemplified in a recent study in which a combination of newly generated and published EST data were analyzed to resolve multiple long standing phylogenetic questions in animal phylogeny (35). What may have been a key in the apparent success of the study was the strategic accumulation of new EST data to fill in critical taxonomic gaps.
Supermatrices are an alternative to phylogenomic approaches that use orthologous genes from whole genome or EST sequences of a relatively small number of taxa. Supermatrices assembled from data in GenBank take advantage of the large number of sequences deposited there from phylogenetic and population studies. Due to very heterogenous sampling (few taxa represented by many genes, many taxa represented by few genes), these supermatrices may have sequences from many more taxa, but will also have a high percentage of missing data (e.g., 34; 66). More than 700 gymnosperms are represented in GenBank by at least one sequence and approximately 680 were included in a supermatrix assembled by Burleigh and Mathews (unpublished data). The matrix has 88 815 sites, but 95.4% of the data cells are empty. Relationships among the major seed plant clades are highly supported in trees inferred from this sparse supermatrix, and gnetophytes are united not with Pinaceae but with cupressophytes (all conifer families but Pinaceae). This is true of both the ML and MP bootstrap trees, except for the MP trees that include outgroup sequences, in which case, gnetophytes are sister to all other seed plants (J. G. Burleigh and S. Mathews, unpublished data). However, analyses of a denser matrix (taxa trimmed to include only those with a minimum of 10 000 nucleotides of data in the matrix, leaving 38 gymnosperms, 12 angiosperms, and 4 outgroups) yield gnepine trees, except again in the case where parsimony is used to analyze the matrix that includes outgroup data, which yields gnetophytes as sister to all other seed plants. Overall, these data thus reduce confidence in gnepine trees, but provide additional support for a link between conifers and gnetophytes.
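The two supermatrix operations mentioned above, measuring the proportion of empty cells and trimming taxa that fall below a minimum amount of data, can be sketched as follows. This is a toy illustration under stated assumptions: the matrix, the gap symbols ("-" and "?"), the threshold, and the function names are hypothetical and do not reproduce the unpublished analyses.

```python
from typing import Dict

MISSING = "-?"  # assumed symbols for empty cells in the matrix


def data_cells(sequence: str) -> int:
    """Count the cells in one row that actually contain a nucleotide."""
    return sum(1 for base in sequence if base not in MISSING)


def missing_fraction(matrix: Dict[str, str]) -> float:
    """Fraction of all cells in the taxon-by-site matrix that are empty."""
    total = sum(len(seq) for seq in matrix.values())
    filled = sum(data_cells(seq) for seq in matrix.values())
    return (total - filled) / total


def trim_taxa(matrix: Dict[str, str], min_nucleotides: int) -> Dict[str, str]:
    """Keep only taxa with at least min_nucleotides of data in the matrix."""
    return {taxon: seq for taxon, seq in matrix.items()
            if data_cells(seq) >= min_nucleotides}


# Hypothetical toy supermatrix: few taxa with many sites, many with few.
sparse = {
    "Pinus":   "ACGTACGTAC",
    "Gnetum":  "ACGT??????",
    "Ephedra": "------GTAC",
    "Cycas":   "ACGTAC----",
}

dense = trim_taxa(sparse, min_nucleotides=5)
print(f"sparse: {len(sparse)} taxa, {missing_fraction(sparse):.0%} empty")
print(f"dense:  {len(dense)} taxa, {missing_fraction(dense):.0%} empty")
```

The trade-off is visible even at this scale: raising the threshold lowers the proportion of missing data but removes taxa, which is the same tension between taxonomic sampling and matrix density discussed in the text.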
Duplicate gene data sets allow the inference of rooted species trees without the inclusion of sequences from outgroups (42; 53; 26; 62, 10). This may be particularly worth exploring in analyses of seed plant molecular data because free-sporing and seed plants last shared a common ancestor up to 380 million years ago (73), and because all the basal seed plant lineages are extinct, making it hard to employ the strategy of adding taxa to break up the very long branch from free-sporing plants to extant seed plants. Some preliminary results from analyses of a duplicate phytochrome gene data set from seed plants (S. Mathews and M. J. Donoghue, unpublished data) are worth commenting on here because they indicate a level of uncertainty in the rooting of seed plant phylogenies inferred from sequence data that has not been suggested by other studies. These analyses focus on three phytochrome genes, PHYN/A, PHYO/C, and PHYP/B, which are related as depicted in Fig. 4A. The data sets are incomplete, and I highlight here just two patterns observed in the PHYN/A clade, where the data are most complete. One question being addressed in these analyses is whether different topologies are inferred when sites are successively excluded from searches based on their rate class category, beginning with removal of the fastest sites and ending with inclusion of only the slowest. In particular, what do topologies inferred from the sites estimated to be evolving most slowly suggest about the rooting of the seed plant phylogeny and about the position of the gnetophytes? A rationale for this approach is the expectation that at least some rapidly evolving sites may be essentially randomized with respect to deep divergences (e.g., 89). Saturated sites will contribute to phylogenetic accuracy in many cases (95), but as noted by 12, sites in different rate classes may favor different topologies. This appears to be the case where the placement of the root is concerned. 
In analyses that differed with respect to which sites were included based on their rate class assignment, two topologies were recovered, one that has a gnepine clade and that places angiosperms as sister to a gymnosperm clade (Fig. 4B) and one that is novel, uniting cycads and angiosperms in a clade that is sister to the remaining gymnosperms (Fig. 4C). The relationship between topologies and the set of rate classes included in the analysis is complex, but generally, as faster evolving sites are successively excluded, ML bootstrap support for cycads being sister to the remaining gymnosperms tends to drop while support for a clade of cycads and angiosperms increases. In contrast, support for the gnepine clade is remarkably consistent across the analyses, and even when just sites in the four most slowly evolving rate classes are analyzed, the clade receives 100% maximum likelihood bootstrap support. The gnepine result is not unexpected, but the nonmonophyly of extant gymnosperms in gnepine trees is surprising given the support for this split that is seen in other analyses (10; 18; 67; 46; 85; 12).
CONCLUDING REMARKS
Significant uncertainty persists in seed plant phylogenies inferred from both molecular and morphological data. Analyses of supermatrices (J. G. Burleigh and S. Mathews, unpublished data) and plastid genome data sets (20) bring a new twist to the question of the position of gnetophytes, maintaining a link with conifers but placing them sister to cupressophytes. This adds to the number of published DNA sequence data sets that have yielded highly supported but conflicting trees, all of which cannot be correct. To some extent, analytical issues encountered in the estimation of seed plant phylogeny may arise from the fact that given the nature of the problem, only limited insight is gained from data sets with few taxa and many characters. This can be addressed by sampling sequence data from more taxa, particularly from extant gymnosperms, so that living seed plant diversity is better represented in nucleotide data sets. Still, our best efforts to sample extant taxa more adequately for sequence data will leave fundamental questions unanswered. Perhaps chief among these, and most relevant to this volume, is the identity of the angiosperm sister group. Resolution of this question, as well as a general understanding of seed plant evolution, will not be obtained without rigorous morphological analyses, and therein lies a challenge. This will require that we identify approaches for incorporating our insights from data that may be accurate but perhaps less likely than sequence data to generate results supported by high bootstrap values. High bootstrap values give us confidence in the groups we are trying to delineate. However, the knowledge that erroneous clades can be highly supported should temper our thinking, especially in cases where other lines of evidence are contradictory, even if not well supported. It is possible that our tendency to prefer the hypothesis with high support values, and to be uncomfortable with uncertainty, may at least sometimes lead us astray. 
How best to weigh evidence and distinguish among hypotheses when some types of data are likely to give high support values and others are not remains an important problem in plant systematics.