Phylogeny and multiple independent whole‐genome duplication events in the Brassicales

Premise Whole‐genome duplications (WGDs) are prevalent throughout the evolutionary history of plants. For example, dozens of WGDs have been phylogenetically localized across the order Brassicales, specifically, within the family Brassicaceae. A WGD event has also been identified in the Cleomaceae, the sister family to Brassicaceae, yet its placement, as well as that of WGDs in other families in the order, remains unclear. Methods Phylo‐transcriptomic data were generated and used to infer a nuclear phylogeny for 74 Brassicales taxa. Genome survey sequencing was also performed on 66 of those taxa to infer a chloroplast phylogeny. These phylogenies were used to assess and confirm relationships among the major families of the Brassicales and within Brassicaceae. Multiple WGD inference methods were then used to assess the placement of WGDs on the nuclear phylogeny. Results Well‐supported chloroplast and nuclear phylogenies for the Brassicales and the putative placement of the Cleomaceae‐specific WGD event Th‐ɑ are presented. This work also provides evidence for previously hypothesized WGDs, including a well‐supported event shared by at least two members of the Resedaceae family, and a possible event within the Capparaceae. Conclusions Phylogenetics and the placement of WGDs within highly polyploid lineages continues to be a major challenge. This study adds to the conversation on WGD inference difficulties by demonstrating that sampling is especially important for WGD identification and phylogenetic placement. Given its economic importance and genomic resources, the Brassicales continues to be an ideal group for assessing WGD inference methods.


INTRODUCTION
The Brassicales are an economically important order of flowering plants, home to many crop species such as kale, broccoli, cabbage, cauliflower, papaya, capers, and canola as well as several  Figure 1; APG IV 2016). Together, the order is dated around 103 mya, with extant species contributing to 2.2% of the total core eudicot diversity (Magallon et al. 1999;Cardinal-McTeague et al. 2016). Previous research has identified multiple whole-genome duplication (WGD) events across the order using a variety of comparative methods, lineages are recovered as supported by previous literature (Huang et al. 2016;Nikolov et al. 2019) with Aethionema arabicum as sister to the rest of the family, followed by Lineage III sister to [Lineage I + Clade C + Lineage II and expanded Lineage II] and Lineage I sister to [Clade C + Lineage II and expanded Lineage II]. Within Cleomaceae, the relationships were also mostly congruent with previous nuclear phylogenies (Patchell et al. 2014;Cardinal-McTeague et al. 2016) with the same exception of the placement of Polanisia in relation to the other clades of Cleomaceae as discussed above. Finally, our limited sampling of the Capparaceae again limits our ability to say much about the relationships within the family, however to date there is no phylogeny for the family based solely on nuclear data.

Known WGD Events Not Recovered with Strong Support When Sampling Across the Brassicales
Two of the most popular methods used to detect WGD include phylogenomics -using individual gene tree topologies, gene counts, and a known species tree -and Ks plots, which allow for the identification of signatures left behind in paralogs after WGD. In order to provide multiple lines of evidence for novel WGD events. We use a combination of these two approaches to test hypotheses of proposed WGD across the Brassicales. Using PUG (github.com/mrmckain/PUG), a phylogenomic WGD estimation method, resulted in the recovery of some known events with high support (e.g. At-ɑ and At-β), yet failed to produce strong support for other known events such as the Brassiceae triplication event when including all taxa (Figure 1). We note that PUG does indicate that there are 65 unique gene duplications that match that node when considering gene trees with 80% bootstrap support. However, when compared to other known events (At-ɑ and Atβ) with counts over 300 and 150 respectively, this is surprisingly low. Therefore, to increase the number of orthogroups used to infer species trees, as well as increase the number of gene trees to query putative paralogs against, we further broke down analyses to the familial level. By running the Brassicaceae, Capparaceae, Cleomaceae, and [Resedaceae + Bataceae + Moriagaceae + Caricaceae] families separately, we are able to improve WGD detection of known events.

Recovery of Known WGD Events in the Brassicaceae
Analysis of just the Brassicaceae family identifies not only At-ɑ at the base of the family, but also successfully identifies the Brassiceae triplication event (Lysak et al. 2005; Figure 2). We also recover up to five additional neopolyploid events between; 1) Chorispora tenella and  Figure 5). However, for the neopolyploid events within the 12 evidence for Th-ɑ (Th-ɑ H3; Figure 4B). This result is perplexing, and could indicate that the data from A. viscosa is either of poor quality or the genome itself has lost a large enough fraction of the duplicates that the signal for this event is not detected. To further test for the placement of Th-ɑ we expanded our comparisons to include the ortholog divergence of Coalisina angustifolia and T.
hassleriana as well as the divergence between C. violacea and T. hassleriana to test if the proposed independent WGD events between the two clades may in fact be a single event (Th-ɑ H4; Figure   4C). We surprisingly recover both of these divergences to be of about the same age as Th-ɑ, which would therefore have us conclude that Th-ɑ is shared across this whole clade, and is not two separate events as illustrated in Figure 3. Comparison of ortholog divergence to Ks peaks for the two other identified WGD events using phylogenomics suggest that there is no other WGD event in the Cleomaceae (Supp. Figure 6A & 6B).

Conflicting Evidence for WGD in the Capparaceae
In agreement with Lysak (2018), PUG recovers evidence for an independent WGD event in the Capparaceae, shared between a species of Capparis and another species of Capparaceae included in our analyses ( Figure 5A). This event is also supported by Ks plots using FastKs, but not DupPipe, with a peak centered at Ks ~ 0.3 ( Figure 5A). Ortholog divergences between members of the Capparaceae also show conflicting patterns. When comparing Ks values of Boscia sp, Capparis fascicularis, Capparaceae sp, and Cadaba natalensis from DupPipe to the ortholog divergence time between Boscia sp and Capparis fascicularis we find that the divergence between these two species occurred before the possible WGD event, agreeing with the PUG analysis.
However, all four taxa share a peak in their Ks value, although their Ks plots from both analyses are not in agreement, providing conflicting results for the identification of this event. The divergences tested between Boscia sp and Cadaba natalensis as well as between Capparis fascicularis and Cadaba natalensis also occurred before the proposed event. However, the divergence between Capparis fascicularis and the misidentified species of Capparaceae, seems to have occurred at the same time as peak in Ks values (Supp. Figure 7A).

Novel WGD Event in the Resedaceae
When combining the Resedaceae (Ochradenus barcardis and Reseda odorata), Bataceae, Moringaceae, and Caricaceae families together, we excitedly find strong evidence for a 13 Resedaceae specific WGD event in all three analyses with Ks plots indicating a peak ~ 0.4 ( Figure   5B). Ortholog divergences seem to additionally support the proposal of this novel WGD between the samples of Resedaceae. Both samples (Reseda odorata and Ochradenus barcardis) share a Ks peak around ~ 0.4 which occurs before the divergence between these two samples and after the divergence between Resedaceae from Batis maritima (Supp. Figure 7B). In addition, we recover evidence for At-β using both PUG and DupPipe (Ks ~ 1.7; Figure 5B).

DISCUSSION
Studies of the relationships within the Brassicales have either included many taxa but few genes (Hall et al. 2004;Hall 2008;Cardinal-McTeague et al. 2016), a few taxa and few genes (Rodman et al. 1998) or few taxa and many genes (Edger et al. 2015;Edger et al. 2018). In this study, we aim to find a balance of taxa and genes to present a well-supported chloroplast and nuclear phylogeny for the Brassicales, these both being in overall agreement with previous studies at the interfamilial and intrafamilial level (Edger et al. 2015;Cardinal-McTeague et al. 2016;Huang et al. 2016;Guo et al. 2017;Edger et al. 2018). Using the nuclear phylogeny, we highlight the difficulty in placing Th-ɑ and identify possible novel events in the Cleomaceae, Capparaceae, and Resedaceae.

Incongruences Between the Chloroplast and Nuclear Trees Across the Brassicaceae
Although relationships were congruent with previous analyses, we highlight the incongruence between the nuclear and chloroplast trees among the major lineages of the Brassicaceae, a welldocumented pattern between these genomes (Beilstein et al. 2008;Huang et al. 2016;Nikolov et al. 2019; summarized in Figure 6) analyses, we recover the congruent relationships, leaving us to conclude that the trees from these different genomes may never agree, potentially due to a complicated evolutionary history, such as ancient hybridization or introgression (Forsythe et al. 2018). These differences are important to consider when using the phylogeny to assess character evolution and divergence dating, as node ordering changes depending on which tree is used.

Putative Placement of Th-ɑ in the Cleomaceae
Previous studies have identified a WGD event unique to Cleomaceae (Th-ɑ) using a variety of sources from syntenic regions to ESTs (Schranz & Mitchell-Olds 2006;Barker et al. 2009;reviewed in Bayat et al. 2018). However, the placement of Th-ɑ within the Cleomaceae had yet to be confirmed. By using Ks plots to assess for signatures left behind in paralogs after WGD, phylogenetics using individual gene tree topologies, gene counts, and a known species tree, as well as ortholog divergences, we putatively place Th-ɑ as shared between Tarenaya hassleriana, Cleomaceae sp, Melidiscus giganteus, Gynandropsis gynandra, and Sieruela monophylla, as well as A. viscosa, Coalisina angustifolia, and Coalisina paradoxa (Th-ɑ H4; Figure 3). We have decided to include these last three species due to both the evidence from ortholog divergences and signatures in Ks plots that strongly suggest this event is shared with all species (Figures 3 & 4).
However, there is a chance that two separate events occurred independently and that A. viscosa does indeed lack a WGD. Ks plots of all samples listed above, other than A. viscosa, identify a peak hovering over Ks ~ 0.4, agreeing with previous studies (Barker et al. 2009;van den Bergh et al. 2014) which first identified this peak in Tarenaya hassleriana followed by Gynandropsis gynandra. PUG, however, supports two separate events. One possibility for the difficulty in placing this event, may be due to the short branch lengths found within this clade (Figure 1) or that this event, like others in the order, is actually a triplication event and will be difficult to tease apart.

Multiple WGD Events in the Cleomaceae?
In addition to identifying Th-ɑ, we also report two possible additional events in the Cleomaceae, both of which are identified in the Brassicales and Cleomaceae specific analyses, but with much more support in the analysis of just Cleomaceae species. These WGD events are placed at common ancestors shared between: 1) Cleome amblyocarpa, C. africana, and C. arabica and 2) four species 15 of Polanisia (Figure 3). Ks plots provide contrasting support for these events. Ks plots from FASTKs of C. africana and C. arabica show a small peak of duplicates hovering at Ks ~ 0.3, yet when the same data is run through DupPipe, there is no evidence of a WGD event. The Ks plots of C. amblyocarpa also provide conflicting evidence for this event. The Ks plot from FASTKs looks much more similar to that from C. violacea, which we know from sequencing its genome that it does not show evidence of any recent WGD event (Emery et al. 2018). While, the Ks plot from DupPipe of C. amblyocarpa complicates the story, with no clear peak identified. The second event, which is shared between four species of Polanisia is supported by a large number of unique gene duplications (2,200) using PUG, but is not supported by Ks plots from either FASTKs or DupPipe. The resulting plots look, again, more similar to C. violacea. Analyses of ortholog divergence between C. amblyocarpa, C. africana, and C. arabica also lack support for a WGD ( Figure 6A) as do analyses between the four species of Polanisia ( Figure 6B). To further test how WGD and C4 photosynthesis has evolved in this family, we suggest a study primarily focusing on Cleomaceae sampling. We know that C4 photosynthesis has evolved at least three times independently in Cleomaceae, specifically in (of the taxa sampled) Gynandropsis gynandra and of these samples share this event. Therefore, it will be interesting to investigate what the role of polyploidy, and more specifically Th-ɑ, is in character evolution in this group.

Novel WGD Events in the Capparaceae and Resedaceae
Although just two samples are included in our analysis, we recover some support for an event between at least one species of Capparis and a misidentified species of Capparaceae ( Figure 5A).
Due to this possible identification error, inconclusiveness from Ks plots, and no support in comparison between ortholog divergence and Ks peaks, this event, although supported by many unique duplicates in the PUG analysis, should be interpreted carefully. However, Lysak (2018), using chromosome counts, also proposed that Capparaceae had a unique event, making this an intriguing event to further investigate. It should be noted, however, that chromosome counts alone may be misleading in concluding that a WGD event has occurred (Evans et al. 2017). Alternatively, looking at just Ks plots for this group, it is still difficult to ascertain if there is a unique event.
Capers are typically much more woody than the others plants we sampled and therefore have 16 longer generation times, which needs to be accounted for when interpreting peaks derived from Ks plots. There is also a lack of agreement between Ks plots derived using FASTKs and DupPipe which estimate Ks values in different ways (pairwise Ks estimates in FASTKs versus estimates of Ks at nodes in gene trees in DupPipe), further confounding evidence for either a presence or absence of a Capparaceae-specific event. Between information presented by Lysak (2018) and the evidence presented here, this possible event certainly warrants additional study.
A separate WGD event in the Resedaceae was also hypothesized by Lysak (2018), here we find good evidence to support its presence. This is one of the few events recovered with consensus between Ks plots (from both FASTKs and DupPipe), phylogenetics, and ortholog divergences concluded that Moringaceae did not experience a family-specific genome duplication. Although we only surveyed two Resedaceae species, we feel this event is well supported and warrants additional sampling and investigation.

Methodological Challenges with Placing WGD Events; Sampling Matters
Currently three types of methods are used to detect WGD; Ks plots to assess for signatures left behind in paralogs after WGD, identification of retained duplicate blocks in a genome, and phylogenetics using individual gene tree topologies, gene counts, and a known species tree, with Ks plots and phylogenomics being the most approachable. All three of these methods however, have their limitations in identifying WGD events. As others have noted, and we have done here too, using a combination of approaches to test hypotheses helps to reduce the chance of proposing events that may not exist and simultaneously provides multiple lines of evidence for those events that are recovered. To account for this, and increase taxon and gene family occupancy in our datasets, we reduced sampling to just the family level. However, at each level of analysis, we had to choose an arbitrary cut-off for the number of duplicates that we felt were sufficient to infer a WGD event, a Using the two different methods as we did here, and across multiple species, allowed us to evaluate and compare putative peaks from different analyses to identify the expected signatures of WGDs. Further, paralogs from WGDs tend to be expressed more than those resulting from tandem duplications (Casneuf et al. 2006), using transcriptome data, as we did here, may actually yield data that is more enriched for WGD duplicates than using (fragmented) genomic data.
Therefore using transcriptome data, as shown by Tiley et al. (2018), may actually improve our success in detecting WGD events.
Overall, the Brassicales are an excellent group of plants to compare methods of WGD identification because of the wealth of genomic data available and known events. With many chromosome level genomes available, analyses based on syntenty, which seem to be regarded as most reliable in detecting these events (Nakatani & McLysaght 2019), can be used as controls for comparing WGD methods. Sequenced genomes, which are placed throughout the Brassicales, provide strong evidence for taxa in which we know do not have recent WGD events (i.e., Cleome violacea and Carica papaya) and taxa that do show evidence for recent WGD events (i.e. Arabidopsis thaliana and many Brassica crops). Because of these resources, we have calibration points that allow us to verify results when testing for novel events. Perhaps this group of plants, combined with recent insights on difficulties in placing WGD events, will help in furthering the development of innovative methods in describing and identifying WGDs.

Taxon Sampling
Sampling of 74 species of 57 genera across the Brassicales spanned seven families (Brassicaceae, Cleomaceae, Capparaceae, Resedaceae, Bataceae, Moringaceae, and Caricaceae), with a focus on the Brassicaceae (48 taxa) and Cleomaceae (17 taxa) (Supp. Table 2). Seeds were grown at the University of Missouri -Columbia or the University of Alberta in a sterile growth chamber environment. At maturity, but before flowering, leaf tissue was collected for both RNA and DNA extraction.   . Table 3). Lastly, two samples were isolated again using the ThermoFisher Invitrogen PureLink RNA mini kit (Invitrogen, Carlsbad, CA, USA), but were sequenced on a HiSeq for 2 X 250 bp reads (Supp. Table 3). All sequencing and library preparation for the above samples was performed by the University of Missouri DNA Core  we were only able to obtain partial regions of the chloroplast genome, and therefore they were not used in downstream analyses (Supp. Table 2 Table 1).

Whole-Genome Duplication
To estimate the phylogenetic placement of whole-genome duplications, PUG v2.1 (github.com/mrmckain/PUG) was used to query putative paralogs over multiple gene trees using the estimated ASTRAL-III tree as the input species tree. For each analysis, we used the original ML gene trees before running them through PhyloTreePruner (i.e., gene trees with all duplicates retained), the ASTRAL-III tree (rooted and with bootstraps removed), and parameters --estimate_paralogs and --outgroups Carica_papaya,Moringa_oleifera as input. Output duplicate gene counts were used only for those nodes with bootstrap values of 80 percent or better.
As another confirmation of duplication events, we constructed histograms giving the distribution of the synonymous rate of divergence (Ks) between paralogs in each transcriptome. This method allows for the potential identification of peaks in the distribution that may be indicative of a WGD event. The position of the peak along the Ks axis provides an estimate of time when the event occurred. Typically the peak closest to time zero (or Ks ~ 0) corresponds to recent tandem duplicates, not relevant to WGD events. Plots of Ks distributions were made for all taxa using To test for the best number of peaks to explain the data, we tested one to four components for each mixture model. We then picked the one with the lowest Bayesian information criterion (BIC) score as the best fit (Supp . Table 4). Although for most taxa four components had the lowest score, we emphasize that this does not mean that there are four WGD duplication events.
To further test for phylogenetic placement of WGD events, ortholog divergence was estimated using OrthoPipe as described in Barker et al. (2010). Using the estimated ortholog divergence and DupPipe Ks estimates, we are able to bookend the position of potential events by comparing when species diverged to the age of an estimated WGD event. If the ortholog divergence between pairs of species is older (larger Ks value) than the estimated age of a WGD event, one can conclude that those species do not share the event, however, if ortholog divergence between species is younger than the WGD, species do share the proposed event.

Accession Numbers
Sequence data from this article can be found in the NCBI SRA data libraries under BioProject accession number PRJNA542714. Individual BioSample accession numbers can be found in Supplemental Table 1.

Supplemental Data
The following materials are available in the online version of this article.  Table 2. Orthogroups retained for each analysis. Table 3. RNA and DNA extraction method, library preparation method, sequencing method, read size, and raw read numbers.

ACKNOWLEDGEMENTS
We would like to our funding sources including NSF grants IOS-1339156 and EF-1550838 to MSB, NSF IOS-1811784 to PDB, NIH PERT K12GM000708 to BS, and NSF DEB-1146603 to JCP. We also thank the University of Missouri DNA Core Staff -Nathan Bivens, Ming-Yi Zhou, and Karen Bromert for their patience and assistance in getting successful libraries for samples which we had very little DNA or RNA for and the University of Missouri Research Computing Support Services (RCSS); specifically Jake Gotburg, for his assistance in running large jobs across the Lewis cluster. We also thank Barb Sonderman for her assistance in helping to keep many of these plants alive. Finally, we thank Sarah Unruh and our anonymous reviewers for providing feedback in improving this manuscript.     Supp. Table 1. Taxon sampling, seed accessions and the collections they are from, additional analysis information, and SRA numbers for both RNA and genome survey sequencing (GSS) raw reads. SSC = small single copy, LSC= large single copy, IR = inverted repeat.
Supp. Table 3. RNA and DNA extraction method, library preparation method, sequencing method, read size, and raw read numbers.