Orthologous nuclear markers and new transcriptomes that broadly cover the phylogenetic diversity of Acanthaceae

Premise Information on orthologous groups of genes, their sequence variability, and annotation is required for project design in phylogenetic reconstruction. This resource is unavailable for the flowering plant family Acanthaceae (>4000 species). Methods We compared transcriptome sequences spanning the extant diversity of Acanthaceae in order to provide a set of orthologous low‐copy nuclear genes and assess their utility for reconstructing phylogenetic relationships within this group of plants. Results We present new transcriptome assemblies for eight species representing all major clades of Acanthaceae. The assemblies of five of these species are entirely based on new sequence data. Of these five species, three are from subfamilies for which no genomic resources were previously available (Nelsonioideae and Thunbergioideae). These five new transcriptomes are more complete than all others from public databases. Furthermore, we provide alignments with sequence information, annotation, and statistics for potential phylogenetic utility of 1619 orthologous low‐copy nuclear markers. Discussion Our method of inferring assemblies from multiple pooled tissue samples delivers more complete transcriptomes than any available ones from Acanthaceae. We make available to the community new resources (e.g., sequence information, variability, and annotation of orthologous low‐copy nuclear genes) that will help phylogenetic reconstruction in Acanthaceae.

(A. J. Borg and J. Schönenberger, unpublished data). The methods currently accessible for sequencing molecular markers without prior sequence information of orthologous loci are either expensive (e.g., whole genome sequencing) or inappropriate (e.g., restriction site-associated DNA, which has low confidence for homology assessment [i.e., potential paralogy, high levels of missing data, and low reproducibility]) for phylogenetic inference at deeper phylogenetic levels or in older clades. The establishment of low-copy nuclear genes (LCNG) suitable for phylogenetic analysis would help to further clarify the evolutionary history of Acanthaceae and of the Lamiales.
Compared to other currently used strategies of genome reduction prior to sequencing in plant systematics, high-throughput targeted capture (Gnirke et al., 2009) offers several advantages (recently reviewed by Johnson et al., 2019) and has been widely applied in plant systematics and evolution. Transcriptome sequences have been successfully used to develop probe sets for targeting nuclear markers in several plant groups (Chamala et al., 2015;Landis et al., 2016;Crowl et al., 2017;García et al., 2017;Villaverde et al., 2018;Johnson et al., 2019;Vargas et al., 2019). The hybridization between RNA probes and DNA sequences is directly linked to their similarity. Hybridization leads to efficient target enrichment if sequence similarity between RNA probes and DNA sequences shows at least 85% similarity (Orin McCormick, RAPiD Genomics, unpublished data). Therefore, we decided to obtain sequence information for designing specific probes for Acanthaceae.
Currently, there is only a single draft genome published for Acanthaceae (Ruellia speciosa Mart. ex Nees, subfamily Acanthoideae; Zhuang and Tripp, 2017), but no genomic resources are available for the subfamilies Thunbergioideae and Nelsonioideae, which comprise some 180 and 172 species, respectively. To maximize the potential for successful hybridization to probe sequences from a set of phylogenetically diverse species such as the Acanthaceae (approximately 80 million years old; Tripp and McDade, 2014), it is critical to include a phylogenetically broad set of taxa when designing probes. In line with this, we generated new transcriptomic data for five species, representing all major clades of Acanthaceae (Fig. 1). Next, we compared our own sequences with transcriptomic data available for Acanthaceae in public data repositories (NCBI Resource Coordinators, 2016). We provide information on the utility for phylogenetic inference of orthologous loci within this plant group. This study provides a much-needed set of nuclear markers that will facilitate phylogenetic reconstruction within the family Acanthaceae, as well as in Lamiales.

Sampling
In order to sample all the major clades (the three first splits in the phylogeny of crown group Acanthaceae, according to Tripp and MacDade, 2014) within Acanthaceae (Fig. 1), we carried out RNA sequencing of five species and added data from three additional species from the National Center for Biotechnology Information Short Read Archive (NCBI SRA). We provide new transcriptomic data for two subfamilies lacking genomic resources: (1) Nelsonioideae (Elytraria caroliniensis (Walter ex J. F. Gmel.) Pers. and (2) Thunbergioideae (Mendoncia retusa Turrill and Thunbergia erecta (Benth.) T. Anderson). In addition, we sequenced two species that represent major lineages within the subfamily Acanthoideae, Pachystachys lutea Nees and Aphelandra aurantiaca (Scheidw.) Lindl. (voucher information given in Appendix 1). In order to further increase our sampling ( Fig. 1) and to obtain more information on sequence variability, we also retrieved RNA sequencing data for the species Acanthus leucostachyus Wall. ex Nees (representing Acantheae), Andrographis paniculata (Burm. f.) Nees (representing Andrographideae), and Avicennia marina (Forssk.) Vierh. (representing Avicennioideae) from the NCBI SRA (Appendix 1). A recent phylogenomic study comprising Acanthaceae sensu stricto presents a different topology from the one presented by Tripp and McDade (2014) (Amanda Fisher, California State University, unpublished data). However, our sampling still comprises all major clades in the family according to this new topology.

Sample preparation and sequencing
All tissues for RNA sequencing were freshly collected in botanical gardens (see Appendix 1 for voucher information). FIGURE 1. Phylogenetic relationships among major clades of Acanthaceae according to Tripp and McDade (2014). Names of species sampled are given on the right side. Taxa for which we provide new data are in green, and taxa for which we used GenBank data are in blue.

Data cleaning, transcriptome assembly, and annotation
Data quality was visually assessed with FastQC version 0.11.4 (Andrews, 2010) before and after data filtering and trimming. Adapter sequences (the first 13 base pairs), low-quality reads (Phred score < 33), and reads shorter than 50 bp were removed with Trimmomatic-0.35 (Bolger et al., 2014). The sequences were assembled into putative transcripts using Trinity version 2.1.1 (Haas et al., 2013). General statistics for quality assessment of transcriptome assemblies were obtained with the package GenomeTools (Gremme et al., 2013). We used the TRAPID pipeline (Van Bel et al., 2013) based on the PLAZA 2.5 database to get protein translations and to assess the number of fully or quasi-fully sequenced transcripts. We used BUSCO version 3.0.2 (Simão et al., 2015) with the embryophyte single-copy ortholog set to assess the completeness of the transcriptome assemblies. All commands used for this study are available at Figshare (https ://figsh are.com/s/7c914 97e3f 1cd0 ceed7 ).

Orthology assessment
To minimize the possibility of obtaining paralogous loci, we aimed at finding orthologous LCNG most appropriate for phylogenomic analyses. These genes are generally highly conserved and are, therefore, not ideal to resolve shallow phylogenetic relationships. However, they often contain introns with greater levels of variability, making them useful for a broad range of phylogenetic analyses even at low taxonomic ranks. We used MarkerMiner (Chamala et al., 2015) to establish groups of orthologous genes by using transcriptome assemblies and their protein translations as input. This approach uses the same predefined set of genes as a reference and has been successfully applied to capture sequences in other phylogenetic studies in angiosperms (e.g., Nicholls et al., 2015;Landis et al., 2016;Crowl et al., 2017;García et al., 2017;Villaverde et al., 2018;Vargas et al., 2019).
There is currently no taxon closely related to Acanthaceae with a well-annotated high-quality genome available. We were unable to use (with exonerate; Slater and Birney, 2005) the Ruellia speciosa genome (Zhuang and Tripp, 2017) as a reference in our analysis due to its low contiguity leading to many fragmentary gene models. Therefore, we decided to use the genome of Arabidopsis thaliana (L.) Heynh. as a reference. For each orthogroup recovered by MarkerMiner, we calculated statistics to estimate its phylogenetic utility (e.g., alignment length, number of variable sites, number of parsimony informative sites, AT and GC content) using AMAS (Borowiec, 2016). The output from MarkerMiner gives wellannotated alignments for each orthogroup, including the boundaries of exonic regions in the assembled transcripts (alignments are available at https ://figsh are.com/s/9903a acaaa 3c34b c9ed9 ).

Transcriptome assembly
We compared transcriptome assemblies (available at https ://figsh are.com/s/aa884 dbe56 5dd1f 453b2 ) of eight species of Acanthaceae, which represent all major clades (Fig. 1) within this family. Five of these species had no transcriptomic data resources previously available in public databases. Of these five species, three are from subfamilies (Nelsonioideae and Thunbergioideae) for which no genomic resources were available at all. Our transcriptome sequencing resulted in a total of 41-58 million raw reads per sample (Table 1). The assembly of quality-filtered and trimmed reads produced 85,504-286,084 contigs per species. The functional annotation from TRAPID identified 7616-86,113 fully or quasi-fully sequenced transcripts per species (Table 1). Transcriptomes were 83-47.4% complete according to BUSCO (Table 2) (Simão et al., 2015). These five new transcriptomes are more complete than all others from public databases (Table 2).

Phylogenetic utility
We found 1619 putative orthologous LCNGs for Acanthaceae (alignments available at https ://figsh are.com/s/9903a acaaa 3c34b c9ed9 ). Here we provide sequence information for bait design in Acanthaceae, offering flexibility of choice based on variability, presence or absence of species, intron size, and gene size. We make available the set of 1619 alignments from which baits are designed. On average, 3.68 species occurred in each of the orthogroups, which exhibited zero to 0.673 variable sites per position (0.448 on average, median value 0.411). We found 50 orthogroups that included transcripts for all species (eight). This number has increased when orthogroups were required to contain sequence data for at least seven, six, five, and four species (160, 369, 590, and 840, respectively). Parsimony informative sites per orthogroup ranged from zero to 2603 (362 on average, median value 225) (statistics for potential phylogenetic reconstruction available at https ://figsh are.com/s/ ebb5b 55c72 1debd accb4 ).

DISCUSSION
The exons of all LCNGs (except nine that do not have any variable site: AT1G21370, AT1G31500, AT1G79120, AT3G25530, AT4G01030, AT4G18975, AT4G28830, AT4G38370, and AT5G14140) were found have the potential to solve phylogenetic relationships at deeper nodes (see alignments available at https :// figsh are.com/s/9903a acaaa 3c34b c9ed9 ). We observed that exons are more conserved among species of the same major clade of Acanthaceae (e.g., Mendoncia and Thunbergia, or Acanthus and Aphelandra). The variation within Thunbergioideae (Mendoncia and Thunbergia) is even lower. In order to resolve relationships among closely related species, targeting flanking regions of exons is an efficient approach to sequence the more variable introns (if the DNA is degraded and sequence reads are short, shorter introns are easier to capture and to sequence), which likely provide more phylogenetic information within the Acanthaceae. In this case, longer sequencing reads are desired to sequence introns captured by RNA baits, which are usually designed for exonic regions. Designing RNA primers for PCR is an alternative to hybrid capture. This method is efficient for amplifying long genes, including the introns (e.g., Valderrama et al., 2018). Vargas et al. (2019) published a python script (GoldFinder) to sub-select markers from the output of MarkerMiner (alignments available at https ://figsh are.com/s/9903a acaaa 3c34b c9ed9 ) according to five relevant criteria for most users who work with molecular phylogenetics and evolution: (1) marker length, (2) percentage of short exons (relative to bait length), (3) number of user's sequences per marker, (4) similarity, and (5) bait number, length, and coverage. GoldFinder makes the sub-selection task automatic and informed, so that it could be easily run using the data provided on Figshare (https ://figsh are.com/s/9903a acaaa 3c34b c9ed9 ). Johnson et al. (2019) developed a universal probe set for targeted sequencing of 353 nuclear genes from any angiosperm. However, the efficiency of hybridization of this probe set varied considerably across the different species tested and in some species/clades it was rather low (e.g., 5%, median for all samples was 24.8%). Johnson et al. (2019) selected sequences with up to 30% divergence to design their probes. Accordingly, the capture efficiency for these targets will most likely be improved by designing baits more specific for Acanthaceae, which have not been included in their study. The transcriptome sequences we make available here are a crucial resource for this purpose and can be used to further improve universal probe sets, such as the one by Johnson et al. (2019).
Our method of inferring assemblies from multiple pooled tissue samples delivers more complete transcriptomes than any previously available from Acanthaceae. In addition to being useful for phylogenetic analyses (the main goal of this study), the data generated here provide a potentially important basis for a wide array of other research projects, such as population genomic analyses, metabolic pathway investigations, gene prediction, crop improvement, and analyses of phenotypic diversity. Here, we provide a comparative analysis of representatives of Acanthaceae with the necessary tools for RNA bait design.

DATA AVAILABILITY
The following data are available on Figshare: transcriptome assemblies (https ://figsh are.com/s/aa884 dbe56 5dd1f 453b2 ), alignments for each low-copy nuclear gene (input for baits design) (https :// figsh are.com/s/9903a acaaa 3c34b c9ed9 ), statistics describing their potential for phylogenetic reconstruction (https ://figsh are.com/s/ ebb5b 55c72 1debd accb4 ), and commands used to perform analyses (https ://figsh are.com/s/7c914 97e3f 1cd0 ceed7 ). Raw sequence data are available in the National Center for Biotechnology Information Sequence Read Archive (accession numbers are shown in Appendix 1). Species names with asterisks refer to data downloaded from the National Center for Biotechnology Information; species without an asterisk refer to transcriptomes generated in this study.