Volume 109, Issue 4 p. 580-601
Open Access

Phylogenomic discordance suggests polytomies along the backbone of the large genus Solanum

Edeline Gagnon

Corresponding Author


Royal Botanic Garden Edinburgh, 20A Inverleith Row, Edinburgh, EH3 5LR UK

School of Biological Sciences, University of Edinburgh, King's Buildings, Mayfield Road, Edinburgh, EH9 3JH UK

Correspondence Edeline Gagnon, TUM School of Life Sciences, Technical University of Munich, Emil-Ramann-Str. 2, 85354, Freising, Germany.

Email: [email protected]


Rebecca Hilgenhof

Royal Botanic Garden Edinburgh, 20A Inverleith Row, Edinburgh, EH3 5LR UK

School of Biological Sciences, University of Edinburgh, King's Buildings, Mayfield Road, Edinburgh, EH9 3JH UK


Andrés Orejuela

Royal Botanic Garden Edinburgh, 20A Inverleith Row, Edinburgh, EH3 5LR UK

School of Biological Sciences, University of Edinburgh, King's Buildings, Mayfield Road, Edinburgh, EH9 3JH UK


Angela McDonnell

Negaunee Institute for Plant Conservation Science and Action, Chicago Botanic Garden, 1000 Lake Cook Rd, Glencoe, Illinois, 60022 USA


Gaurav Sablok

Finnish Museum of Natural History (Botany Unit), University of Helsinki, PO Box 7 FI-00014, Helsinki, Finland

Organismal and Evolutionary Biology Research Programme (OEB), Viikki Plant Science Centre (ViPS), PO Box 65, FI-00014 University of Helsinki, Finland


Xavier Aubriot

Université Paris-Saclay, CNRS, AgroParisTech, Écologie, Systématique et Évolution, Orsay, 91405 France


Leandro Giacomin

Instituto de Ciências e Tecnologia das Águas & Herbário HSTM, Universidade Federal do Oeste do Pará, Rua Vera Paz, sn, Santarém, CEP 68040-255, PA, Brazil


Yuri Gouvêa

Departamento de Botânica, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais—UFMG, Av. Antônio Carlos, 6627, Pampulha, Belo Horizonte, CEP 31270-901, MG, Brazil


Thamyris Bragionis

Departamento de Botânica, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais—UFMG, Av. Antônio Carlos, 6627, Pampulha, Belo Horizonte, CEP 31270-901, MG, Brazil


João Renato Stehmann

Departamento de Botânica, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais—UFMG, Av. Antônio Carlos, 6627, Pampulha, Belo Horizonte, CEP 31270-901, MG, Brazil


Lynn Bohs

Department of Biology, University of Utah, Salt Lake City, Utah, 84112 USA


Steven Dodsworth

School of Life Sciences, University of Bedfordshire, University Square, Luton, LU1 3JU UK

Royal Botanic Gardens, Kew, Richmond, Surrey, TW9 3AE UK


Christopher Martine

Department of Biology, Bucknell University, Lewisburg, Pennsylvania, 17837 USA


Péter Poczai

Finnish Museum of Natural History (Botany Unit), University of Helsinki, PO Box 7 FI-00014, Helsinki, Finland

Faculty of Environmental and Biological Sciences, University of Helsinki, FI-00014, Finland


Sandra Knapp

Department of Life Sciences, Natural History Museum, Cromwell Road, London, SW7 5BD UK


Tiina Särkinen

Royal Botanic Garden Edinburgh, 20A Inverleith Row, Edinburgh, EH3 5LR UK

First published: 16 February 2022



Evolutionary studies require solid phylogenetic frameworks, but increased volumes of phylogenomic data have revealed incongruent topologies among gene trees in many organisms both between and within genomes. Some of these incongruences indicate polytomies that may remain impossible to resolve. Here we investigate the degree of gene-tree discordance in Solanum, one of the largest flowering plant genera that includes the cultivated potato, tomato, and eggplant, as well as 24 minor crop plants.


A densely sampled species-level phylogeny of Solanum is built using unpublished and publicly available Sanger sequences comprising 60% of all accepted species (742 spp.) and nine regions (ITS, waxy, and seven plastid markers). The robustness of this topology is tested by examining a full plastome dataset with 140 species and a nuclear target-capture dataset with 39 species of Solanum (Angiosperms353 probe set).


While the taxonomic framework of Solanum remained stable, gene tree conflicts and discordance between phylogenetic trees generated from the target-capture and plastome datasets were observed. The latter correspond to regions with short internodal branches, and network analysis and polytomy tests suggest the backbone is composed of three polytomies found at different evolutionary depths. The strongest area of discordance, near the crown node of Solanum, could potentially represent a hard polytomy.


We argue that incomplete lineage sorting due to rapid diversification is the most likely cause for these polytomies, and that embracing the uncertainty that underlies them is crucial to understand the evolution of large and rapidly radiating lineages.

Recent advances in high-throughput sequencing have provided larger molecular datasets, including entire genomes, for reconstructing evolutionary relationships (e.g., Ronco et al., 2021). Considerable progress has been made since the publication of the first molecular-based classification of orders and families of flowering plants (APG, 1998), with one of the most recent examples being a phylogenetic tree of the entire Viridiplantae based on transcriptome data from more than a thousand species (One Thousand Plant Transcriptomes Initiative, 2019). While large datasets have strengthened our understanding of evolutionary relationships and classifications across the Tree of Life, several of them have demonstrated repeated cases of persistent topological discordance across key nodes in birds (Suh et al., 2015; Suh, 2016), mammals (Morgan et al., 2013; Romiguier et al., 2013; Simion et al., 2017), amphibians (Hime et al., 2021), plants (Wickett et al., 2014; One Thousand Plant Transcriptomes Initiative, 2019), and fungi (Kuramae et al., 2006). Whereas previous expectations were that these “soft polytomies” would be resolved with the addition of more data, their persistence after the addition of more taxonomic and molecular data has led some authors to suggest that they actually represent “hard polytomies”, i.e., extremely rapid divergence events of three or more lineages at the same time or reticulate evolution due to species hybridization and/or introgression. In an era where obtaining genome-wide sampling of species for phylogenetic reconstruction has become mainstream, the question of whether persistent topological discordance can be resolved with more data or whether it reflects complex biological realities (Jeffroy et al., 2006; Philippe et al., 2011) is becoming increasingly common.

Discordance in phylogenetic signal can be due to three general classes of effects (Wendel and Doyle, 1998): (1) technical causes such as gene choice, sequencing error, model selection, or poor taxonomic sampling (Philippe et al., 2011, 2017); (2) organism-level processes such as rapid or convergent evolution, rapid diversification, incomplete lineage sorting (ILS), or horizontal gene transfer (Degnan and Rosenberg, 2009); and (3) gene and genome-level processes such as interlocus interactions and concerted evolution, intragenic recombination, use of paralogous genes for analysis, and/or non-independence of sites used for analysis. Together, these biological and non-biological processes can lead to conflicting phylogenetic signals between different loci in the genome and hinder the recovery of the evolutionary history of a group (Degnan and Rosenberg, 2009). Consequently, careful assessment of phylogenetic discordance across mitochondrial, plastid, and nuclear datasets is critical for understanding realistic evolutionary patterns in a group, as traditional statistical branch support measures fail to reflect topological variation of the gene trees underlying a species tree (Liu et al., 2009; Kumar et al., 2012).

Here we explore the presence of topological discordance in nuclear and plastome datasets of the large and economically important angiosperm genus Solanum L. (Solanaceae), which includes 1,228 accepted species and several major crops and their wild relatives, including potato, tomato and brinjal eggplant (aubergine), as well as at least 24 minor crop species (website: Solanaceaesource.org, accessed November 2020). Building a robust species-level phylogeny for Solanum has been challenging because of the sheer size of the genus, and because of persistent poorly resolved nodes along the phylogenetic backbone. Bohs (2005) published the first plastid phylogenetic analysis for Solanum and established a set of 12 highly supported clades based on her strategic sampling of 112 species (9% of the total species number in the genus), spanning morphological and geographic variation. As new studies have emerged with increased taxonomic and genetic sampling (e.g., Levin et al., 2006; Weese and Bohs, 2007; Stern et al., 2011; Särkinen et al., 2013; Tepe et al., 2016), the understanding of overall phylogenetic relationships within Solanum has evolved to recognise three main clades: (1) the Thelopodium clade containing three species sister to the rest of the genus; (2) Clade I containing c. 350 mostly herbaceous and non-spiny species (including the Tomato, Petota, and Basarthrum clades that contain the cultivated tomato, potato, and pepino, respectively); and (3) Clade II consisting of c. 900 predominantly spiny and shrubby species, including the cultivated brinjal eggplant (Table 1). The two latter clades are further resolved into 10 major and 43 minor clades (Table 1).

Table 1. Number of species and taxon sampling across major and minor clades of Solanum. Clades are based on groups identified in previous molecular phylogenetic studies (Bohs, 2005; Weese and Bohs, 2007; Stern et al., 2011; Stern and Bohs, 2012; Särkinen et al., 2013; Tepe et al., 2016). Species number for each clade is based on current updated taxonomy in the SolanaceaeSource database (website: solanaceaesource.org, accessed November 2020). The 19 clades sampled in the pruned trees for the principal coordinate analysis in this study are in bold. New associated major clade names are given where applicable. Rows shaded in gray represent major and minor clades belonging to Clade II. The Eastern Hemisphere Spiny clade (EHS, formerly known as Old World spiny clade) comprises almost all the spiny solanums occurring in the eastern hemisphere.
Minor clade Associated major clade (Särkinen et al., 2013) New associated major clade (this study) Species Sampled species (%)
Supermatrix Plastome (PL) Target capture (TC)
Thelopodium Thelopodium 3 3 (100%) 1 (33%) 1 (33%)
African non-spiny M Clade VANAns 14 5 (36%) 1 (7%)
Normania M Clade VANAns 3 2 (67%) 1 (33%) 1 (33%)
Archaesolanum M Clade VANAns 8 8 (100%) 1 (13%) 1 (13%)
Valdiviense M Clade VANAns 1 1 (100%) 1 (100%) 1 (100%)
Dulcamaroid M Clade DulMo 45 25 (56%) 8 (18%) 1 (2%)
Morelloid M Clade DulMo 75 66 (88%) 15 (20%) 1 (1%)
Regmandra Potato Regmandra 12 6 (50%) 4 (33%) 1 (8%)
Herpystichum Potato 10 10 (100%)
Pteroidea Potato 10 10 (100%) 1 (10%)
Oxycoccoides Potato 1 1 (100%)
Articulatum Potato 2 2 (100%)
Basarthrum Potato 16 10 (56%) 3 (19%) 3 (19%)
Anarrhichomenum Potato 12 8 (82%)
Etuberosum Potato 3 2 (67%) 2 (67%) 1 (33%)
Tomato Potato 17 14 (82%) 8 (47%) 3 (18%)
Petota Potato 113 61 (54%) 38 (34%) 2 (2%)
Clandestinum-Mapiriense Clandestinum-Mapiriense 3 3 (100%) 1 (33%) 1 (33%)
Wendlandii-Allophyllum Wendlandii-Allophyllum 10 7 (70%) 1 (10%) 1 (10%)
Nemorense Nemorense 4 4 (100%) 1 (25%)
Pachyphylla Cyphomandra 39 32 (82%) 1 (3%)
Cyphomandropsis Cyphomandra 11 7 (64%) 1 (9%) 1 (9%)
Geminata Geminata 150 68 (45%) 5 (3%) 1 (1%)
Reductum Geminata 2 2 (100%) 1 (50%)
Brevantherum Brevantherum 83 29 (35%) 3 (4%)
Gonatotrichum Brevantherum 7 7 (100%) 1 (14%)
Inornatum Brevantherum 5 2 (40%) 1 (20%)
Trachytrichium Brevantherum 2 2 (100%)
Elaeagniifolium Leptostemonum 5 5 (100%) 1 (20%) 1 (20%)
Micracantha Leptostemonum 14 9 (64%) 1 (7%)
Torva Leptostemonum 54 34 (63%) 5 (9%) 1 (2%)
Erythrotrichum Leptostemonum 33 13 (39%) 1 (3%)
Thomasiifolium Leptostemonum 9 4 (44%) 1 (11%)
Gardneri Leptostemonum 10 8 (80%) 1 (10%)
Acanthophora Leptostemonum 22 13 (59%) 1 (5%) -
Lasiocarpa Leptostemonum 12 12 (100%)
Sisymbriifolium Leptostemonum 4 4 (100%) 1 (25%) 1 (25%)
Androceras Leptostemonum 16 15 (94%)
Crinitum Leptostemonum 23 10 (43%)
Bahamense Leptostemonum 3 3 (100%)
Asterophorum Leptostemonum 4 2 (50%)
Carolinense Leptostemonum 11 8 (73%) 1 (9%)
Hieronymi Leptostemonum 1 1 (100%) 1 (100%)
Eastern Hemisphere Spiny Leptostemonum 332 197 (59%) 24 (7%) 16 (5%)
Campechiense Leptostemonum 1 1 (100%)
Crotonoides Leptostemonum 3 2 (67%) 1 (33%)
Multispinum Leptostemonum 1 1 (100%) 1 (100%)
Unplaced Leptostemonum 9 1 (13%)
TOTALS: 1228 746 (60%) 140 (11%) 39 (3%)

Despite these advancements, phylogenetic relationships between many of the major clades of Solanum have remained poorly resolved, mainly due to limitations in taxon and molecular marker sampling. The most recent genus-wide phylogenetic study by Särkinen et al. (2013), based on seven markers (two nuclear and five plastid) and fewer than half (34%) of the species of Solanum, failed to resolve the relationships among major clades, especially within Clade II and the large component Leptostemonum clade, which includes the Old World spiny clade, comprising almost all spiny Solanum species that occur in the eastern hemisphere. To reduce colonial connotations associated with this name, we hereafter refer to this clade as the Eastern Hemisphere Spiny clade (EHS; Table 1).

To gain a better understanding of the evolutionary relationships of Solanum, we built a new Sanger supermatrix that included 60% of the species of the genus and compared the phylogenetic relationships obtained with the Sanger supermatrix with genus-wide plastid (PL) and nuclear target-capture (TC) phylogenomic datasets. We ask: (1) Does a significant increase in taxon sampling of the supermatrix dataset lead to significant changes in the circumscription of major and minor clades in Solanum? (2) Does increased gene sampling in both plastome and nuclear data resolve previously identified polytomies between major clades? (3) Is there evidence of discordance within and between genomic datasets? and (4) Are areas of high discordance in the Solanum phylogeny better represented by polytomies than by bifurcating nodes? Comparison of the topologies from the different datasets, together with results from discordance analyses, a filtered supertree network, and polytomy tests, leads us to suggest that some of the soft polytomies of Solanum might be hard polytomies caused by rapid speciation and diversification coupled with ILS. We discuss the consequences that such an interpretation has for investigating biogeography and morphological trait evolution across this economically important genus.


Taxon sampling

A Sanger sequence supermatrix was generated including all available sequences from GenBank related to the genus Solanum for nine regions: (1) the nuclear ribosomal internal transcribed spacer (ITS); (2) low-copy nuclear region waxy (i.e., GBSSI); (3) two protein-coding plastid genes matK and ndhF; and (4) five non-coding plastid regions (ndhF-rpl32, psbA-trnH, rpl32-trnL, trnS-G, and trnT-L). Only vouchered and verified samples were utilized. All sequences were blasted against target regions in USEARCH version 11 (Edgar, 2010). Taxon names were checked against SolanaceaeSource synonymy (website: solanaceaesource.org, accessed November 2020) and duplicate sequences belonging to the same species were pruned out to retain a single individual per taxon. A total of 817 Sanger sequences were generated and added to the matrix, adding 129 previously unsampled species and new data for 257 species (Appendix S1). Final species sampling across major and minor clades of Solanum varied from 13 to 100%, with 742 species of Solanum (60% of the 1228 currently accepted species as of November 2020; Table 1). Four species of Jaltomata Schltdl. were used as an outgroup (Appendix S1).
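
The name-checking and pruning step described above can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the synonym table is a one-entry stand-in for the SolanaceaeSource synonymy, the helper names (`SYNONYMS`, `ungapped_length`, `one_per_taxon`) are hypothetical, and retaining the longest sequence per accepted name is an assumption, because the text does not state which individual was kept.

```python
from typing import Dict, List, Tuple

# Hypothetical synonym table (synonym -> accepted name). The real mapping
# would come from the SolanaceaeSource synonymy, not from this sketch.
SYNONYMS: Dict[str, str] = {
    "Lycopersicon esculentum": "Solanum lycopersicum",
}


def ungapped_length(seq: str) -> int:
    """Count sequenced bases, ignoring gaps and missing-data symbols."""
    return sum(1 for base in seq.upper() if base not in "-?N")


def one_per_taxon(records: List[Tuple[str, str]]) -> Dict[str, str]:
    """Return {accepted_name: sequence}, keeping one record per taxon.

    Keeping the longest sequence is an assumption of this sketch; the
    text does not say which individual was retained.
    """
    best: Dict[str, str] = {}
    for name, seq in records:
        accepted = SYNONYMS.get(name, name)
        current = best.get(accepted)
        if current is None or ungapped_length(seq) > ungapped_length(current):
            best[accepted] = seq
    return best


# Toy records for one locus: two entries resolve to the same accepted name.
records = [
    ("Solanum lycopersicum", "ACGT--NN"),
    ("Lycopersicon esculentum", "ACGTACGT"),
    ("Solanum tuberosum", "ACG"),
]
pruned = one_per_taxon(records)
```

In this toy example the two tomato records collapse to a single entry under the accepted name, so each species contributes one sequence per locus to the matrix.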

To assess phylogenetic discordance within Solanum, a set of species was selected for the phylogenomic study to represent all 10 major and as many of the 43 minor clades of Solanum as possible (Table 1), as well as the outgroup Jaltomata. The final sampling included 151 samples for the plastome (PL) dataset (140 Solanum species; Table 1 and Appendix S2) and 40 samples for the target-capture (TC) dataset (39 Solanum species; Table 1 and Appendix S3). For the PL dataset, 86 samples were sequenced using low-coverage genome skimming, and the remaining samples were downloaded from GenBank (November 2019). For the TC dataset, 12 samples were sequenced as part of the Plant and Fungal Trees of Life project (Baker et al., 2021) using the Angiosperms353 bait set (Johnson et al., 2019). In addition, 17 sequences were added from an unpublished dataset provided by A. McDonnell and C. Martine. Sequences for the remaining 12 samples were extracted from the GenBank SRA archive using the SRA Toolkit 2.10.7 (website: https://github.com/ncbi/sra-tools; Appendix S3).

DNA extraction, library preparation and sequencing

Supermatrix Sanger sequencing

DNA extractions for Sanger sequencing were done using DNeasy plant mini extraction kits (Qiagen, Valencia, California, USA) or the FastDNA kit (MP Biomedicals, Irvine, California, USA). Amplification of waxy followed Levin et al. (2005) using two (waxyF with 1171R and 1058F with 2R) or four primer pairs (waxyF with Ex4R, Ex4F with 1171R, 1058F with 3′N, and 3F with 2R). trnT-L was amplified with primers a-d and c-f (Taberlet et al., 1991; Bohs and Olmstead, 2001; Bohs, 2004). ndhF amplification followed Bohs and Olmstead (1997), psbA-trnH followed Sang et al. (1997), matK followed Rosario et al. (2019), ITS and trnS-G followed Levin et al. (2006), and rpl32-trnL and ndhF-rpl32 followed Miller et al. (2009). Sequencing was carried out on ABI automated sequencers at the University of Utah DNA sequencing facility (Salt Lake City, Utah, USA), at the Natural History Museum (London, UK), and at Myleus Biotecnologia (Belo Horizonte, Brazil). Contigs were visually checked in Sequencher version 4.8 (GeneCodes, Ann Arbor, Michigan, USA) and Geneious Prime 2020.1.1 (website: https://www.geneious.com). The combined matrix was 10,908 bp long (Appendix S4). The two most densely sampled regions (trnT-L and ITS) included 84% and 82% of the sampled species, respectively; waxy (54%) and ITS (67%) loci had the most parsimony informative characters (Appendix S4).

PL and TC datasets

DNA for high-throughput sequencing was extracted using the low-salt CTAB method (Arseneau et al., 2017) and quantified on a Qubit fluorometer (Thermo Fisher Scientific, Waltham, Massachusetts, USA). Genome skimming was done at the Institute of Biotechnology, University of Helsinki (Finland). A paired-end genomic library was constructed using the Nextera DNA library preparation kit (Illumina, San Diego, California, USA). Fragment analysis was conducted with an Agilent Technologies (Santa Clara, California, USA) 2100 Bioanalyzer using a DNA 1000 chip. Sequencing was performed on an Illumina MiSeq platform from both ends with a read length of 150 bp. DNA extraction, quantification, and sequencing for TC followed Johnson et al. (2019). All PL and TC reads have been submitted to GenBank and the European Nucleotide Archive (Appendices S2 and S3).

Phylogenetic analyses

Overview of methodological strategy

Ten phylogenetic analyses with different methodological strategies were compared across the supermatrix, PL and TC datasets, to test if the phylogenetic results were robust despite these different choices (e.g., Philippe et al., 2011, 2017; Saarela et al., 2018; Duvall et al., 2020). The Sanger supermatrix analyses based on Maximum Likelihood (ML) and Bayesian inference (BI) were used as a reference to compare results from the PL and TC species trees because the Sanger supermatrix had the most complete taxonomic sampling (Table 2). For the PL dataset, a total of four analyses were compared to test the effect of missing data and sampling on the resulting phylogenies, as well as the effect of different partitioning schemes in IQ-TREE2 (Table 2; Minh et al., 2020b). For the TC dataset, a total of four analyses were compared to test the effect of the phylogenetic method (ML vs. coalescent methods), missing data, and taxonomic sampling on the resulting phylogenies (Table 2). Full methods for all analyses are described below. All bioinformatic analyses were run either on the Toby-G1 server at the Royal Botanic Garden Edinburgh (Scotland, UK), or the Crop Diversity Server from the James Hutton Institute, in Dundee, Scotland, except for the supermatrix ML analysis.

Table 2. Overview of the 10 different analyses conducted across the Sanger supermatrix, plastome (PL), and target capture (TC) datasets. Acronyms indicate how each analysis is referred to in the figures and text. ML = Maximum Likelihood; BI = Bayesian Inference, A353 = Angiosperms353 bait set. See Materials and Methods section for full details.
Dataset | Taxon and genomic sampling | Phylogenetic method | Partitioning scheme | Acronym
Supermatrix | 746 taxa, 9 loci | ML: RAxML | | Supermatrix ML
Supermatrix | 746 taxa, 9 loci | BI: Beast2 | | Supermatrix BI
Plastome (PL) | 151 taxa, full + partial plastomes | ML: IQ-TREE2 | Unpartitioned | PL-151-UP
Plastome (PL) | 151 taxa, full + partial plastomes | ML: IQ-TREE2 | Best-Partition scheme | PL-151-BP
Plastome (PL) | 125 taxa, full plastomes only | ML: IQ-TREE2 | Unpartitioned | PL-125-UP
Plastome (PL) | 125 taxa, full plastomes only | ML: IQ-TREE2 | Best-Partition scheme | PL-125-BP
Target capture (TC) (A353) | 40 taxa, 338 exons | ML: IQ-TREE2 | | TC-min04-ML
Target capture (TC) (A353) | 40 taxa, 338 exons | Coalescent: ASTRAL-III | | TC-min04-ASTRAL-III
Target capture (TC) (A353) | 40 taxa, 303 exons | ML: IQ-TREE2 | | TC-min20-ML
Target capture (TC) (A353) | 40 taxa, 303 exons | Coalescent: ASTRAL-III | | TC-min20-ASTRAL-III

Supermatrix dataset

Sequences were aligned in MAFFT version 7 (Katoh et al., 2005), manually checked, and optimised. Short multi-repeats and ambiguously aligned regions were excluded manually or with trimAl (-gappyout method; Capella-Gutiérrez et al., 2009). Both ML and BI analyses were run on individual loci, as well as on a combined plastid alignment (seven loci in total) to check for topological incongruences, rogue taxa, and misidentified sequences. Visual checks revealed a small number of clear mis-determinations and/or lab errors. A further 26 samples were removed based on high RogueNaRok scores (Aberer et al., 2013). Nuclear sequence data (ITS and waxy) were identified for all known polyploid species (63 species, Appendix S5), and subsequently examined to determine if there were any strong incongruences with the results from the plastid loci. As none were found (Appendices S6 and S7), sequences from these species were kept in the final supermatrix analysis.

Maximum likelihood (ML) and Bayesian inference (BI) analyses were run on all nine loci individually and on the combined plastid dataset (seven loci). ML analyses were run in RAxML-HPC version 8.2.12 (Stamatakis, 2014) on XSEDE on CIPRES Science Gateway version 3.3 (Miller et al., 2010), with 10 independent runs based on unique starting trees. The General Time Reversible (GTR) model with CAT (Tavaré, 1986; Stamatakis, 2006) was used for all partitions. A total of 1,000 non-parametric bootstraps were run; bootstrap support (BS) ≥ 95% was considered strong, 75 to 94% moderate, and 60 to 74% weak.
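
The bootstrap support categories defined above can be written as a small lookup, shown here as an illustrative Python sketch. The function name `bootstrap_category` is hypothetical, and the label "unsupported" for values below 60% is an assumption of the sketch; the text names only the strong, moderate, and weak classes for bootstrap values.

```python
def bootstrap_category(bs: float) -> str:
    """Classify a bootstrap support value (in percent) into the text's bins.

    Thresholds follow the text: >= 95 strong, 75 to 94 moderate,
    60 to 74 weak. The "unsupported" label for values below 60 is an
    assumption of this sketch.
    """
    if not 0 <= bs <= 100:
        raise ValueError("bootstrap support must lie between 0 and 100")
    if bs >= 95:
        return "strong"
    if bs >= 75:
        return "moderate"
    if bs >= 60:
        return "weak"
    return "unsupported"


# Toy support values, one per node of an imaginary tree.
supports = [100, 95, 94, 75, 74, 60, 59]
categories = [bootstrap_category(bs) for bs in supports]
```

Applying one such function to every node keeps the verbal support categories consistent across the ML trees compared in this study.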

BI analyses were run using Beast version 2.6.3 (Bouckaert et al., 2019), with two parallel runs sampling trees every 10,000 generations. ModelTest-NG (Darriba et al., 2020) was used to find the most suitable nucleotide substitution model for the individual loci and combined plastid loci; JC + G4 was specified for the ITS and trnS-G regions, GTR + G4 for the psbA-trnH, trnL-T, rpL32 and matK regions, and the GTR + I + G4 model for all other regions, as well as the combined plastid dataset and the full supermatrix dataset. For all analyses, an uncorrelated log-normal relaxed clock, a birth-death tree prior, and a normally distributed UCLD.mean prior (mean = 1, SD = 0.3) were specified. All runs were checked with Tracer version 1.7.1 (Rambaut et al., 2018) to ensure that adequate effective sample sizes were reached (ESS > 200). LogCombiner and TreeAnnotator were used to generate the final maximum clade credibility tree with a 15% burn-in. Posterior probability (PP) values ≥0.95 were considered strong, and values from 0.75 to 0.94 moderate to weak.

The ML analysis of the Sanger supermatrix was run on the concatenated matrix in RAxML, with the same settings as described above. For the BI analysis, the concatenated supermatrix was partitioned into ITS, waxy, and the plastid genes. Modifications to the analysis included a monophyly constraint on Solanum, and four parallel runs of 60 million generations with two chains, sampling trees every 10,000 generations. The best ML tree was used as a starting topology to speed up convergence of the chains.

PL dataset

Paired reads from genome skimming were cleaned using BBDuk from the BBTools suite (sourceforge.net/projects/bbmap/; ktrimright = t, k = 27, hdist = 1, edist = 0, qtrim = rl, trimq = 20, minlength = 36, trimbyoverlap = t, minoverlap = 24, and qin = 33). Sequence quality was checked with FastQC (Andrews, 2010) and MultiQC (Ewels et al., 2016). Plastome assembly was done using de novo assembly with Fast-Plast version 1.2.6 (website: https://github.com/mrmckain/Fast-Plast), and reference-guided assembly using GetOrganelle version 1.6.2.e (Jin et al., 2020) with the high-coverage plastome sequence of S. dulcamara L. (GenBank KY863443; Amiryousefi et al., 2018). For GetOrganelle, the following settings were used: -w 0.6, -R 20, and -k 85,95,105,127; for Fast-Plast, the Solanales Bowtie index was used for the assembly. Results from both methods were aligned in Geneious and visually checked to determine consistency. Assembly quality was assessed using the reads identified from the Bowtie step in the Fast-Plast analysis, which were mapped against the final recovered plastome sequence using BWA (Li and Durbin, 2010). The mean and standard deviation of coverage depth per base pair were determined by examining the same files in Geneious. Assemblies were annotated using both Chlorobox GeSeq (Tillich et al., 2017) and the “Annotate from database” tool in Geneious using the reference plastid genome of S. dulcamara. Results were compared to ensure that start and stop codons for exon boundaries were congruent. Annotated plastomes were submitted to GenBank (Appendix S2). In total, 55 full plastomes were assembled, with a mean length of 155,498 bp (max. 156,138 bp, min. 154,715 bp; Appendix S2) and a mean coverage of 158 (min. 22, max. 571; Appendix S2), along with 28 partial plastomes (45,398 to 154,598 bp) with a mean coverage of 29 (min. 4, max. 96; Appendix S2). All plastomes had a highly conserved quadripartite structure, with no loss, duplication, or expansion of gene families.

Plastomes from this study and those retrieved from GenBank were aligned in Geneious using MAFFT (Katoh et al., 2005), visually checked, and corrected. A copy of the inverted repeat (IRa) was removed prior to phylogenomic analyses, although 1,189 bp were kept at the beginning of the region to be able to extract the gene that spans the boundary between the small single copy (SSC) and IRa region. We then separated the plastome alignment into: (1) 79 protein-coding regions; (2) 15 introns; and (3) 73 intergenic regions. For each dataset, the ambiguously aligned regions and polyA repeats were removed, using visual checks for the exon and intron regions, and the strict mode of trimAl (Capella-Gutiérrez et al., 2009) for the intergenic regions (Appendix S8). Sequences shorter than 25% of the length of the aligned matrix for each region and columns containing >75% gaps were removed in trimAl (Capella-Gutiérrez et al., 2009) to avoid issues with long branch attraction following Gardner et al. (2021). Two pseudogenes (ycf1 and rps19) at the junction of IRa and the large single copy (LSC) region (Amiryousefi et al., 2018), and four intergenic regions with no parsimony informative characters were excluded from the final analysis. All remaining locus alignments were concatenated for the final PL phylogenetic analyses.
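
The two length and gap filters described above can be sketched in plain Python to make the thresholds explicit. This is an illustration of the filtering logic only and does not reproduce trimAl; the function names are hypothetical, and applying the sequence filter before the column filter is an assumption, since the text does not give the order.

```python
from typing import Dict

GAP_CHARS = "-?"  # symbols treated as gaps or missing data in this sketch


def drop_short_sequences(aln: Dict[str, str], min_fraction: float = 0.25) -> Dict[str, str]:
    """Remove sequences whose ungapped length is below 25% of alignment length."""
    if not aln:
        return {}
    n_cols = len(next(iter(aln.values())))
    kept = {}
    for name, seq in aln.items():
        n_bases = sum(1 for char in seq if char not in GAP_CHARS)
        if n_bases >= min_fraction * n_cols:
            kept[name] = seq
    return kept


def drop_gappy_columns(aln: Dict[str, str], max_gap_fraction: float = 0.75) -> Dict[str, str]:
    """Remove alignment columns in which more than 75% of sequences have a gap."""
    if not aln:
        return {}
    seqs = list(aln.values())
    n_rows, n_cols = len(seqs), len(seqs[0])
    keep = [
        i for i in range(n_cols)
        if sum(seq[i] in GAP_CHARS for seq in seqs) / n_rows <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in aln.items()}


# Toy alignment of one region: sp6 is mostly missing, and the last two
# columns are gaps in four of the five remaining sequences.
alignment = {
    "sp1": "ACGTACGT",
    "sp2": "ACGTAC--",
    "sp3": "ACGTAC--",
    "sp4": "ACG-AC--",
    "sp5": "AC--AC--",
    "sp6": "A-------",
}
filtered = drop_gappy_columns(drop_short_sequences(alignment))
```

Here sp6 is dropped because only one of its eight positions is sequenced, and the final two columns are dropped because 80% of the retained sequences have a gap there.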

To test for the effect of missing data, two datasets were compared: (1) a matrix with 151 taxa containing all 140 species selected for this study with a higher proportion of missing data (147,278 bp long with the second IR removed); and (2) a matrix with 125 samples containing only complete plastid sequences (Appendices S2 and S8).

ML searches were run on all PL datasets in IQ-TREE2 (Minh et al., 2020b) with 1,000 non-parametric bootstraps. Optimal substitution models were determined using –TEST in IQ-TREE2 (Appendix S9). For both PL datasets, topologies from two different partitioning schemes were also compared (unpartitioned vs. best-fit partition scheme based on PartitionFinder; Lanfear et al., 2012) in IQ-TREE2, to test if accounting for variation in substitution rate amongst loci affected the phylogenetic results. BS values ≥95% were considered strong, 75 to 94% moderate, and 60 to 74% weak.

TC dataset

Trimmomatic (Bolger et al., 2014) was used to trim reads (TruSeq3-PE-simpleclip.fa:1:30:6, LEADING:30, TRAILING:30, SLIDINGWINDOW:4:30, MINLEN:36). Read quality was checked with FastQC (Andrews, 2010) and MultiQC (Ewels et al., 2016). Over-represented repeat sequences were removed with CutAdapt (Martin, 2011). HybPiper (Johnson et al., 2016) was used to produce reference-guided de novo assemblies using the reference provided by Johnson et al. (2019). Putative paralogs were identified using the HybPiper script “paralog_retriever.py”. Phylogenies were generated for all 45 loci for which paralog warnings were found using MAFFT (Katoh et al., 2005) and FastTree (Price et al., 2010). Five loci were deleted and several taxa whose paralogs caused paraphyly of clades were excluded from 27 loci (one to seven taxa per locus). A single gene (g5299) presented a clear duplication event and was divided into two separate matrices for downstream analyses.

Default HybPiper settings were used for all but three samples (S. betaceum Cav., S. valdiviense Dunal, and S. etuberosum Lindl.), for which the coverage cutoff was reduced from eight to four to maximise recovery of target genes. One sample (S. terminale Forssk.) was excluded due to poor sequence quality. Only the exon dataset was analyzed in downstream phylogenomic analyses, because the transcriptome dataset showed large differences in the recovered flanking regions of target loci between samples, likely due to post-transcriptional splicing and editing of messenger RNA. The HybPiper script “fasta_merge.py” was used to concatenate all genes together and produce a partition file. In summary, an average of 289 genes per sample were recovered for the TC analysis (min 48, max 340) when the two samples with low numbers were excluded (S. betaceum and S. etuberosum, Appendix S3). Furthermore, to reduce the effect of missing data and long branch attraction, sequences shorter than 25% of the average length for the gene were eliminated. The number of loci retained in the min20 and min04 datasets was 310 and 348, respectively, with the final aligned length varying between 242,272 bp and 261,975 bp (Appendix S10).

The effect of missing data was tested by comparing two different sampling thresholds based on the minimum number of taxa in each of the target gene alignments (min20 vs. min04, i.e., a minimum of 20 taxa per gene and a minimum of four taxa per gene, respectively) using HybPiper (Johnson et al., 2016) to retrieve and filter the genes.
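As an illustration of the two sampling thresholds, the sketch below keeps only genes whose alignment contains at least a given number of taxa. It is not the HybPiper implementation; the function name, gene names, and taxon labels are hypothetical.

```python
def filter_genes_by_min_taxa(gene_to_taxa, min_taxa):
    """Keep genes whose alignment contains at least min_taxa distinct taxa."""
    if min_taxa < 1:
        raise ValueError("min_taxa must be a positive integer")
    return {
        gene: taxa
        for gene, taxa in gene_to_taxa.items()
        if len(set(taxa)) >= min_taxa
    }


# Hypothetical toy input: gene name -> taxa present in its alignment.
toy_genes = {
    "gene_x": [f"taxon_{i}" for i in range(25)],
    "gene_y": [f"taxon_{i}" for i in range(10)],
    "gene_z": [f"taxon_{i}" for i in range(3)],
}

min20_genes = filter_genes_by_min_taxa(toy_genes, 20)  # stricter threshold
min04_genes = filter_genes_by_min_taxa(toy_genes, 4)   # more permissive threshold
print(sorted(min20_genes), sorted(min04_genes))
```

The permissive threshold always retains at least as many genes as the strict one, at the cost of more missing data per gene.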

ML analyses were run on both TC datasets in IQ-TREE2 (Minh et al., 2020b) with partitioning between loci. In addition, IQ-TREE2 was used to generate individual ML trees for each locus, and the resulting phylogenetic trees were used for coalescent analyses with ASTRAL-III version 5.7.3 (Appendix S9; Zhang et al., 2018), where tree nodes with <10% BS values were collapsed using Newick Utilities version 1.5.0 (Junier and Zdobnov, 2010). Trees with excessively long branches were identified using phyx (Brown et al., 2017) by looking at tree lengths and root-to-tip variation (command “pxlstr”); seven gene trees with excessively long branches were identified and excluded for the min20 and ten for the min04 datasets, leading to a total of 303 and 338 gene trees being used for the respective coalescent analyses. Branch support was assessed using local PP support (Sayyari and Mirarab, 2016) calculated in ASTRAL-III, where PP values ≥0.95 were considered strong, 0.75 to 0.94 weak to moderate, and ≤0.74 as unsupported.
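The support categories used for local PP values can be written as a small helper. This is a sketch only: it assumes that a value of exactly 0.95 counts as strong (as in the figure legends), and the function name is hypothetical.

```python
def classify_local_pp(pp):
    """Map a local posterior probability to the support category used in the text."""
    if not 0.0 <= pp <= 1.0:
        raise ValueError("local PP must lie between 0 and 1")
    if pp >= 0.95:  # assumption: 0.95 itself is treated as strong
        return "strong"
    if pp >= 0.75:
        return "weak to moderate"
    return "unsupported"


# Two local PP values of the kind reported for ASTRAL-III analyses.
for value in (0.82, 0.4):
    print(value, classify_local_pp(value))
```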

Discordance analyses

Comparison of resulting species trees

Topological congruence and discordances between all 10 topologies generated were assessed visually by generating graphical representations through custom R-scripts using the following packages: “ggtree” (Yu, 2020), “stringr” (Wickham and Wickham, 2019), “ape” (Paradis and Schliep, 2019), “ggplot2” (Villanueva and Chen, 2019) and “gridExtra” (Auguie, 2017). To facilitate comparisons, all trees were reduced to include the outgroup Jaltomata and nine taxa representing the following clades of Solanum, which were recovered in all analyses: Thelopodium, Regmandra, Potato, Morelloid (as a representative of both the Dulcamaroid and Morelloid clades), Archaesolanum, S. anomalostemon S.Knapp & M.Nee (species sister to Clade II), Acanthophora (minor clade of the Leptostemonum) and two representatives of the EHS clade (Table 1). The species sampled in the PL and TC datasets were identical for all except three minor clades, in which different closely related species were sequenced (Acanthophora: S. viarum Dunal/S. capsicoides All.; Morelloid: S. opacum A.Braun & C.D. Bouché/S. americanum Mill.).

Concordance factors

Phylogenomic discordance was measured using gene concordance factors (gCF) and site concordance factors (sCF) calculated in IQ-TREE2 (Minh et al., 2020a). These metrics assess the proportion of gene trees that are concordant with different nodes along the phylogenetic tree and the number of informative sites supporting alternative topologies. Low gCF values can result from limited information (i.e., short branches), genuine conflicting signal, or both; low sCF values (~30%) indicate lack of phylogenetic information in loci (Minh et al., 2020a). The metrics were calculated using the TC-min20-ASTRAL-III topology (303 genes) and the PL IQ-TREE2 topology of 151 species (unpartitioned), where sampling was reduced to 21 and 34 tips in the TC and PL topologies, respectively, retaining a single tip for each of the different minor and major clades. An additional tip was retained for the EHS Clade to visualize the gCF and sCF for the crown node of that lineage.
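The idea behind gCF can be sketched in a few lines. The example below is a simplification rather than the IQ-TREE2 procedure: it assumes every gene tree contains all taxa (so every gene tree is decisive for the branch) and represents each gene tree simply as a list of its internal bipartitions. All names and the toy trees are hypothetical.

```python
def canonical_split(side, all_taxa):
    """Represent a bipartition as an unordered pair of taxon sets."""
    side = frozenset(side)
    return frozenset({side, frozenset(all_taxa) - side})


def gene_concordance_factor(branch_side, gene_tree_splits, all_taxa):
    """Percentage of gene trees that contain the species-tree branch.

    Simplifying assumption: every gene tree has complete taxon sampling,
    so all gene trees are decisive for the branch.
    """
    if not gene_tree_splits:
        raise ValueError("at least one gene tree is required")
    target = canonical_split(branch_side, all_taxa)
    concordant = sum(
        1
        for splits in gene_tree_splits
        if target in {canonical_split(s, all_taxa) for s in splits}
    )
    return 100.0 * concordant / len(gene_tree_splits)


# Hypothetical toy data: five taxa, four unrooted gene trees.
taxa = {"A", "B", "C", "D", "E"}
gene_trees = [
    [{"A", "B"}, {"A", "B", "C"}],  # ((A,B),C,(D,E))
    [{"A", "B"}, {"D", "E"}],       # same topology, splits written differently
    [{"A", "C"}, {"D", "E"}],       # ((A,C),B,(D,E))
    [{"B", "C"}, {"A", "D"}],       # ((B,C),E,(A,D))
]
print(gene_concordance_factor({"A", "B"}, gene_trees, taxa))
print(gene_concordance_factor({"D", "E"}, gene_trees, taxa))
```

With incomplete taxon sampling, only gene trees that are decisive for a branch should be counted in the denominator, which this sketch does not attempt.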

Network analyses and polytomy tests

The presence of reticulate evolution and conflicting signals in gene trees in the TC dataset was explored by generating a filtered supertree network in SplitsTree 4 (Huson and Bryant, 2006) of the TC min20 dataset (303 genes) collapsing branches with <75% local PP support with a minimum number of trees set to 50% (151 trees). Polytomy tests were carried out in ASTRAL-III (Sayyari and Mirarab, 2018), using the ASTRAL-III topologies of the two datasets (min20 and min04). Gene trees were used to infer quartet frequencies for all branches to determine the presence of polytomies while accounting for ILS. The analysis was run twice to minimize gene tree error.
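The logic of the polytomy test can be illustrated with a simplified sketch. Under the multispecies coalescent, a branch of length zero implies that the three possible quartet topologies around it are expected at equal frequencies (one third each), so departure from equal frequencies can be assessed with a chi-square test with two degrees of freedom. The code below assumes that the three quartet counts are already available and does not reproduce how ASTRAL-III averages quartet frequencies or computes the effective number of gene trees; the function name and counts are hypothetical.

```python
import math


def polytomy_chi_square(n1, n2, n3):
    """Chi-square test of equal quartet frequencies (null: zero-length branch).

    Returns (statistic, p_value). With two degrees of freedom the
    chi-square survival function has the closed form exp(-x / 2).
    """
    counts = (n1, n2, n3)
    total = sum(counts)
    if total <= 0 or min(counts) < 0:
        raise ValueError("quartet counts must be non-negative with a positive sum")
    expected = total / 3.0
    statistic = sum((obs - expected) ** 2 / expected for obs in counts)
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value


# Hypothetical counts for two branches, each summarising 303 gene trees.
balanced = polytomy_chi_square(105, 100, 98)  # near-equal frequencies
skewed = polytomy_chi_square(200, 52, 51)     # one topology dominates

for label, (stat, p) in (("balanced", balanced), ("skewed", skewed)):
    decision = "polytomy not rejected" if p >= 0.05 else "polytomy rejected"
    print(label, round(stat, 3), decision)
```

A branch with near-equal quartet frequencies fails to reject the null hypothesis and is a candidate for collapse, whereas a branch dominated by one quartet topology rejects it.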


Phylogenetic analyses

Congruent recovery of major clades

All three datasets, including the supermatrix and the two phylogenomic datasets (PL and TC), recovered previously recognized major clades in Solanum (Figures 1 and 2A, C); a few minor clades, concentrated in Clade II, were found to be polyphyletic in the supermatrix phylogeny, including the Mapiriense-Clandestinum, Sisymbriifolium, Wendlandii-Allophyllum and Cyphomandropsis minor clades (Appendices S11 and S12); comparison with PL and TC phylogenies is not possible, as only one species of each clade was sampled in these datasets. In Clade I, nearly all specimens of the Dulcamaroid clade formed a monophyletic group. The only exception concerned S. alphonsei Dunal, sampled here for the first time. In both the supermatrix and PL analyses, this species was sister to S. valdiviense of the Valdiviense clade, with maximum branch support in the PL analyses (Figure 2, Appendix S13).

Figure 1. Supermatrix phylogeny from Maximum Likelihood analysis (RAxML) of 742 Solanum species based on two nuclear and seven plastid regions. Bootstrap branch support values are color-coded: black = strong (0.95–1.0), white = moderate to weak support (0.75–0.94). Dashed lines in phylogeny indicate relationships that were not recovered in the TC and PL analyses (see Figures 2, 3). Clade names refer to major and minor clades discussed in the text (see Table 1).
Figure 2. Comparison of Solanum clades recovered in plastome (PL) and target-capture (TC) phylogenomic datasets. (A) Plastome phylogeny from the unpartitioned maximum likelihood analysis (PL-151-UP) based on 160 loci representing exons, introns and intergenic regions; (B) Filtered supertree network of the TC dataset (min20) based on 303 gene trees with a 50% minimum tree threshold; (C) TC phylogeny with 40 species from coalescent analysis (TC-min20-ASTRAL-III). Clades are shown in the same color in all three phylogenies to enable comparison. Branch support values (BS values in (A) and local PP values in (C)) are color coded: black = strong (0.95–1.0), white = moderate to weak (0.75–0.94). Scale bars = substitutions/site. Collection or GenBank numbers are indicated in the PL phylogeny for duplicate species sampled in the phylogenetic trees.

Despite these minor novelties, all analyses recovered the Thelopodium clade as sister to the rest of Solanum (Figures 1 and 2; Appendices S11 to S15). The Potato clade was strongly supported across all analyses (Figures 1 and 2; Appendices S11 to S15), as was the Regmandra clade in supermatrix and PL analyses (only one sample in TC phylogenies). Furthermore, all analyses recovered a clade here referred to as DulMo that includes the Morelloid and Dulcamaroid clades (Figures 1 and 2; Appendices S11 to S15). A new strongly supported clade, here referred to as VANAns clade and comprising the Valdiviense (including S. alphonsei, see below), Archaesolanum, Normania, and the African non-spiny clades, was found across all analyses (Figures 1 and 2; Appendices S11 to S15).

Clade II was supported as monophyletic across all topologies (Figures 1 and 2A, C), with maximum branch support in all 10 species trees (Appendices S11 to S15). While differences in sampling prevent thorough comparisons of relationships between clades within Clade II, there were no deep incongruences detected amongst topologies obtained with the supermatrix, PL, and TC datasets (Figures 1 and 2A, C; Appendices S9 to S15). Within Clade II, the large Leptostemonum clade (the spiny solanums) was strongly supported in all cases (Figures 1 and 2A, C; Appendices S11 to S15).

Incongruent relationships amongst clades and impact of different analyses

Overall, the use of different phylogenetic analyses and different levels of missing data and taxon sampling had little impact on the relationships recovered amongst clades within each dataset. The BI and ML supermatrix analyses were identical in terms of composition and relationships of major clades (Figure 3B), as were the four PL species trees (Figure 3D, E). There were some differences amongst the topologies of the TC datasets, but these differences concerned branches which had little support (Figure 3A–C). Between supermatrix, PL and TC datasets, however, major incongruences between species trees were observed with respect to the relationships among the main clades identified in the section above (Figures 1, 3).

Figure 3. Comparison of Solanum clades recovered in the three different datasets. (A) TC ASTRAL-III phylogeny of the min20 dataset, with local posterior probabilities indicated at nodes; (B) ML and BI phylogenies of supermatrix dataset, with bootstrap support and posterior probabilities indicated at nodes; (C) TC ML phylogeny of the min20 dataset, with local posterior probabilities indicated at nodes; (D) PL ML phylogenies of the unpartitioned and best partition-scheme of the 151 taxa dataset, with bootstrap support for each respective analysis indicated at nodes; (E) TC ML phylogeny and ASTRAL-III phylogeny of the min04 dataset, with bootstrap support and local posterior probabilities indicated at nodes; (F) PL ML phylogenies of the unpartitioned and best partition-scheme of the 125 taxa dataset, with bootstrap support for each respective analysis indicated at nodes.

While the BI and ML supermatrix phylogenies supported the monophyly of the previously recognised Clade I that includes most non-spiny Solanum clades (Figure 1; Appendices S11 and S12), the PL and TC phylogenetic trees resolved clades associated with Clade I as a grade relative to Clade II (Figure 2A, C; Appendices S13 to S15). This was due in large part to the unstable position of the Regmandra clade, which was subtended by a particularly short branch and resolved in different positions along the backbone in all three datasets (Figure 3). For example, the ML supermatrix analysis recovered the Regmandra clade as sister to the Potato clade with strong to moderate branch support (Figure 3B), although the BI supermatrix analysis could not resolve whether the Regmandra clade was sister to the DulMo + VANAns clade or the Potato clade (Figure 3B, Appendix S12). In contrast, the PL analyses resolved Regmandra as sister to the M clade + Clade II, with either maximal or no branch support at all (Figure 3). The TC species trees resolved Regmandra as sister to the Potato clade, DulMo, and Clade II, with maximum support (Figure 3). While one of the TC ASTRAL-III analyses also recovered this topology with moderate support (local posterior probability 0.82, Figure 3), the other TC ASTRAL-III analysis resolved Regmandra as sister to the VANAns clade, but without any branch support (local PP 0.4, Figure 3).

The previously identified M clade, composed of the VANAns and DulMo clades, was not supported by all analyses (Figure 3). While all PL ML analyses recovered the M clade with maximum BS values (Figure 3), none of the TC analyses recovered it. Instead, they resolved the DulMo clade as sister to the Potato clade, with maximal BS or local PP support values (Figure 3). Furthermore, the VANAns clade was recovered as sister to the rest of Solanum (excluding the Thelopodium clade) with moderate support in the TC ML analyses. Placement of the VANAns clade in the TC ASTRAL-III analyses had low or no support, being resolved as either sister to DulMo, or sister to the rest of Solanum, excluding the Thelopodium clade (Figure 3).

In addition, the position of the Potato clade within Solanum was incongruent between datasets: whereas it was resolved as sister to Regmandra in the supermatrix analysis, it was resolved as sister to the remaining Solanum in the PL dataset, and sister to the DulMo clade in all TC analyses (Figure 3), all with strong branch support. The phylogenomic datasets also showed incongruent positions for the Etuberosum clade within the larger Potato clade, where TC analyses resolved it as sister to the Petota clade with maximum local PP support in the ASTRAL-III analyses (Appendix S15); in the ML analyses, this position either had moderate BS values (76%) or was found to be nested within the Petota clade with no branch support (Appendix S14). In contrast, PL analyses placed the Etuberosum clade as sister to the Tomato clade with maximum branch support (Appendix S13).

Finally, the BI and ML supermatrix phylogenies resolved the morphologically unusual S. anomalostemon as sister to the rest of Clade II (BS 95%, PP 1.0; Figure 3, Appendices S11 and S12). This contrasts with results from previous analyses, which found it to be part of the Mapiriense clade (Särkinen et al., 2015). PL analyses supported S. anomalostemon + Brevantherum clade as sister to the rest of Clade II with high branch support (Appendix S13). In the TC analyses, S. anomalostemon was also found to be sister to Clade II, although the Brevantherum clade was not included in these analyses, preventing a strict comparison (Figure 3). Two other taxa were found to represent single-species lineages: S. polygamum Vahl as sister to the Leptostemonum clade and S. euacanthum Phil. as sister to the EHS clade (Appendices S11 and S12). Within the Leptostemonum clade, the EHS clade was strongly supported in all analyses (Figures 1, 3). There were, however, some minor differences in species-level relationships for closely related species of the Eggplant clade and Anguivi Grade (viz. S. campylacanthum Hochst. ex A.Rich., S. melongena L., S. linnaeanum Hepper & P.-M.L.Jaeger, S. dasyphyllum Schum. & Thonn., and S. aethiopicum L.; Figures 1 and 2A, C; Appendices S11 to S15).

Discordance analyses

Concordance factors

Phylogenomic discordance was generally high across the PL and TC topologies, with gCF values >50% in only three nodes in the PL phylogeny (Solanum as a whole, S. chilense (Dunal) Reiche + S. lycopersicum L. or the Tomato clade, and S. hieronymi Kuntze + S. aridum Morong in the Leptostemonum clade; Figure 4). Elsewhere, along the backbone of the PL phylogeny, gCF fell to 39% and below (8 nodes with gCF values 10% and below), with the lowest values found near branch nodes that varied the most amongst the different reconstructed species trees. This included the node subtending Regmandra (gCF 4%, sCF 38%; Figure 4), and that positioning Regmandra + DulMo + VANAns clade as sister to Clade II (gCF 2%, sCF 31%). Similarly, low gCF and uninformative sCF values around 33% were found across Clade II, including the node placing S. hieronymi + S. aridum as sister to the Elaeagnifolium + EHS minor clades (gCF 6%, sCF 36%; Figure 4), as well as the placement of the Erythrotrichum + Thomasiifolium clades within the large Leptostemonum clade (gCF 5%, sCF 23%; Figure 4).

Figure 4. Discordance analyses within and between the plastome (PL) and target capture (TC) phylogenomic datasets across Solanum. Rooted TC ASTRAL-III phylogeny (left) and PL IQ-TREE2 phylogeny (right) with gene concordance factor (gCF) and site concordance factor (sCF) values shown as pie charts, above and below each node respectively; the PL topology is the unpartitioned ML analysis of 151 taxa, whereas the TC topology is based on the analysis of 40 taxa and 303 genes recovered from the A353 bait set. Both trees have been pruned to retain a single tip for each of the major and minor clades present within the PL and TC datasets. For gCF pie charts, blue represents proportion of gene trees concordant with that branch (gCF), green is proportion of gene trees concordant for 1st alternative quartet topology (gDF1), yellow support for 2nd alternative quartet topology (gDF2), and red is the gene discordance support due to polyphyly (gDFP). For the sCF pie charts: blue represents proportion of concordance across sites (sCF), green support for 1st alternative topology (quartet 1), and yellow support for 2nd alternative topology (quartet 2) as averaged over 100 sites. Percentages of gCF and sCF are given above branches, in bold. Branch support (local posterior probability) values ≥0.95 are not shown, and 0.94 and below are shown in italic grey, on the right; double-dash (--) indicates that the branch support was unavailable due to rooting of the phylogenetic tree.

Across the TC phylogeny, gCF and sCF values were slightly higher on average, with three nodes presenting values >50% for both metrics, i.e., one within the Petota clade (gCF 67%, sCF 69%; Figure 4), one at the base of the Leptostemonum clade (gCF 64%, sCF 72%; Figure 4), and another at the base of the EHS clade within Leptostemonum (gCF 58%, sCF 75%; Figure 4). Three nodes had low gCF values of 10% or less, with again some of the lowest values located near the base of the tree, including the relationship of Regmandra as sister to the VANAns clade (gCF 3%, sCF 39%; Figure 4), or placement of Potato as sister to the DulMo clade (gCF 10%, sCF 41%; Figure 4), and the relationship of the Potato + DulMo clades as sister to Clade II (gCF 4%, sCF 41%; Figure 4).

Network analyses and polytomy tests

A high amount of reticulation/gene tree conflict was recovered between major clades of Solanum previously assigned to Clade I (e.g., Thelopodium, Regmandra, Potato, DulMo, VANAns), as well as with some lineages belonging to Clade II in the filtered supertree network using the TC data with 303 genes (min20; Figure 2B). The network clearly supported the monophyly of the Leptostemonum and the EHS clade (Figure 2B), corresponding to the nodes with high gCF and sCF values in the TC ASTRAL-III phylogeny (N1 and N2, Figure 4).

The polytomy tests carried out for the two TC ASTRAL-III datasets resulted in 10 nodes each for which the null hypothesis of branch lengths equal to zero could not be rejected, suggesting they should be collapsed into polytomies (Appendix S16); these nodes corresponded to the ones subtending the Regmandra, Leptostemonum and EHS clades, but were also located within the VANAns clade as well as within Clade II. Polytomies were also detected with the Petota clade, including at the base of the Tomato clade (min04 dataset, Appendix S16), and at the base of the Etuberosum + Petota + Tomato clade (min20 dataset, Appendix S16). Repeating the analysis by collapsing nodes with <75% local PP support led to the collapse of 12 to 13 nodes across the analyses, most of them affecting the same clades as in the previous runs, but also leading to the collapse of the crown node of Solanum. When nodes with <75% local PP support were collapsed, the effective number of gene trees was too low to carry out the test for two nodes subtending S. betaceum and S. anomalostemon, most likely because of the low number of genes recovered for S. betaceum (Appendix S3).


The results of the ten phylogenetic analyses conducted here provide an updated evolutionary framework for the large and economically important genus Solanum, demonstrating that the major and minor clades within the group are stable (with a few noteworthy exceptions, see below). However, the strong levels of nuclear and nuclear-plastome discordance uncovered in the PL and TC analyses, in combination with the network analysis and polytomy tests, suggest that there are polytomies present along the backbone of the phylogeny. We first discuss the stability of the clades within Solanum, and the discovery of a few novel minor clades. We then examine the nuclear-plastome discordance and polytomies recovered and explore the possible causes underlying these, and their implications for the study of biogeography and trait evolution.

Updated evolutionary framework for Solanum

The supermatrix phylogeny, despite being based on only nine loci, nearly doubles the species sampling, confirming the monophyly of most major and minor clades established in previous analyses (Särkinen et al., 2013) and the polyphyly of three minor clades (Pachyphylla, Cyphomandropsis, and Allophyllum, the latter including species of the Mapiriense-Clandestinum clade). It also reveals three new minor clades in Solanum comprising a single species each and confirms the placement of 129 previously unsampled species (e.g., S. alphonsei in the Valdiviense clade and S. graveolens Bunbury in the Cyphomandra clade; Appendices S11 and S12). Meanwhile, the phylogenomic analyses with increased gene sampling reveal a previously undetected major clade referred to as VANAns comprising four minor clades (Valdiviense, Archaesolanum, Normania, and African non-spiny clades). Finally, our results did not support two previously resolved major clades due to nuclear-plastome discordance (Clade I and the M clade; Figure 2). Detailed molecular systematic studies with increased taxon and genetic sampling will be required to fully resolve the circumscription of all the major and minor clades recovered with diagnostic features, including the new ones identified here (Hilgenhof et al., unpublished manuscript).

Overall, our results establish that the taxonomic framework used in Solanum dividing the large genus into major and minor clades is robust, based on both phylogenomic datasets recovering the same major clades independent of methodological choices compared to the Sanger sequence supermatrix (e.g., Thelopodium, Regmandra, Potato, DulMo, VANAns, Clade II, Leptostemonum, and EHS clade). The major and minor clades currently used as informal infrageneric groups in Solanum were first established by Bohs (2005) based on a single locus of c. 2000 bp in length (ndhF). Our results demonstrate that larger species and gene sampling support the clades established earlier (e.g., Weese and Bohs, 2007; Särkinen et al., 2013). However, increased gene sampling provided by the two phylogenomic datasets does not help to resolve any of the polytomies along the backbone of Solanum close to the crown node and along the backbone of Clade II (Särkinen et al., 2013).

Nuclear and nuclear-plastome discordance

Our results reveal three regions of the Solanum phylogeny with gene discordance and low gCF and sCF values in the PL and TC datasets (Figure 4). These regions with nuclear discordance include: (1) the backbone of Solanum near the crown node of the genus where major clades previously identified as Clade I diverge (from here on referred to as Grade I); (2) the backbone of the large Leptostemonum clade; and (3) the backbone of the EHS clade within the Leptostemonum (Figures 2B and 3). Many of the branches within these regions are extremely short in both PL and TC phylogenomic datasets (Figures 1 and 2; Appendices S11 to S15), and network analyses of the nuclear dataset reveal reticulation in one of them (Grade I, Figure 2B). Polytomy tests confirm that multiple nodes within all three regions should be collapsed in the TC dataset (Appendix S16) and support the recognition of these regions as polytomies. Hence, we refer to these three regions of the phylogeny as polytomies from here on.

Further exploration of the polytomies reveals nuclear-plastome discordance within Grade I, relating to the position and relationship between Regmandra, Potato, DulMo and VANAns clades (Figures 3 and 4). No signal of nuclear-plastome discordance was detected in the other polytomies based on the species sampling presented here (Figures 3 and 4), but increased species sampling will be needed to confirm these results.

Altogether, our results indicate the presence of three polytomies which differ somewhat in nature. The deepest of these polytomies along the backbone of Solanum near the crown node shows high nuclear and nuclear-plastid discordance with reticulation evident even within the nuclear phylogenomic dataset (Figure 2B). This polytomy could be referred to as a hard polytomy because it will probably be difficult to resolve even with more genomic data, due to its deeper position in the phylogeny in terms of evolutionary depth and time, the presence of clear nuclear-plastome discordance, short branch lengths and evidence for reticulation within the nuclear phylogenomic dataset. In contrast, the other two polytomies along the backbone of Leptostemonum and the EHS clades are at shallower evolutionary depth and show nuclear discordance only without clear/widespread reticulation in the nuclear dataset (Figure 2B). These polytomies represent simpler cases and may turn out to be possible to resolve with more genomic data. In either case, to confirm whether the polytomies recovered here are truly “hard” or “soft”, denser taxon sampling and more genomic data will be required to carry out more rigorous tests concerning the cause of the gene discordance observed here.

What is causing genomic discordance in our dataset?

Finding genomic discordance in our phylogenomic datasets is unsurprising, given that it has also been found in many other phylogenomic studies in the Solanaceae, including Nicotiana (Dodsworth et al., 2020), the Capsiceae (Capsicum and relatives; Spalink et al., 2018), subtribe Iochrominae (Gates et al., 2018), Jaltomata (Wu et al., 2019), and two studies of Solanum involving the Tomato (Strickler et al., 2015; Pease et al., 2016) and Petota clades (Huang et al., 2019). ILS was shown to be responsible for the widespread discordance found in phylogenomic data in the diploid Tomato clade (Strickler et al., 2015; Pease et al., 2016), while hybridization and introgression have been argued to be behind genomic discordance in the Petota clade, which includes many polyploids (Huang et al., 2019).

Potential processes responsible for nuclear or nuclear-plastome discordance involve gene introgression, ILS, hybridization, and polyploidization; distinguishing between these remains difficult even with increased genomic sampling involving custom bait sets (Larridon et al., 2020; Koenen et al., 2021) or whole-genome sequences (Suh, 2016; Malinsky et al., 2018; Williams et al., 2021). Comparison of the nuclear and plastome topologies in our study does not indicate any obvious chloroplast capture events that could explain the observed nuclear-plastome discordance along the backbone of Solanum near the crown node. Furthermore, cytogenetic and chromosome studies show no evidence for genome duplication or polyploidy along the three polytomies discovered here, despite the three-fold increase in genome size between the distantly related potato (S. tuberosum L., Potato clade) and eggplant (S. melongena, Leptostemonum clade; Barchi et al., 2019). Chromosome counts indicate that the ancestor of Solanum was diploid, i.e., a large majority of Solanum species are reported to be diploid (>97% of the 506 species for which chromosome counts are available), and mapping of ploidy level across the phylogeny indicates that most of the lineages involved in the three polytomy regions identified here are diploid (Chiarini et al., 2018). Polyploidy has arisen independently within the Archaesolanum, Petota, Morelloid, Caroliniense, Elaeagnifolium, and EHS minor clades within the larger Leptostemonum clade (Chiarini et al., 2018), and hybridization/introgression has been argued to be the cause behind phylogenomic discordance found in the Petota clade (Huang et al., 2019). Gene duplication could explain the signal recovered here for the EHS clade but is unlikely to explain the discordance observed here. Save for one locus, our analyses did not detect the presence of paralogs in our nuclear dataset.

Currently, the most likely explanation for the discordance along the backbone of Solanum is ILS caused by rapid speciation. Two of the polytomies include the most species-rich (Table 1) and rapidly diversifying lineages of Solanum, the Leptostemonum and the EHS clades (Echeverría-Londoño et al., 2020), whose crown ages have been estimated to be 8 to 11 and 4 to 6 million years (Myr), respectively (Särkinen et al., 2013). The backbone of Solanum near the crown node has been estimated to be almost twice as old as the Leptostemonum clade (13 to 17 Myr; Särkinen et al., 2013) yet shows a strong signal of nuclear-plastome discordance. While past studies have not detected any increased rates of diversification near the crown node of Solanum, detecting diversification rate shifts remains a challenge (Louca and Pennell, 2020), especially in older nodes. Hence, we cannot fully exclude the option that ILS and rapid speciation have taken place close to the crown node of the genus.

Presence of short internal branches is typical of ILS in lineages with large population sizes and high mutation rates (Schrempf and Szöllősi, 2020). This fits with the biology of Solanum in general, which is typically known to contain “weedy”, disturbance-loving pioneer species resilient to change. Many species are known to have large geographical ranges and ecological amplitude, including globally distributed weeds from the Leptostemonum, Brevantherum and Morelloid clades, such as S. elaeagnifolium Cav., S. caroliniense L., S. torvum Sw., S. erianthum D.Don, S. mauritianum Scop., S. americanum, and S. nigrum L. (Knapp et al., 2017, 2019; Cowie et al., 2018; Särkinen et al., 2018). Some of the weedy characteristics found in these species include the ability to improve fitness and defense traits in response to disturbance (Chavana et al., 2021), as well as having allelopathic properties which allow them to establish themselves to the detriment of native vegetation (Cowie et al., 2018). If such characteristics were present in ancestral Solanum, they could have promoted rapid speciation across the globe, followed by rapid morphological evolution and speciation within areas. The patterns observed here could possibly be the result of three major rapid speciation “pulses” across the evolutionary history of Solanum, involving lineages close to the crown node of Solanum, Leptostemonum, and the EHS clade. The idea of an ecologically opportunistic ancestor is supported by the tendency of many of the major clades near the crown node of Solanum to occupy periodically highly stressed and disturbed habitats, including flooded varzea forests occupied by Thelopodium clade, hyper-arid deserts occupied by Regmandra clade, and highly disturbed and dynamic open mid-elevation Andean montane habitats occupied by DulMo clade, where landslides are among the most common areas where many of the species are found (Knapp, 2013; Särkinen et al., 2018; Knapp et al., 2019).

Future studies with larger datasets will be able to carry out additional tests, such as the impact of using phylogenetic models that take into consideration the heterogeneity of molecular sequence evolution (Williams et al., 2021), as well as different data types (Romiguier et al., 2013; Reddy et al., 2017). Future studies will need to untangle how introgression and ILS are potentially affecting the patterns of genomic discordance observed here at different phylogenetic depths (Meleshko et al., 2021). Additional information about recombination, chromosome structure, and genomic size and evolution of Solanum will also be useful to clearly define coalescence genes in phylogenomic datasets, fundamental units in coalescent analyses which are rarely examined (Springer and Gatesy, 2018). Currently, information about genome evolution in Solanum is lacking, as only 62 species (5% of Solanum) are recorded in the plant DNA C-value database (Pellicer and Leitch, 2020), and 86 species (7% of Solanum) have been studied with chromosome banding and/or FISH techniques (Chiarini et al., 2018). Information about genome size is missing for lineages such as the Thelopodium and Regmandra clades and for the majority of species not directly related to major commercial crops.

Implications for biogeographical and morphological studies in Solanum

The idea that well-supported and fully bifurcating phylogenies are a requisite for evolutionary studies is built on the premise that such trees are the accurate way of representing evolution. The shift in systematics from “tree”- to “bush”-like thinking, where polytomies and reticulate patterns of evolution are considered as acceptable or real (Poczai, 2013; Mallet et al., 2016; Edelman et al., 2019), comes from the accumulation of studies finding similar unresolvable phylogenetic nodes, despite using different large-scale genomic sampling strategies and various analytical methods (Suh, 2016). Given the difficulty of resolving short internal branches in phylogenies and the rapid evolution of major clades in Solanum, it will be important to adopt methods that incorporate polytomies and networks to conduct biogeographical and morphological studies (Than et al., 2008; Solís-Lemus et al., 2017; Wen et al., 2018; Olave and Meyer, 2020; Lutteropp et al., 2021 [Preprint]).

In terms of biogeography, our inability to resolve relationships amongst the major lineages in Solanum, especially along the backbone of Solanum near the crown node, has implications for understanding the ancestral environment of Solanum and its major lineages. Uncertainty amongst the relationships of major clades does not change the hypothesis that the genus probably originated from South America and spread multiple times to Africa, Asia, Australia, North America, and Europe (Olmstead and Palmer, 1997; Echeverría-Londoño et al., 2020). The polytomy near the crown node of Solanum does, however, cast uncertainty on the specific region and habitat/biome within the South American continent in which the major clades originated. For example, the sister relationship of Regmandra and the Potato clade inferred by the Sanger supermatrix analysis suggests that the wild ancestors of both potato and tomato evolved from an ancestor adapted to survive in lomas deserts from coastal South America (Bennett, 2008; Figure 1). Yet, both nuclear and plastome phylogenomic datasets suggest that the Potato clade is more closely related to the DulMo clade found to occur in tropical montane and subtropical biomes (Figure 3).

The hard polytomy along the backbone of Solanum also has important implications for evolutionary biologists interested in trait evolution. Standard methods of trait evolution relying on bifurcating trees may incorrectly infer how traits evolve (Hahn and Nakhleh, 2016). The discordance between traits, gene trees, and species trees has been defined as hemiplasy (Avise and Robinson, 2008), and studies have shown that depending on the level of ILS present in the data, hemiplasy can lead to different interpretations of convergent evolution of traits across phylogenetic trees (Mendes et al., 2016). While broad mapping of morphological traits on a species-level phylogeny can help gain a rough understanding of phenotypic variation across clades, careful study of gene tree topologies in relation to a trait of interest is essential to gain an exact understanding of its evolutionary origin.

Our findings reflect results from recently published studies showing rapid morphological innovation coinciding with areas of strong phylogenomic discordance in different plant and animal groups (Parins-Fukuchi et al., 2021), where the signal of nuclear-plastome discordance corresponds to strong ecological diversification and morphological innovation across major clades in Solanum previously assigned to Clade I. The major clades involved in the nuclear-plastome discordance along Grade I show large differences in their ecology as well as morphology. Members of the Thelopodium, Regmandra, VANAns, Potato, and DulMo clades occupy a wide range of tropical, montane, and temperate habitats across South America, Africa, and Australia (Symon, 1994; Knapp, 2000; Bohs and Olmstead, 2001; Spooner et al., 2004, 2016, 2019; Bohs, 2005; Peralta et al., 2007; Bennett, 2008; Knapp, 2013; Knapp and Vorontsova, 2016; Tepe et al., 2016; Särkinen et al., 2018; Knapp et al., 2019). Morphology shows equally high polymorphism between these major clades across many traits, such as growth form, which varies from single-stemmed wand-like shrubs (Thelopodium clade), annual herbs (Regmandra, Potato, and Morelloid clade), woody climbers and shrubs (VANAns clade), and herbaceous vines rooting along nodes (Potato clade). Similar patterns are observed in inflorescence position and branching, corolla shape, stamen dimorphism, and anther shape showing the presence of high polymorphism in these clades of which only some was retained in Clade II (Hilgenhof et al., unpublished manuscript). Testing the idea that this phenotypic diversity is linked to ecological diversification will require the construction of detailed morphological and ecological datasets to test if this pattern holds up in more formal and rigorous analyses.


We demonstrate the stability of the majority of the clades defined within Solanum and uncover significant nuclear and nuclear-plastome discordance amongst relationships of major clades in Solanum based on the first phylogenomic study of the genus with wide species sampling. Three major polytomies are identified in Solanum based on the short branch lengths, gene concordance factor results, and polytomy tests. Two of these polytomies correspond to the biggest and most quickly diversifying lineages within Solanum (Leptostemonum and EHS clades). The third polytomy along the backbone of Solanum near the crown node involves reticulation and strong nuclear-plastome discordance and highlights great uncertainty in the relationships between the Potato, DulMo, Regmandra, and VANAns clades. This region of nuclear-plastome discordance corresponds with high ecological and morphological innovation and we argue that it is most likely due to ILS and rapid speciation based on current knowledge of genome evolution in Solanum. Future studies, even with full genome sequences and increased taxon sampling, might not be able to resolve the polytomy near the crown node of Solanum because of the pattern of high reticulation combined with short internodal branches and its older age. Data on genome size and chromosome structure of the earliest branching lineages in Solanum will be required to further explore the nature and causes of this hard polytomy. We argue that acknowledging and embracing polytomies and reticulation is crucial if we are to design research programs aimed at understanding the biology of large and rapidly radiating lineages, such as the large and economically important Solanum.


We thank Elliot Gardner for sharing scripts and advice on phylogenomic analyses with HybPiper, Royce Steeves for providing advice on DNA extraction for genome skimming, Felix Forest and Olivier Maurin for providing technical support and providing feedback on the manuscript, and João R. Stehmann, Thais Almeida, Paul Gonzáles, and Maria Baden who greatly contributed to fieldwork and sample acquisition. Finally, we would also like to thank the three reviewers, including Stacey Smith and William J. Baker, who provided constructive reviews and feedback that greatly improved the final version of this manuscript.


    This work was supported by the Fonds de recherche du Québec en Nature et Technologies postdoctoral fellowship and a grant from the Department of Biological Sciences of the University of Moncton to E.G., the Sibbald Trust fellowship to R.H., the Ceiba Foundation to A.O., CNPq Conselho Nacional de Desenvolvimento Científico e Tecnológico awards 479921/2010-54 and 427198/2016-0 and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior CAPES/FAPESPA award 88881.159124/2017-01 to L.L.G., NSF through grant DEB-0316614 “PBI Solanum: a worldwide treatment” to S.K. and L.B., the Calleva Foundation & Sackler Trust (Plant and Fungal Trees of Life Project at Kew), the LUOMUS Trigger and Systematics Research Fund to P.P., the OECD CRP and Eötvös Research Grant (MAEÖ−00074-002/2021). Field sampling was supported by the Northern Territory Herbarium (Palmerston, Northern Territory, Australia), and the David Burpee Endowment at Bucknell University (Lewisburg, Pennsylvania, USA) and National Geographic Society Northern Europe Award GEFNE49-12 (Peru, TS). Peruvian specimens were collected and sequenced under the permission of Ministerio de Agricultura, Dirección General Forestal y de Fauna Silvestre (collection permits 084-2012-AG-DGFFSDGEFFS and 096-2017-SERFOR/DGGSPFFS, and genetic resource permit 008-2014-MINAGRI-DGFFS/DGEFFS).


    E.G. designed and performed the analyses for the paper, with guidance from P.P., A.O., S.D., and T.S.; E.G. produced all figures, and wrote the manuscript, with major contributions from T.S., as well as P.P., S.D., S.K., and X.A. R.H. and T.S. helped in data gathering and analyses. All other authors contributed data to the main analyses. All authors read and contributed to the final version of the manuscript.


    Raw sequence data generated in this study are deposited in various archives, including GenBank (website: https://www.ncbi.nlm.nih.gov/genbank/) and the European Nucleotide Archive (website: https://www.ebi.ac.uk/ena/browser/home); full accession numbers are provided in Appendices S1, S2, and S3. In addition, the 10 species trees generated for this study, as well as the alignments used for the different phylogenetic analyses, including the concatenated Sanger supermatrix, the plastome dataset, and the target capture datasets (min04 and min20) are available via Data Dryad, at the following link: https://datadryad.org/stash/dataset/doi:10.5061/dryad.2v6wwpzpt.