Volume 99, Issue 2 p. 209-218
Hybridization and Introgression

Genomics of Compositae weeds: EST libraries, microarrays, and evidence of introgression

Zhao Lai

Department of Biology and Center for Genomics and Bioinformatics, Indiana University, Bloomington, Indiana 47405 USA

Nolan C. Kane

Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada

Alex Kozik

Department of Plant Sciences and Genome Center, University of California, Davis, California 95616 USA

Kathryn A. Hodgins

Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada

Katrina M. Dlugosch

Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona 85721 USA

Michael S. Barker

Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, Arizona 85721 USA

Marta Matvienko

Department of Plant Sciences and Genome Center, University of California, Davis, California 95616 USA

Qian Yu

Department of Biology and Center for Genomics and Bioinformatics, Indiana University, Bloomington, Indiana 47405 USA

Kathryn G. Turner

Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada

Stephanie Anne Pearl

Department of Plant Biology, University of Georgia, Athens, Georgia 30602 USA

Graeme D. M. Bell

Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada

Yi Zou

Department of Biology and Center for Genomics and Bioinformatics, Indiana University, Bloomington, Indiana 47405 USA

Chris Grassa

Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada

Alessia Guggisberg

Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada

Keith L. Adams

Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada

James V. Anderson

Biosciences Research Laboratory, USDA-ARS, 1605 Albrecht Boulevard, Fargo, North Dakota 58105-5674 USA

David P. Horvath

Biosciences Research Laboratory, USDA-ARS, 1605 Albrecht Boulevard, Fargo, North Dakota 58105-5674 USA

Richard V. Kesseli

Biology Department, University of Massachusetts, Boston, Massachusetts, USA

John M. Burke

Department of Plant Biology, University of Georgia, Athens, Georgia 30602 USA

Richard W. Michelmore

Department of Plant Sciences and Genome Center, University of California, Davis, California 95616 USA

Loren H. Rieseberg

Corresponding Author

Department of Biology and Center for Genomics and Bioinformatics, Indiana University, Bloomington, Indiana 47405 USA

Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada

Author for correspondence (e-mail: [email protected])
First published: 01 February 2012

The authors thank F. Bretagnolle, P. van Dijk, T. Gulya, J. Hierra, R. Hufbauer, L. Kiss, P. Kotanen, Y. Sapir, D. Lavelle, and G. Seiler for assistance in obtaining seed or tissue collections; and M. Stewart, M. Scascitelli, H. Luton, and K. Nurkowski for technical assistance. They also thank the sequencing and bioinformatics teams at the Joint Genome Institute, Genome Quebec, Indiana University's Center for Genomics and Bioinformatics and the David H. Murdock Research Institute for assistance with the generation and processing of the raw sequence data. National Science Foundation Awards 0421630 and 0820451 and Natural Sciences and Engineering Research Council of Canada Award 353026 provided funding.

Abstract

Premise of study: Weeds cause considerable environmental and economic damage. However, genomic characterization of weeds has lagged behind that of model plants and crop species. Here we describe the development of genomic tools and resources for 11 weeds from the Compositae family that will serve as a basis for subsequent population and comparative genomic analyses. Because hybridization has been suggested as a stimulus for the evolution of invasiveness, we also analyze these genomic data for evidence of hybridization.

Methods: We generated 22 expressed sequence tag (EST) libraries for the 11 targeted weeds using Sanger, 454, and Illumina sequencing, compared the coverage and quality of sequence assemblies, and developed NimbleGen microarrays for expression analyses in five taxa. When possible, we also compared the distributions of Ks values between orthologs of congeneric taxa to detect and quantify hybridization and introgression.

Results: Gene discovery was enhanced by sequencing from multiple tissues, normalization of cDNA libraries, and especially greater sequencing depth. However, assemblies from short sequence reads sometimes failed to resolve close paralogs. Substantial introgression was detected in Centaurea and Helianthus, but not in Ambrosia and Lactuca.

Conclusions: Transcriptome sequencing using next-generation platforms has greatly reduced the cost of genomic studies of nonmodel organisms, and the ESTs and microarrays reported here will accelerate evolutionary and molecular investigations of Compositae weeds. Our study also shows how ortholog comparisons can be used to approximately estimate the genome-wide extent of introgression and to identify genes that have been exchanged between hybridizing taxa.

Weedy and invasive plants cause considerable damage to the economy and environment. In North America, direct economic costs due to production losses in agriculture and forestry, as well as the cost of control measures, are estimated at $30–40 billion annually (55, 56; 52; 17). Environmental costs are more difficult to estimate monetarily, but they can be profound and include extinction of species, loss of biodiversity, and degradation of ecosystem services (79; 62; 22). While the ecological and financial damage caused by weeds has stimulated considerable research into the ecology and evolution of weeds and invasive plants, genomic characterization has lagged behind that of model plants or crop species (67).

Recent discussions of weed genomics have argued for the development of one or several model “weed” species that would be suitable for answering questions about weed biology and would serve to concentrate funding and intellectual efforts (8; 13; 67). However, the community of scientists who study weedy and invasive plants is diverse, as are the traits that apparently contribute to the success of weedy species. This diversity is to be expected given that weeds are often broadly defined as “plants that grow in disturbed areas” (30). Thus, it may not be appropriate to focus the efforts of the community of weed and invasive plant researchers on one or even a handful of weeds. The advent of “next-generation,” high-throughput sequencing technologies enables the identification of genetic changes that are frequently associated with the evolution of weedy and invasive plants, as well as those that are idiosyncratic. Such comparative genomic approaches exploit the diversity of weeds and invasive plants to answer questions about the ecological, evolutionary, and molecular mechanisms contributing to their successes.

The Compositae (Asteraceae) family is especially well suited for comparative genomic studies of weed evolution. The Compositae is one of the largest and most successful families of flowering plants (66), with close to 24000 named species that thrive in a great diversity of habitats, including some of the world's most inhospitable. Although the Compositae family contains several hundred economically valuable species (21), it is perhaps best known for its noxious weeds such as thistles, knapweeds, ragweeds, and dandelions. Indeed, the Compositae includes eight of the 20 worst weeds in North America (59). Also, 36 of 181 North American species that have been newly introduced and are potentially invasive in Europe come from the Compositae (26). While the traits associated with successful Compositae weeds vary across taxa (51), herbicide resistance (54) and growth-defense trade-offs (47) are commonly observed in weedy species in the family.

The two most economically important genera of the Compositae (Helianthus and Lactuca) are particularly interesting and complementary with regard to their reciprocal histories of domestication and the evolution of invasiveness. Sunflower was domesticated in North America, yet today 11 of the 49 species in the genus Helianthus (including H. annuus) are considered naturalized or invasive in Europe (18). Also, due to high levels of gene flow between cultivated and weedy sunflowers (4; 45), sunflower has been featured in debates about the role of crop–wild gene flow and transgene escape in the evolution of “super weeds” (12; 24; 64; 5; 20, 54). Conversely, lettuce was domesticated in the Mediterranean region, yet today Lactuca serriola L. (the progenitor of cultivated lettuce) and 11 other wild species of Lactuca have become established in North America (43).

Here we report on the development of genomic tools and resources for 11 Compositae weeds (Table 1): Ambrosia artemisiifolia L. (common ragweed), Ambrosia trifida L. (giant ragweed), Centaurea diffusa Lam. (diffuse knapweed), Centaurea stoebe subsp. micranthos L. (spotted knapweed), Centaurea solstitialis L. (yellow starthistle), Cirsium arvense (L.) Scop. (Canada thistle), Carthamus oxyacanthus M. Bieb. (jeweled distaff thistle), Helianthus annuus L. (common sunflower), Helianthus ciliaris DC. (Texas blueweed), L. serriola (prickly lettuce), and Taraxacum officinale F. H. Wigg (dandelion). Most of these taxa are native to Europe or Central Asia and are invasive in North America or elsewhere. However, three of the targeted weeds have a reciprocal history of invasion: common and giant ragweed and common sunflower are native to North America and naturalized elsewhere (31; 27). While the majority of the target weeds are diploid, outcrossing annuals, there are several exceptions. Prickly lettuce is a selfing annual or biennial (3). Canada thistle is an outcrossing perennial (41). Spotted knapweed and blueweed are perennials and have multiple ploidy levels (31; 10). Dandelions are perennial, have multiple ploidy levels, and produce asexual seeds through apomixis (72).

Table 1. Provenance information for Compositae weeds targeted in this study.
| Taxon | Common name | Collection locality | Collection ID |
| Ambrosia artemisiifolia L. | Common ragweed | Biatorbagy, Hungary (latitude 47.46, longitude 18.81) | HU1-11 |
| | | Russell, MN, USA (latitude 44.19, longitude −95.57) | AA8-20 |
| Ambrosia trifida L. | Giant ragweed | Jilin, Jilin, China (latitude 43.50, longitude 126.32) | GNV8ASA01 |
| | | Dengta, Liaoning, China (latitude 41.25, longitude 123.20) | GNV8ASA02 |
| | | Kampsville, IL, USA (latitude 39.16, longitude 90.37) | GNV8ASA03 |
| | | Bloomington, IN, USA (latitude 39.92, longitude 86.31) | GNV8ASA04 |
| Carthamus oxyacanthus M. Bieb. | Jeweled distaff thistle | 38 km north of crossroad just west of Mardin, Turkey | PI 407602 |
| Centaurea diffusa Lam. | Diffuse knapweed | Kirklarelì, Turkey (latitude 41.45, longitude 27.14) | DK TR001-1L |
| | | Roosevelt, WA, USA (latitude 45.44, longitude −120.12) | DK US022-31E |
| Centaurea stoebe subsp. micranthos L. | Spotted knapweed | Tetraploid genotype, Boston, MA, USA (latitude 42.29, longitude −71.04) | R. Kesseli - Cema #1A |
| Centaurea solstitialis L. | Yellow starthistle | Walnut Creek, CA, USA (latitude 37.95, longitude −122.05) | R. Kesseli - Ceso JH1 #1 |
| | | Santa Rosa, Argentina (latitude −37.39, longitude −64.08) | AR-13-24 |
| Cirsium arvense (L.) Scop. | Canada thistle | Female plant, Fargo, ND, USA (latitude 46.93, longitude −96.86) | NW-22-1-M |
| | | Male plant, Richmond Hill, ON, Canada (latitude 43.95, longitude −79.56) | Hodgins KN-ON |
| | | Female plant, Lugoj, Romania (latitude 45.65, longitude 21.95) | Guggisberg, Bretagnolle & Zeltner 280808-2 |
| Helianthus annuus L. | Common sunflower | Ma'ayan Tzvi, Israel (latitude 32.33, longitude 34.56) | Hann - ISI |
| | | Port Augusta, Australia (latitude 32.29, longitude 137.47) | SAW3, USDA PI 653594 |
| Helianthus ciliaris DC. | Texas blueweed | Tetraploid genotype, weed garden of the New Mexico State University Plant Science Research Center in Dona Ana County, NM, USA | L. Rieseberg - Hcil 1411 |
| Lactuca serriola L. | Prickly lettuce | Davis, CA, USA | US96UC23 |
| Taraxacum officinale F. H. Wigg | Dandelion | Apomictic triploid, Heteren, The Netherlands | P.J. van Dijk - Taof A68 |

While all of these weeds are able to colonize disturbed habitats such as cropland, abandoned fields, roadsides, and railroads, they vary in competitive ability and in their damage to the environment and to human health. Dandelion is a major lawn weed across the temperate world (55; 56). Knapweeds and thistles are rangeland weeds that have colonized and degraded millions of hectares of pastures and rangeland in western North America (44). Ragweeds are abundant colonizers of disturbed habitats across much of temperate North America, Europe, Asia, and Australia, and allergens produced by their pollen are the primary cause of hay fever (36; 15). Hay fever costs $3.5 billion per year in the United States alone in direct medical expenditures (68) and more than 10 times as much in lost workplace productivity (42).

The genomic tools and resources that we describe here are intended to serve as the basis for subsequent population and comparative genomic analyses. In addition, we report on the coverage of 454, Illumina, and Sanger cDNA libraries and compare the quality of the assemblies from data generated with these three platforms. Last, we describe evidence that hybridization is associated with the evolution of several of the weeds investigated here and provide a preliminary report on the kinds of genes that appear to have been exchanged between the hybridizing taxa.

MATERIALS AND METHODS

EST library development

Expressed sequence tag (EST) libraries were developed for one or more accessions of the 11 Compositae weeds targeted by this study (Table 1). RNA was isolated from a variety of tissues (Table 2) using either Trizol reagent (Invitrogen, Carlsbad, California, USA) or RNeasy Maxi (or Mini) kits (Qiagen, Valencia, California, USA), or a combination of the two methods. In the combined approach, the standard Trizol protocol was followed through the chloroform extraction step; 0.53 volumes of 100% ethanol were then added to the aqueous phase, the entire RNA/ethanol mixture was applied to an RNeasy Maxi (or Mini) column, and the Qiagen protocol was followed thereafter. Approximately equal amounts of total RNA isolated from each tissue type were pooled prior to EST library preparation.

Table 2. Plant tissues employed for EST library development.
Taxon Collection ID Roots Leaves Flower buds Mature flowers Fruits or seeds Seedlings Library / Sequence type a
Ambrosia artemisiifolia HU1-11 x N, SS / 454
AA8-20 x N, SS / 454
Ambrosia trifida GNV8ASA01 x x x x N, DS / 454
GNV8ASA02 x x x x N, DS / 454
GNV8ASA03 x x x x N, DS / 454
GNV8ASA04 x x x x N, DS / 454
Carthamus oxyacanthus PI 407602 x x x N, DS / 454
Centaurea diffusa DK TR001-1L x N, DS / 454
DK US022-31E x N, DS / 454
Centaurea stoebe subsp. micranthos Cema #1A x x x x x N, SF / Sanger
Centaurea solstitialis Ceso JH1 #1 x x x x x N, SF / Sanger
AR-13-24 x N, DS / 454
Cirsium arvense b NW-22-1-M x x x x N, DS / 454
Hodgins KN-ON x S / Illumina
Guggisberg et al., 280808-2 x S / Illumina
Helianthus annuus c Several cultivars x x x x x N / Sanger
Hann ISI x N, DS / 454
SAW3 x N / Illumina
Helianthus ciliaris Hcil 1411 x x x x N, SF / Sanger
Lactuca serriola US96UC23 x x x x x N / Sanger
US96UC23 x x x x x N / Illumina
Taraxacum officinale d Taof A68 x x x x x S, SF / Sanger
  • a N = normalized library; S = standard library; DS = double-stranded libraries; SS, SF = size-fractionated library.
  • b Illumina EST libraries for Cirsium arvense were generated as part of an analysis of gene regulatory evolution and are described in G. Bell et al. (unpublished manuscript).
  • c Sanger EST libraries for H. annuus previously described by 29.
  • d Leaves of Taraxacum officinale were from plants sprayed with salicylic acid (4 mmol/L in 0.1% Triton X-100) or jasmonic acid (50 mmol/L in 0.1% Triton X-100) to induce defense-related gene expression.

Several different methods were used to generate EST libraries as sequencing technologies advanced (Table 2). For Sanger sequencing, we prepared standard libraries using the SMART (Clontech, Palo Alto, California, USA) approach or normalized libraries with the TRIMMER-DIRECT cDNA Normalization Kit (Evrogen, Moscow, Russia). The cDNA samples from both the standard and normalized EST libraries were size-fractionated through agarose gels into three classes (0.5–1 kb, 1–2 kb, and 2–3 kb) to reduce biases due to size during the subsequent cloning and sequencing steps.

For 454 sequencing (454 Life Sciences, Branford, Connecticut, USA), we employed modified oligo-dT primers during cDNA synthesis to reduce the length of mononucleotide runs associated with the poly(A) tail of mRNA. Mononucleotide runs reduce sequence quality and quantity due to excessive light production and crosstalk between neighboring cells. For common ragweed, we used a “broken chain” short oligo-dT primer to prime the poly(A) tail of mRNA during first strand cDNA synthesis (49). cDNA was amplified and normalized with the TRIMMER-DIRECT cDNA Normalization Kit as above. The normalized cDNA was then prepared for sequencing following the standard genomic DNA shotgun protocol recommended by 454 Life Sciences. For cDNA synthesis of the other libraries, we used either the broken chain short oligo-dT primer described above or two different modified oligo-dT primers: one to prime the poly(A) tail of mRNA during first strand cDNA synthesis and another to further break down the stretches of poly(A) sequence during second strand cDNA synthesis (9). We then normalized and amplified the cDNA using the TRIMMER-DIRECT cDNA Normalization Kit as above. After normalization, cDNA was fragmented to 500- to 800-bp fragments by either sonication or nebulization and size-selected to remove small fragments using AMPure SPRI beads (Agencourt, Beverly, Massachusetts, USA). The fragmented ends were then polished and ligated with adaptors. The optimal ligation products were selectively amplified and subjected to two rounds of size selection including gel electrophoresis and AMPure SPRI bead purification (40).

For Illumina sequencing, we prepared standard libraries using the mRNA-Seq (Illumina, San Diego, California, USA) approach or normalized libraries using customized approaches. For L. serriola, cDNA was synthesized using the mRNA-Seq cDNA Synthesis Kit (Illumina) prior to normalization with the TRIMMER-DIRECT cDNA Normalization Kit (M. Matvienko et al., unpublished). For the remaining libraries sequenced with Illumina (Table 2), cDNA was synthesized using the SMART PCR cDNA Synthesis Kit (Clontech, Palo Alto, California, USA) and then normalized with the TRIMMER-DIRECT Kit. The normalized libraries were then prepared for sequencing as recommended by Illumina. After determination of fragment size distributions on a Bioanalyzer (Agilent Technologies, Santa Clara, California, USA) and of concentrations with PicoGreen (Invitrogen), libraries were diluted for real-time quantitative PCR and sequenced.

Processing and assembly of EST libraries

The Sanger EST libraries were sequenced using ABI 3730 machines (Life Technologies, Carlsbad, California, USA) at the Joint Genome Institute in Walnut Creek, California. Phred base calling, masking, trimming, and CAP3 assemblies (32) were conducted using the CGPdb bioinformatic pipelines (http://compgenomics.ucdavis.edu/index.php?link=tools; Compositae Genome Project, Genome Center, University of California-Davis). While the present paper provides the first published description of the development of these libraries and the accessions employed, the Sanger ESTs reported here were previously included in assemblies reported by 7.

The 454 EST libraries were sequenced on GS-FLX machines (454 Life Sciences) at the Indiana University Center for Genomics and Bioinformatics (http://cgb.indiana.edu/), the David H. Murdock Research Institute (DHMRI; http://www.dhmri.org/about.html), or Genome Quebec (http://www.genomequebec.com/v2009/home/) using the standard 454 Titanium chemistry. The 454 sequences were cleaned using the program SnoWhite version 1.1.4 (http://evopipes.net/snowhite.html) (6) or the program ESTclean (http://sourceforge.net/projects/estclean/). Cleaned sequences were initially assembled with the program MIRA version 3.0 (16), using the “accurate,est,denovo,454” assembly mode. However, because MIRA can, in our experience, be too aggressive in splitting up contigs with high coverage, we took the MIRA contigs and singletons and reassembled them with the program CAP3 at 94% identity (32).

The Illumina EST libraries were sequenced on Illumina GAII machines at the UC Davis Genome Center (http://www.genomecenter.ucdavis.edu/), Indiana University Center for Genomics and Bioinformatics, or at DHMRI. Illumina data were cleaned with customized scripts (http://code.google.com/p/atgc-illumina/) and assembled with the program CLC (http://www.clcdenovo.com/, CLC bio, Aarhus, Denmark) using the default settings or the program Trinity (http://trinityrnaseq.sourceforge.net/) using the Butterfly parameters --bfly_opts "--edge-thr=0.05 -V 5" to increase its ability to distinguish close paralogs.

Coverage offered by each of the assemblies was evaluated in terms of the number of unigenes, assembly length, and the proportion of ultra-conserved orthologs (UCOs) detected (Tables 3, 4) using the NCBI program blastx and an e-value threshold of 1e-10. The UCOs are 357 single-copy genes that are shared by Arabidopsis thaliana, humans, mice, yeast, fruit flies, and Caenorhabditis elegans (34). Assembly quality was evaluated by analyzing the proportion of recently duplicated paralogs in the assembly, as well as the percentage of UCOs with full-length transcripts. The proportion of recently duplicated paralogs was determined by analyzing duplicate gene age distributions using the DupPipe (6) pipeline described in 7. The rationale for this analysis is that assemblies of short reads or over-aggressive assemblies may fail to distinguish between recently diverged paralogs. The percentage of full-length transcripts was determined using the UCO hits, where transcripts were considered full-length if they covered greater than 80% of the annotated UCO protein and included start and stop codons.
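
The two UCO-based metrics described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the `UcoHit` record, its field names, and the choice to report full-length transcripts as a percentage of the detected UCOs (rather than of all 357) are assumptions made for illustration, and the hit records are assumed to have been parsed beforehand from blastx output.

```python
from typing import Iterable, NamedTuple, Tuple


class UcoHit(NamedTuple):
    """One blastx hit of a unigene against a UCO protein (hypothetical record)."""
    uco_id: str       # identifier of the UCO protein that was hit
    evalue: float     # blastx e-value of the hit
    s_start: int      # first aligned residue on the UCO protein (1-based)
    s_end: int        # last aligned residue on the UCO protein
    s_len: int        # full length of the UCO protein
    has_start: bool   # unigene contains a start codon
    has_stop: bool    # unigene contains a stop codon


def uco_summary(hits: Iterable[UcoHit],
                n_ucos: int = 357,
                max_evalue: float = 1e-10,
                min_cov: float = 0.80) -> Tuple[float, float]:
    """Return (% UCOs detected, % detected UCOs with a full-length transcript)."""
    detected = set()
    full_length = set()
    for h in hits:
        if h.evalue > max_evalue:
            continue  # hit does not pass the e-value threshold
        detected.add(h.uco_id)
        coverage = (abs(h.s_end - h.s_start) + 1) / h.s_len
        # Full length: covers more than 80% of the protein, with start and stop codons.
        if coverage > min_cov and h.has_start and h.has_stop:
            full_length.add(h.uco_id)
    pct_detected = 100.0 * len(detected) / n_ucos
    # Assumption: the denominator here is the set of detected UCOs.
    pct_full = 100.0 * len(full_length) / len(detected) if detected else 0.0
    return pct_detected, pct_full
```

A UCO is counted as detected if any unigene hits it below the e-value threshold, and as full-length if at least one of those unigenes satisfies the coverage and codon criteria.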

Table 3. ESTs and assembly statistics for Compositae weeds targeted by this study.
| Taxon | Collection ID | Sequence type a | No. reads | Total sequence (Mbp) | No. unigenes | Total assembly length (Mbp) | % UCOs b | % Full-length transcripts c | % Paralogs with Ks < 0.1 |
| Ambrosia artemisiifolia | HU1-11 | 454 | 701460 | 185 | 71179 | 38 | 85 | 8.8 | 40 |
| | AA8-20 | 454 | 616318 | 157 | 62936 | 33 | 83 | 7.8 | 36 |
| Ambrosia trifida | GNV8ASA01 | 454 | 609298 | 221 | 57285 | 38 | 91 | 21.6 | 50 |
| | GNV8ASA02 | 454 | 238943 | 95 | 28574 | 18 | 77 | 22.5 | 44 |
| | GNV8ASA03 | 454 | 206343 | 81 | 25378 | 16 | 72 | 23.5 | 40 |
| | GNV8ASA04 | 454 | 192795 | 77 | 25120 | 16 | 76 | 25.2 | 35 |
| Carthamus oxyacanthus | PI 407602 | 454 | 406005 | 125 | 27255 | 40 | 85 | 28.6 | 38 |
| Centaurea diffusa | DK TR001-1L | 454 | 407817 | 183 | 48936 | 31 | 77 | 20.2 | 67 |
| | DK US022-31E | 454 | 631874 | 308 | 61749 | 43 | 86 | 23.5 | 63 |
| Centaurea stoebe subsp. micranthos | Cema #1A | Sanger | 39957 | 29 | 20922 | 17 | 80 | 27.2 | 24 |
| Centaurea solstitialis | Ceso JH1 #1 | Sanger | 40406 | 30.5 | 22917 | 19 | 79 | 26.8 | 26 |
| | AR-13-24 | 454 | 649880 | 274 | 43503 | 32 | 92 | 30.6 | 56 |
| Cirsium arvense | NW-22-1-M | 454 | 3770510 | 1411 | 66269 | 61 | 99 | 23.7 | 66 |
| | Hodgins KN-ON | Illumina | 39316660 | 2988 | 54718 | 30 | 97 | 16.9 | 1.1 |
| | Guggisberg et al., 280808-2 | Illumina | 39411764 | 2995 | 46807 | 25 | 98 | 20.2 | 1.1 |
| Helianthus annuus | Hann ISI | 454 | 1132254 | 446 | 54124 | 37 | 93 | 54.8 | 56 |
| | SAW-3 | Illumina | 10630366 | 1063 | 37108 | 13 | 83 | 13.7 | 0.9 |
| | Several cultivars | Sanger | 93428 | 47.9 | 31605 | 18 | 66 | 20.4 | 32 |
| Helianthus ciliaris | Hcil 1411 | Sanger | 21589 | 16.6 | 14857 | 12 | 68 | 26.8 | 22 |
| Lactuca serriola | US96UC23 | Sanger | 55452 | 34.2 | 19877 | 14 | 67 | 24.4 | 32 |
| | US96UC23 | Illumina | 91048987 | 7539 | 66733 | 61 | 100 | 43.8 | 1 |
| Taraxacum officinale | Taof A68 | Sanger | 41278 | 29 | 15761 | 12 | 56 | 34.4 | 41 |
  • a Sanger assemblies previously reported in 7
  • b Percentage of ultra-conserved orthologs (UCOs) found in EST library. UCOs refer to 357 single-copy genes that are shared by Arabidopsis thaliana, humans, mice, yeast, fruit flies, and Caenorhabditis elegans (Kozik et al., 2008).
  • c Percentage of full-length transcripts calculated for UCOs.

Table 4. Comparison of de novo assemblies of Illumina sequence data.
| Taxon | Collection ID | Assembler | No. unigenes | Total assembly length (Mbp) | % UCOs a | % Full-length transcripts b | % Paralogs with Ks < 0.1 |
| Cirsium arvense | Hodgins KN-ON | CLC | 54718 | 30 | 97 | 16.9 | 1.1 |
| | | Trinity | 60610 | 35 | 98 | 21.7 | 56.3 |
| | Guggisberg et al., 280808-2 | CLC | 46807 | 25 | 98 | 22.00 | 1.1 |
| | | Trinity | 65276 | 46 | 96 | 30 | 54.1 |
| Helianthus annuus | SAW-3 | CLC | 37108 | 13 | 83 | 13.7 | 0.9 |
| | | Trinity | 45804 | 20 | 87 | 21.4 | 65.8 |
| Lactuca serriola | US96UC23 | CLC | 66733 | 61 | 100 | 43.8 | 1.0 |
| | | Trinity | 68204 | 62 | 100 | 44.7 | 47.5 |
  • a Percentage of ultra-conserved orthologs (UCOs) found in EST library.
  • b Percentage of full-length transcripts calculated for UCOs.

Detection of hybridization

For several genera (Ambrosia, Centaurea, Helianthus, Lactuca), we have EST libraries from multiple taxa that frequently co-occur and potentially hybridize. To test for hybridization, we identified orthologs between all congeneric taxa using reciprocal best hits, as in 33. The distribution of Ks values (number of synonymous substitutions per synonymous site) for orthologs should be centered around a Ks value corresponding to the time since the most recent common ancestor of the taxa involved. However, a secondary peak at a lower Ks value can be attributed to more recent gene flow (75). We identified significant peaks in the ortholog Ks distribution using SiZer (14). The number of significant peaks in the range 0 < Ks < 0.1 was inferred with the maximum-likelihood approach in the EMMIX (48) package. The optimal number of peaks was inferred as the model that minimizes the Bayesian information criterion (BIC).
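
The model-selection step can be sketched as follows. This is a plain Python stand-in rather than EMMIX itself: it fits one-dimensional Gaussian mixtures to ortholog Ks values by expectation-maximization and selects the number of components with the lowest BIC. The function names, the quantile-based initialization, the fixed iteration count, and the `min_sigma` floor (used here to stop a component from collapsing onto a single value) are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Params = List[Tuple[float, float, float]]  # (weight, mean, sigma) per component


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_mixture(values: Sequence[float], k: int,
                n_iter: int = 200, min_sigma: float = 1e-3) -> Tuple[float, Params]:
    """Fit a k-component Gaussian mixture by EM; return (log-likelihood, params)."""
    xs = sorted(values)
    n = len(xs)
    # Start the means at evenly spaced quantiles with a shared spread.
    spread = max((xs[-1] - xs[0]) / (2.0 * k), min_sigma)
    params: Params = [(1.0 / k, xs[int((i + 0.5) * n / k)], spread) for i in range(k)]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each Ks value.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, s) for w, m, s in params]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and sigma of each component.
        new_params: Params = []
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            mu = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, xs)) / nj
            new_params.append((nj / n, mu, max(math.sqrt(var), min_sigma)))
        params = new_params
    loglik = sum(
        math.log(sum(w * normal_pdf(x, m, s) for w, m, s in params) or 1e-300)
        for x in xs
    )
    return loglik, params


def bic(loglik: float, k: int, n: int) -> float:
    # Free parameters: k means + k sigmas + (k - 1) mixing weights.
    return -2.0 * loglik + (3 * k - 1) * math.log(n)


def choose_n_peaks(values: Sequence[float], max_k: int = 3) -> Tuple[int, Dict[int, float]]:
    """Return the number of components with the lowest BIC, plus all BIC scores."""
    n = len(values)
    scores = {k: bic(fit_mixture(values, k)[0], k, n) for k in range(1, max_k + 1)}
    return min(scores, key=scores.get), scores
```

Calling `choose_n_peaks(ks_values)` on the Ks values in the range 0 < Ks < 0.1 returns the preferred number of peaks. Under the rationale above, a result greater than one, with a component centered at a lower Ks than the main divergence peak, would be consistent with recent gene flow.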

Gene Ontology (GO) categorization was performed on the genes found in introgressed and nonintrogressed peaks from the EMMIX analysis, using blastx searches with an e-value threshold of 1e-10 against TAIR10 proteins (http://www.arabidopsis.org/). We tested for differences in GO annotations using χ² tests with P values computed from 100000 Monte Carlo simulations in the program R (R Development Core Team, 2008). Major contributors to significant χ² tests (P < 0.05) were identified as in 7, using residuals with absolute values greater than 2.
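
A Monte Carlo χ² test of this kind can be sketched in Python as below. This is an illustrative stand-in, not the R code used in the study: it simulates tables with the observed row and column totals by shuffling column labels, reports the P value as (1 + number of simulated statistics at least as large as the observed one) / (1 + number of simulations), and returns Pearson residuals. Whether the residuals used in the original analysis were Pearson or standardized residuals is not stated, so that choice, like the function names and the seed, is an assumption.

```python
import math
import random
from typing import List, Tuple

Table = List[List[int]]


def chi2_stat(table: Table) -> Tuple[float, List[List[float]]]:
    """Return the chi-square statistic and the Pearson residual of every cell."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            stat += (obs - exp) ** 2 / exp
            res_row.append((obs - exp) / math.sqrt(exp))
        residuals.append(res_row)
    return stat, residuals


def monte_carlo_p(table: Table, n_sim: int = 2000, seed: int = 1) -> float:
    """Monte Carlo P value, holding row and column totals fixed (all must be > 0)."""
    rng = random.Random(seed)
    observed, _ = chi2_stat(table)
    # Expand the table into one (row label, column label) pair per gene.
    rows = [i for i, r in enumerate(table) for c in r for _ in range(c)]
    cols = [j for r in table for j, c in enumerate(r) for _ in range(c)]
    n_rows, n_cols = len(table), len(table[0])
    as_extreme = 0
    for _ in range(n_sim):
        rng.shuffle(cols)  # break any association while keeping both margins
        sim = [[0] * n_cols for _ in range(n_rows)]
        for i, j in zip(rows, cols):
            sim[i][j] += 1
        if chi2_stat(sim)[0] >= observed - 1e-9:
            as_extreme += 1
    return (as_extreme + 1) / (n_sim + 1)
```

Here rows would correspond to the introgressed and nonintrogressed gene sets and columns to GO categories; cells whose residual exceeds 2 in absolute value would be flagged as the main contributors to a significant test.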

Microarray development

In addition to the analysis of hybridization, we employed the EST libraries from six taxa (common ragweed, diffuse knapweed, spotted knapweed, yellow starthistle, Canada thistle, and common sunflower) to develop high-density expression microarrays in collaboration with Roche NimbleGen (Madison, Wisconsin, USA) (Table 5). The microarrays were developed to investigate expression differences associated with the evolution of weedy and invasive genotypes in different Compositae weeds. However, they should be useful for a wide range of ecological, evolutionary, and molecular studies of Compositae weeds and their wild relatives.

Table 5. Microarrays developed for Compositae weeds targeted by this study.
| Taxon | Platform | Collection ID | No. unigenes | No. features |
| Ambrosia artemisiifolia | 12-plex | HU1-11 | 45063 | 134996 |
| Centaurea diffusa | 12-plex | DK TR001-1L | 61024 | 136906 |
| Centaurea solstitialis | 4-plex | Multiple genotypes | 34343 | 68686 |
| Cirsium arvense | 12-plex | NW-22-1-M | 63690 | 136582 |
| Helianthus annuus | 4-plex | Several cultivars | 33376 | 68400 |
| Helianthus annuus | 12-plex | Hann ISI | 48683 | 136454 |

The NimbleGen high-density customized expression microarray service offers transcript-based probe design with long, isothermal probes. After the masking of repetitive elements, 2 or 3 unique probes were designed per unigene, with the remaining space on the array (usually less than 5%) filled with random probes for background correction. Both 4-plex and 12-plex expression microarray platforms were developed (Table 5). The platforms differ in the number of hybridizations that can be performed per array (4 vs. 12), as well as the number of probes per plex (72000 vs. 135000).

For common ragweed, diffuse knapweed, and Canada thistle, 12-plex arrays were developed using the transcriptome of a genotype from the invasive range of each species. The numbers of probes and unigenes chosen for array development are given in Table 5. Unigenes were mainly chosen for inclusion based on the quality, length, and uniqueness of sequence, but for ragweed we enriched slightly for stress-related transcripts.

For yellow starthistle, we developed a 4-plex array based on 24545 unigenes from an invasive genotype of yellow starthistle and 9798 unigenes from an invasive genotype of spotted knapweed (Table 5). Two probes were chosen per contig.
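
As a quick consistency check, the yellow starthistle array design can be worked through in a few lines of Python. All figures are taken from the text and Table 5; the variable names are only illustrative.

```python
# Unigene counts for the yellow starthistle 4-plex array (from the text).
STARTHISTLE_UNIGENES = 24545
KNAPWEED_UNIGENES = 9798
PROBES_PER_CONTIG = 2
PROBES_PER_PLEX_4 = 72000  # stated capacity of one plex on the 4-plex platform

total_unigenes = STARTHISTLE_UNIGENES + KNAPWEED_UNIGENES
total_features = total_unigenes * PROBES_PER_CONTIG
# Share of the plex left over for random background-correction probes.
leftover_fraction = 1 - total_features / PROBES_PER_PLEX_4

print(total_unigenes, total_features, round(leftover_fraction, 3))
```

The totals agree with the Centaurea solstitialis row of Table 5 (34343 unigenes, 68686 features), and the leftover share of about 4.6% is in line with the statement that random probes usually fill less than 5% of the array.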

Last, for common sunflower, we developed both 4-plex and 12-plex arrays (Table 5). The 4-plex expression array was based on a Sanger transcriptome assembly of cultivated sunflower ESTs, whereas the 12-plex expression array was based on the 454 titanium transcriptome assembly from a weedy genotype collected outside of the native range of the species.

Databases

The National Center for Biotechnology Information (NCBI) recently announced that it might discontinue its Sequence Read Archive (SRA) and Trace Archive repositories for high-throughput sequencing data and that only assemblies will be archived in the future. This news is troubling because access to the raw reads will be needed for many studies in population and evolutionary genomics. Therefore, if the SRA is not continued at the NCBI or elsewhere, both the raw data and assemblies generated by this study will be archived on the Compositae Genome Project Database (http://compgenomics.ucdavis.edu/). The raw Sanger reads as well as the reference assemblies for all 22 EST libraries have already been submitted to and are accessible from GenBank and/or Dryad (see Table 6 for details).

Table 6. Accession numbers for EST libraries reported in this study.
| Taxon | Collection ID | GenBank accessions or doi a |
| Ambrosia artemisiifolia | HU1-11 | doi:10.5061/dryad.cm7td/12 |
| | AA8-20 | doi:10.5061/dryad.cm7td/3 |
| Ambrosia trifida | GNV8ASA01 | doi:10.5061/dryad.cm7td/7 |
| | GNV8ASA02 | doi:10.5061/dryad.cm7td/8 |
| | GNV8ASA03 | doi:10.5061/dryad.cm7td/9 |
| | GNV8ASA04 | doi:10.5061/dryad.cm7td/10 |
| Carthamus oxyacanthus | PI 407602 | doi:10.5061/dryad.cm7td/15 |
| Centaurea diffusa | DK TR001-1L | doi:10.5061/dryad.cm7td/5 |
| | DK US022-31E | doi:10.5061/dryad.cm7td/6 |
| Centaurea stoebe subsp. micranthos | Cema #1A | GI:124612349-124626419 |
| Centaurea solstitialis | Ceso JH1 #1 | GI:124655902-124696404 |
| | AR-13-24 | doi:10.5061/dryad.cm7td/4 |
| Cirsium arvense | NW-22-1-M | doi:10.5061/dryad.cm7td/14 |
| | Hodgins KN-ON | doi:10.5061/dryad.cm7td/13 |
| | Guggisberg et al., 280808-2 | doi:10.5061/dryad.cm7td/2 |
| Helianthus annuus | Hann ISI | doi:10.5061/dryad.cm7td/11 |
| | SAW-3 | doi:10.5061/dryad.cm7td/1 |
| | Several cultivars a | See below b |
| Helianthus ciliaris | Hcil 1411 | GI:125400397-125421999 |
| Lactuca serriola | US96UC23 (Sanger library) | GI:83901317-83921492; 22397573-22415583; 22430445-22449769 |
| | US96UC23 (Illumina library) | JO020427 - JO087153 |
| Taraxacum officinale | Taof A68 | GI:90246684-90345856 |
  • a Transcriptome assemblies uploaded to Dryad data repository, available at http://dx.doi.org/10.5061/dryad.cm7td; doi = digital object identifier.
  • b Sanger EST libraries for H. annuus previously described by 29, available at website http://cgpdb.ucdavis.edu/asteraceae_assembly/data_sequence_files/GB_ESTs_Feb_2007.sp.Heli_annu.clean.fasta.

RESULTS AND DISCUSSION

EST sequencing

The sequencing of EST libraries provides a relatively inexpensive means for sampling transcribed genes from any given tissue or organism. As a consequence, EST sequencing has been the primary entry point for genomic studies of nonmodel organisms (11; 71). EST sequence data have a broad array of applications ranging from gene discovery and annotation (65; 2), to molecular marker development (39; 23; 29), to gene expression analyses, whether directly through sequencing (63) or indirectly through microarray development (37, 38).

The most appropriate strategy for EST library development and sequencing depends on several factors, including the planned use of the library, whether a reference genome exists for the taxon being studied, and the financial resources available (11; 46; 74; 76). In this study, the main purpose of EST sequencing was gene discovery in the targeted weeds. As a consequence, in many instances, we isolated RNA from multiple tissue types and normalized the libraries to increase the likelihood of sampling rare transcripts (Tables 2, 3). Also, because sequencing technology has changed dramatically over the past decade, we have employed several different sequencing platforms, which vary in read length, types and rates of sequencing error, and cost per base pair (46; 70). Therefore, we can assess the value of increased tissue sampling and normalization relative to sequence depth for gene discovery, as well as potential trade-offs between sequencing depth and read length in the development of de novo transcriptome assemblies.

As commonly reported for other systems (53; 29; 74), sequencing from multiple tissue types and from normalized libraries did enhance gene discovery in the weeds targeted by this study (Tables 2, 3). The advantage of sequencing from multiple tissues, however, appears to be surprisingly modest. For example, the percentage of ultra-conserved orthologs increased from 83% and 85% in libraries of common ragweed that were developed from leaf tissue to 91% in the GNV8ASA01 library from giant ragweed, which was sequenced to approximately the same depth, but included RNA from four tissues (Table 3). Normalization had a larger effect, with an increase in the fraction of ultra-conserved orthologs detected from 56% in a standard library of dandelion to 80% and 79% for normalized libraries of similar depth for spotted knapweed and yellow starthistle (Ceso JH1 #1), respectively (Table 3).

As expected, greater sequencing depth (Table 3) was correlated with detection of a higher fraction of ultra-conserved orthologs (Pearson's r = 0.59; df = 20; P = 0.002). Likewise, sequencing depth was strongly correlated with total assembly length for the Sanger and 454 libraries (r = 0.80; df = 17; P < 0.001), but this correlation was somewhat weaker when the Illumina data were included (r = 0.53; df = 20; P = 0.008), presumably because of variation in the quality of the de novo assemblies of Illumina data (see below).
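The correlations above can be recomputed from per-library values with a few lines of code. The following is a minimal sketch in plain Python, assuming one depth value and one ortholog fraction per library; the function names are hypothetical, and the software used for the reported statistics is not stated in this section. With one observation per library, df = n − 2, which is consistent with df = 20 for 22 libraries.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length (at least 3)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, df):
    """t statistic for testing r against zero, with df = n - 2."""
    return r * math.sqrt(df / (1.0 - r * r))
```

The t statistic would then be compared against a t distribution with the same df to obtain a P value; that step is omitted here because it requires a statistics library.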

While next-generation sequencing platforms provide a low cost method for obtaining large quantities of transcriptome data for nonmodel organisms, concerns have been expressed about the quality of de novo assemblies deriving from these platforms (35; 61; 69), especially the failure to distinguish between close paralogs (6) and to assemble full-length transcripts (28). While paralog discrimination may not be a major issue for molecular biologists, it is critical for population genomic studies and evolutionary analyses, such as the detection of whole genome duplications (7). As a consequence, 454 sequencing, which generates read lengths of 300–500 bp, has often been employed for the development of reference transcriptomes for nonmodel organisms (71; 54; 57), despite its much greater expense when compared to the Illumina or ABI SOLiD platforms.

Our initial assembly results were generally consistent with these earlier observations. Assemblies of Sanger and 454 reads successfully distinguished between close paralogs, as measured by the proportion of duplicate genes with Ks < 0.1 (range = 22–67%, mean = 44%; Table 3). In contrast, our Illumina mRNA-Seq assemblies with CLC failed to resolve close paralogs, with the percentage of duplicates with Ks < 0.1 averaging 1.0% (Table 3). However, CLC and many other short read assemblers were developed for whole genome assemblies and are not optimal for the assembly of transcriptomes, which are expected to include huge variation in transcript coverage, as well as multiple kinds of transcripts per locus due to alternative splicing. Trinity, a recently published assembler program designed specifically for transcriptome data, is claimed to solve many of these issues (28). Our preliminary assemblies of Illumina transcriptome data indicate that close paralogs are resolved as claimed and that the program is more effective than aggressive assemblers such as CLC at recovering full-length transcripts (Table 4). Thus, it might be that the longer reads generated by Sanger or 454 are no longer necessary to generate reference transcriptomes.
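The paralog-resolution metric used above, the proportion of duplicate gene pairs with Ks < 0.1, is simple to compute once Ks has been estimated for each duplicate pair. A minimal sketch, assuming the Ks values are already available as a list of floats (the function name and the example values are hypothetical):

```python
def fraction_recent_duplicates(ks_values, threshold=0.1):
    """Fraction of duplicate gene pairs whose Ks is below the threshold."""
    if not ks_values:
        raise ValueError("no Ks values supplied")
    recent = sum(1 for ks in ks_values if ks < threshold)
    return recent / len(ks_values)


# Hypothetical Ks values for four duplicate pairs in one assembly.
example_ks = [0.02, 0.05, 0.30, 0.80]
example_fraction = fraction_recent_duplicates(example_ks)
```

An assembly that collapses close paralogs into a single contig loses the low-Ks pairs, so this fraction drops toward zero.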

Detection of hybridization

We tested several pairs of taxa for significant evidence of hybridization and introgression: common ragweed–giant ragweed, diffuse knapweed–spotted knapweed, common sunflower–cultivated sunflower, and prickly lettuce–cultivated lettuce. For the two ragweed species, and for wild sunflower, EST libraries were available for multiple accessions, which allowed us to test whether genomic patterns of hybridization and introgression, if it occurred, were similar across multiple contact zones.

For the ragweed and lettuce comparisons, only a single peak was observed in the Ks range examined, corresponding to the divergence between the two taxa. The two ragweed species showed a single, broad peak centered at Ks = 0.033 ± 0.02 (Fig. 1A), regardless of populations compared, while the two lettuce species had a single peak at Ks = 0.08 ± 0.005. This is the pattern expected if there has been no hybridization or introgression. The lack of evidence of introgression in the giant ragweed populations was surprising, since we identified three plants in one of the invasive populations (GNV8ASA01) that were intermediate in morphology and genome size between common and giant ragweed (Q. Yu, unpublished data). However, pollen tube growth rates of hybrid pollen are greatly reduced in this cross (73). Thus, hybrid pollen is likely to be outcompeted by parental pollen, perhaps accounting for the apparent lack of backcrossing and introgression between the two species in nature.

Figure 1. Ks distributions and fitted normal curves for all ortholog pairs from the EMMIX analysis for four representative taxa: (A) Common vs. giant ragweed (AA8-20 vs. GNV8ASA01). (B) Diffuse knapweed from invasive range vs. spotted knapweed (DKUS022-31E vs. Cema #1A). (C) Weedy vs. domesticated sunflower (SAW3 vs. cultivars). (D) Diffuse knapweed from native range vs. spotted knapweed (DKTR001-1L vs. Cema #1A). (E) Weedy vs. domesticated sunflower (Hann - ISI vs. cultivars). (F) Prickly vs. domesticated lettuce (US96UC23 vs. cultivars).

In contrast, both the knapweed and sunflower comparisons showed two strongly significant peaks, indicating that introgression has altered the Ks distribution from that expected for divergence without gene flow. Diffuse knapweed from the invasive range had peaks at Ks = 0.012 ± 0.003 and Ks = 0.033 ± 0.008 (Fig. 1B), comprising 24% and 37%, respectively, of all ortholog pairs, while the native sample had peaks at Ks = 0.010 ± 0.003 and Ks = 0.026 ± 0.007 (Fig. 1D), comprising 20% and 26%, respectively, of all ortholog pairs. Note that the first peak in each comparison likely results from introgression, whereas the second corresponds to the divergence of the two species. Thus, the extent of introgression appears to be greater in the diffuse knapweed from the invasive than native range, as previously reported by 10 based on AFLP data. However, the introgression reported here likely occurred between diploid genotypes of the two species prior to their invasion of North America. In North America, the two species differ in ploidy, which appears to limit ongoing introgression (10). Thus, highly introgressed genotypes of diffuse knapweed appear to have colonized North America.
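The idea of separating an introgression peak from a divergence peak can be illustrated by fitting a mixture of normal distributions to a set of Ks values. The sketch below is not EMMIX and not the authors' pipeline. It is a bare-bones expectation-maximization fit of exactly two normal components in plain Python, run on simulated data. The simulated component means and standard deviations are borrowed from the invasive diffuse knapweed comparison purely for illustration, and the sample sizes, seed, and function names are assumptions.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_normals(values, n_iter=200):
    """EM fit of a two-component normal mixture.

    Returns [(weight, mean, sd), (weight, mean, sd)] sorted by mean.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    # Initialize the two means from the lower and upper halves of the data.
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    grand_mean = sum(xs) / n
    grand_sd = math.sqrt(sum((x - grand_mean) ** 2 for x in xs) / n)
    sds = [grand_sd, grand_sd]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: probability that each value belongs to component 0.
        resp = []
        for x in xs:
            p0 = weights[0] * normal_pdf(x, means[0], sds[0])
            p1 = weights[1] * normal_pdf(x, means[1], sds[1])
            total = p0 + p1
            resp.append(p0 / total if total > 0 else 0.5)
        # M-step: update weight, mean, and sd of each component.
        for k in range(2):
            r_k = resp if k == 0 else [1.0 - r for r in resp]
            n_k = max(sum(r_k), 1e-12)
            weights[k] = n_k / n
            means[k] = sum(r * x for r, x in zip(r_k, xs)) / n_k
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(r_k, xs)) / n_k
            sds[k] = max(math.sqrt(var), 1e-6)
    return sorted(zip(weights, means, sds), key=lambda c: c[1])


def simulate_ks(seed=1):
    """Simulated Ks values: a low 'introgression' peak and a higher
    'divergence' peak. All numbers here are illustrative assumptions."""
    rng = random.Random(seed)
    low = [rng.gauss(0.012, 0.003) for _ in range(400)]
    high = [rng.gauss(0.033, 0.008) for _ in range(600)]
    return low + high


components = fit_two_normals(simulate_ks())
```

A real analysis would also need to choose the number of components, for example by comparing fits with one, two, or more components. This sketch fixes the number at two and therefore cannot by itself distinguish the single-peak pattern from the two-peak pattern.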

The GO analysis showed significant differences in the function of genes in the introgression peaks of the knapweed samples, especially in the invasive sample. In the invasive comparison, proteins involved in development are overrepresented; proteins targeted to chloroplast or “unknown” are underrepresented; those targeted to ER, extracellular processes, and ribosome are overrepresented; proteins with functions as hydrolase or transferase are underrepresented; and transcription factors, receptors, other membrane proteins, or protein kinases are overrepresented. In the native range, the significant differences are limited to “other cellular components” and are much less significant, probably because the introgression peak in the native range comparison is smaller and less well defined.
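Over- or underrepresentation of a GO category among genes in the introgression peak can be assessed by comparing category counts in the peak with counts among all ortholog pairs. The statistical test used for the GO comparison is not specified in this section. As one common choice, and only as an assumption, the sketch below computes a hypergeometric upper-tail probability for overrepresentation; the function and argument names are hypothetical.

```python
import math


def hypergeom_upper_tail(total, total_in_category, sample, sample_in_category):
    """P(X >= sample_in_category) when `sample` genes are drawn without
    replacement from `total` genes, of which `total_in_category` belong
    to the GO category of interest."""
    denom = math.comb(total, sample)
    upper = min(total_in_category, sample)
    numer = sum(
        math.comb(total_in_category, i)
        * math.comb(total - total_in_category, sample - i)
        for i in range(sample_in_category, upper + 1)
    )
    return numer / denom
```

Here `total` would be the number of annotated ortholog pairs, `sample` the number falling in the introgression peak, and the two category counts the numbers carrying a given GO term. A small probability suggests overrepresentation; underrepresentation would use the corresponding lower tail. Testing many GO categories at once would also call for a multiple-comparison correction.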

Common sunflower from Australia had signs of introgression from domesticated sunflower, with peaks at Ks = 0.009 ± 0.003 as well as Ks = 0.021 ± 0.007 (Fig. 1C), comprising 18% and 29%, respectively, of all ortholog pairs. Weedy sunflower from Israel showed a less pronounced but still significant peak at 0.011 ± 0.003 as well as 0.026 ± 0.006, comprising 14% and 36%, respectively, of all ortholog pairs. These results are consistent with reports based on analyses of microsatellites that weedy sunflowers from outside North America may have a crop–wild ancestry (50). No significant biases in introgression patterns were detected by GO analyses in either of the sunflower comparisons, which is consistent with the lack of reproductive barriers between wild and domesticated populations of common sunflower.

An important caveat in the interpretation of these results is that they are based on comparisons between individual genotypes, which may not be representative of the population or taxon as a whole. Also, Ks comparisons between individual genotypes are noisy because they do not account for variation due to the coalescent or evolutionary rate heterogeneity. Thus, future analyses would be stronger if a population approach were taken, but this approach has been cost prohibitive until very recently. Nonetheless, our results demonstrate the power of this approach for detecting hybridization and introgression and for studying the kinds of genes that are most likely to introgress. By analyzing thousands of genes, robust conclusions can be made from noisy data.

Tools for analyses of gene expression and regulation

One of the main motivations for the EST sequencing reported here was to generate reference transcriptomes for Compositae weeds that could be used for studies of gene expression and regulation. Such studies are underway for five Compositae weeds (common ragweed, Canada thistle, yellow starthistle, diffuse knapweed, and common sunflower) and exploit the reference transcriptomes (Tables 2–4) and NimbleGen microarrays (Table 5) reported here. The 12-plex NimbleGen expression arrays represent an especially cost-effective strategy for population studies of expression variation, since only a handful of arrays are required for comprehensive analyses of gene expression patterns. Nonetheless, with the reduction in sequencing prices, it has become more feasible to study expression by deep sequencing of the transcriptome (63) in nonmodel organisms. However, even sequence-based studies of gene expression will require a reference transcriptome (or fully sequenced genome) for analyses, so the resources reported here will continue to be useful for expression studies of Compositae weeds.

CONCLUSIONS

We have generated EST resources and microarrays for 11 Compositae weeds, which we hope will facilitate studies of the origin and evolution of Compositae weeds, as well as the molecular basis of weedy traits in this group such as herbicide resistance (e.g., 54) or growth-defense trade-offs (e.g., 47). The resources presented were developed over 11 years, mainly by the Compositae Genome Project (http://compgenomics.ucdavis.edu/), and thus also demonstrate how strategies have been continuously refined to exploit advances in high-throughput sequencing and computational biology. Most recently, the development of the Trinity de novo assembler of short-read transcriptome data may allow reference-quality transcriptomes to be developed from very low-cost Illumina or ABI SOLiD sequence data (28), which could greatly reduce the cost of entry for genomic studies of nonmodel organisms.

Our study also demonstrates the utility of ortholog comparisons for identifying hybridization and quantifying the extent of introgression (33). While several authors have discussed the apparent association between hybridization and invasiveness (1; 25; 60), it generally is not clear whether hybridization is a cause or consequence of range expansions (although see 77, 78). Analyses of genomic data provide a sensitive and robust approach for detecting hybridization and introgression and for investigating whether introgressed variants have contributed to adaptive changes in weeds and invasive plants.