Fluidigm2PURC: Automated processing and haplotype inference for double‐barcoded PCR amplicons

Premise of the Study Targeted enrichment strategies for phylogenomic inference are a time‐ and cost‐efficient way to collect DNA sequence data for large numbers of individuals at multiple, independent loci. Automated and reproducible processing of these data is a crucial step for researchers conducting phylogenetic studies. Methods and Results We present Fluidigm2PURC, an open source Python utility for processing paired‐end Illumina data from double‐barcoded PCR amplicons. In combination with the program PURC (Pipeline for Untangling Reticulate Complexes), our scripts process raw FASTQ files for analysis with PURC and use its output to infer haplotypes for diploids, polyploids, and samples with unknown ploidy. We demonstrate the use of the pipeline with an example data set from the genus Thalictrum (Ranunculaceae). Conclusions Fluidigm2PURC is freely available for Unix‐like operating systems on GitHub (https://github.com/pblischak/fluidigm2purc) and for all operating systems through Docker (https://hub.docker.com/r/pblischak/fluidigm2purc).

amplification. Double-barcoded amplicons can also be generated by other means using approaches such as traditional or highly multiplexed PCR (e.g., Bybee et al., 2011;Dupuis et al., 2017).
Previous methods to analyze these data have typically relied on generating consensus sequences using software packages such as Geneious (Kearse et al., 2012;e.g. Gostel et al., 2015), HiMAP (Dupuis et al., 2017), or an R script, reduce_amplicons.R, that is part of the dbcAmplicons package (but see comparison with "occurrencebased" methods in dbcAmplicons in the "Example Analysis" section; Uribe-Convers et al., 2016;Kates et al., 2017). However, using consensus sequences can often ignore important within-individual level variation, such as differing alleles or levels of ploidy. To alleviate this issue and to facilitate the analysis of these data for haplotype inference, we developed Fluidigm2PURC. Fluidigm2PURC consists of two main Python scripts that process input data files using several external programs ( Table 1) that automate quality filtering, read merging, and file formatting for downstream steps (Fig. 1). Although it can be used to process any double-barcoded amplicons, the software derives its name from the method of PCR amplification that we used to generate our data (Fluidigm Access Array), as well as its primary dependency, PURC, a Python program that combines sequence clustering and PCR chimera detection (Rothfels et al., 2017). The final step in the Fluidigm2PURC pipeline processes clusters from PURC and outputs a FASTA file containing phased haplotypes for all targeted sequences. This last step has methods for haplotype inference that work on diploids, polyploids, individuals with unknown ploidy, or any mixture of the three. To demonstrate the utility of Fluidigm2PURC, we analyzed nuclear amplicon data from the genus Thalictrum L. (Ranunculaceae) and compared the results with those obtained from dbcAmplicons using the reduce_ amplicons.R script (Uribe-Convers et al., 2016).

Input data
The input data for Fluidigm2PURC are paired-end FASTQ files that have been demultiplexed using the program dbcAmplicons (Uribe-Convers et al., 2016). dbcAmplicons demultiplexes reads using the original sample barcodes and amplicon primer sequences to annotate the reads with the sample and locus name that each read comes from, followed by trimming these identifying parts of the sequence. The resulting pair of FASTQ files is then input into the first script in the pipeline, fluidigm2purc.
Step 1: fluidigm2purc The fluidigm2purc script takes the paired-end FASTQ files, filters them using Sickle (Joshi and Fass, 2011; minimum length = 100 bp, PHRED threshold = 20), merges the filtered reads using FLASH2 (Magoč and Salzberg, 2011), and then converts the resulting FASTQ files into FASTA files (one for each locus) with sequence header information that is compatible with PURC. The sequence headers for PURC follow the format '>IndividualName|LocusName|UniqueID#' . When paired reads with low-quality bases are trimmed by Sickle, or if the targeted region is too long so that the two reads do not overlap, we merge them artificially with multiple N's inserted in between. This allows us to keep paired reads together in downstream steps. The fluidigm2purc script writes two additional files: (1) the taxon table, a two-column table listing each sequenced taxon and its ploidy level, and (2) the locus-err table, a two-column table listing each sequenced locus and the average level of sequencing error for all reads coming from that locus. The taxon table lists the ploidy as "None" for all individuals by default, but known ploidy levels can be included by the user (e.g., diploid has the value "2, " tetraploid has the value "4, " etc.). For the locus-err table, the per-locus levels of sequencing error are calculated individually from the input FASTQ files using the average PHRED score per read averaged across all reads coming from that locus.
Step 2: PURC The output FASTA files from fluidigm2purc can be run through PURC using the purc_recluster.py script (Rothfels et al., 2017). This script is used to iteratively run chimera detection and sequence clustering (performed with USEARCH; Edgar, 2010;Edgar et al., 2011) on each locus individually to produce a reduced set of putative haplotypes that includes size information about the number of original reads forming each cluster. Details on running PURC can be found on its Bitbucket page (https://bitbucket.org/crothfels/ purc).

Step 3: crunch_clusters
The clusters output by PURC are then run through our second script, crunch_clusters, which uses the taxon table and locus-err table output by fluidigm2purc (Step 1) to infer haplotypes in a maximum likelihood framework. This script also has options for realigning clusters using MAFFT (Katoh, 2013), as well as cleaning the clusters using Phyutility (Smith and Dunn, 2008).
Before haplotypes can be inferred at a locus, we first do a pairwise comparison of all clusters for each taxon individually and merge any clusters that are identical (ignoring gaps). This step is necessary because of the initial trimming/filtering step in the flui-digm2purc script. Artificially joining unmerged reads often causes two sequencing clusters representing the same haplotype to form: (1) one cluster for reads that were merged, and (2) one cluster for the reads that were artificially merged and contain a large number of gapped sites in the middle. In this case, these two clusters should not be treated as separate haplotypes, so we combine the clusters by keeping the larger haplotype (i.e., the one with fewer gaps) and adding the sizes of the two clusters together. The alternative would be to process the original data by ignoring all reads that did not merge. However, throwing away unmerged reads could potentially discard sequence variation that should be represented in the data set, especially if most reads are unmerged, which may be the case for large amplicons.
The downside of merging sequences that are identical except for gaps is that it potentially discards informative indel variation, although it is unlikely that a locus within an individual would have only one of its haplotypes containing gaps and not the others.
Overall, we felt that this approach provided the best method for including more of the original data when inferring haplotypes.

Inferring haplotypes with ploidy informa-
tion-For known ploidy levels, we use a multinomial likelihood to determine the number of copies of each potential haplotype using the ordered cluster sizes returned by PURC (largest to smallest). Given an individual of ploidy level K, we enumerate the number of possible haplotype configurations using integer partitions (an unordered set of integers that sums to K; Stojmenovic and Zoghbi, 1998). Because the cluster sizes are sorted, we never need to consider more than the first K largest clusters. For example, a tetraploid can have a maximum of four haplotypes, and the integer partitions to consider are (4,0,0,0), (3,1,0,0), (2,2,0,0), (2,1,1,0), and (1,1,1,1). This corresponds to (four copies of haplotype 1), (three copies of haplotype 1, one copy of haplotype 2), (two copies of haplotype 1, two copies of haplotype 2), (two copies of haplotype 1, one copy of haplotype 2, one copy of haplotype 3), and (one copy of haplotype 1, one copy of haplotype 2, one copy of haplotype 3, one copy of haplotype 4). The mathematical details for the likelihood function with an example calculation are presented in Appendix S1 (section S1.1). Once the most likely configuration has been identified, the crunch_clusters script will return each haplotype in proportion to its representation in the maximum likelihood estimate. We have also provided options to return only unique haplotypes and to treat loci as haploid, the latter of which can be used to process organellar data. The haploid option can also be used as an alternative to finding consensus sequences for nuclear loci by returning only the cluster with the most reads.
Inferring haplotypes without ploidy information-For unknown ploidy levels, we no longer have information about the maximum number of haplotypes that an individual can have. However, we can use the cluster sizes to infer which clusters from PURC are actual haplotypes versus those that are likely to be sequencing errors. We do this by calculating the likelihood that each successive haplotype in the sorted list is a "real" haplotype versus a sequencing error.
As an example, consider a tetraploid with six clusters identified by PURC. We first calculate the likelihood that all clusters are errors.
Then we calculate the likelihood that cluster 1 is a real haplotype, and 2 through 6 are errors. Next, we calculate the likelihood that clusters 1 and 2 are real haplotypes, and that 3 through 6 are errors. This continues until we calculate the likelihood of all six clusters being real haplotypes. We then apply a cutoff that uses the relative increase in the likelihood when an additional haplotype is added. If treating an additional cluster as a haplotype increases the likelihood by less than the cutoff, then only the previous haplotypes are kept and the others are considered errors. We use a default cutoff of 10% increase in the likelihood. An example with the likelihood function that we use for this approach is provided in Appendix S1 (section S1.2).

Example analysis
To demonstrate the use of the Fluidigm2PURC pipeline, we analyzed amplicon sequence data generated from orthologs of the nuclear gene PISTILLATA (PI) in the genus Thalictrum (Ranunculaceae), which is single-copy in diploids and two-copy in tetraploids (Di Stilio et al., 2005). PI is responsible for establishing stamen and petal identity during flower development in Arabidopsis thaliana (L.) Heynh. (Goto and Meyerowitz, 1994) and has been used to detect  (Lee et al., 2002;Soza et al., 2014). Given the length of the PI locus, primers were designed to sequence exons 3 to 6 in two overlapping ~600-bp segments: exons 3 to 5 (PIS_4) and exons 4 to 6 (PIS_3). Our analyses focused on six species with known ploidy levels ranging from diploid (2n = 2x =14) to 22-ploid (2n = 22x = 154). These species are presented in Table 2, with accession numbers following Soza et al. (2013). Paired-end reads were demultiplexed and annotated using dbcAmplicons (Uribe-Convers et al., 2016) followed by read trimming, merging, and sequence renaming using the fluidigm2purc script with default options. All reads coming from PIS_3 and PIS_4 were then run separately through PURC using the purc_recluster.py script (Rothfels et al., 2017). After clustering and chimera detection, we determined haplotypes for each amplicon using three different approaches: (1) consensus sequences using the '--haploid' option, (2) unique haplotypes assuming unknown ploidy (10% likelihood cutoff), and (3) unique haplotypes using known ploidy. For each of these methods, we realigned and cleaned the sequences using MAFFT (Katoh, 2013) and Phyutility (added the options '--realign --clean 0.33'; Smith and Dunn, 2008). As a comparison, we also analyzed these data using the reduce_amplicons.R script from the dbcAmplicons package (v0.8.5; Uribe-Convers et al., 2016). This script merges paired-end reads using FLASH2 (Magoč and Salzberg, 2011) and allows for a global read trimming size to be used for read 1, read 2, or both. Unmerged reads are treated independently, resulting in separate haplotypes for read 1 and read 2. The final result is a FASTA file with the unaligned haplotypes that can be further processed for downstream applications. We generated consensus haplotypes as well as haplotypes based on read occurrence (controlled by the minimum read frequency and minimum read count) using the default settings, and trimmed 20 bp from read 1 and 40 bp from read 2 (base quality drops off more quickly in read 2, so we removed more bases; Uribe-Convers et al., 2016). We then aligned the resulting sequences using MAFFT (Katoh, 2013). These results were compared to the haplotypes from Fluidigm2PURC based on (1) the number of recovered haplotypes, (2) the length of the resulting alignment, and (3) the amount of gaps in the alignment.
Results-Haplotypes inferred by both methods were visualized and compared using alignment statistics computed in Geneious version 8.1.8 (Kearse et al., 2012) and MEGA version 7.0.18 (Kumar et al., 2016). Consensus sequences from the Fluidigm2PURC and dbc Amplicons pipelines were similar overall, with the reduce_ amplicons.R script producing longer haplotypes, but containing more gaps (Table 3). We then compared the occurrence-based method from the reduce_amplicons.R script with the crunch_clusters results when ploidy levels are treated as unknown. In this case, Fluidigm2PURC recovered more haplotypes with fewer gaps and more parsimony informative sites. We believe the reason that the reduce_amplicons.R script recovered so few haplotypes is due to its use of minimum read count and frequency criteria that rely on reads being identical to form haplotypes, rather than clustering based on similarity. Inferring haplotypes with Fluidigm2PURC using known ploidy levels resulted in the largest number of recovered haplotypes. The reason that using known versus unknown ploidy levels produced more haplotypes (PIS_3: 57 vs. 18, PIS_4: 43 vs. 14) was because the cluster sizes that went into the likelihood calculation were disparate for some species (a few large clusters and many smaller ones), making the smaller clusters difficult to model when the ploidy level was unknown due to lack   of prior knowledge about how many haplotypes should be expected. On a per species basis, using known ploidy levels always led to more inferred haplotypes, but only for polyploids (Table 4). For example, the PIS_4 region for Thalictrum pubescens Pursh recovered 15 haplotypes when assuming known ploidy (analyzed as 22x), but only one haplotype when assuming unknown ploidy. The reason for this is that the cluster data for this species had one putative haplotype with many reads (147), but all other putative haplotypes had far fewer reads (the next largest cluster had 25 reads, and nine clusters had fewer than 10 reads). In general, drawing the line between real haplotypes and errors for clusters with lower read counts is difficult when the ploidy level is unknown. By applying a threshold, the method we implement is a conservative way to estimate haplotypes that only includes clusters with the highest read counts.

CONCLUSIONS
The ability to infer haplotypes regardless of an individual's ploidy level is a crucial step toward understanding the complex relationships within many plant groups whose evolutionary histories often contain multiple instances of hybridization and whole genome duplication (Soltis and Soltis, 2009;Van de Peer et al., 2009). As models that accommodate these processes continue to be developed (e.g., Jones et al., 2013;Solís-Lemus and Ané, 2016;Oberprieler et al., 2017;Thomas et al., 2017;Wen and Nakhleh, 2018), we anticipate that the functionality of our pipeline will be especially useful for conducting phylogenomic studies with nuclear sequence data. Furthermore, the increase in genomic resources for taxa across the plant tree of life will continue to facilitate the process of phylogenetic marker development, allowing more researchers to take advantage of targeted enrichment strategies such as double-barcoded amplicon sequencing. Compared with existing approaches for analyzing these data, the methods we present here offer an improved workflow for sequence processing, clustering, and haplotype inference, and are particularly well suited for analyses in taxa for which there is incomplete knowledge about ploidy levels.

ACKNOWLEDGMENTS
The authors thank C. Rothfels for helpful discussions regarding the use of PURC. We also thank S. Uribe-Convers for providing valuable feedback while testing the Fluidigm2PURC code. This work was supported by the following grants from the National Science

DATA ACCESSIBILITY
Fluidigm2PURC is an open source software that is freely available on GitHub (https://github.com/pblischak/fluidigm2purc) for Unixlike operating systems (Mac, Linux) under the GNU General Public License v3. We have also built a Docker image with all dependencies (Table 1) pre-installed for use on any operating system with a compatible distribution of the Docker software (https://hub.docker.com/r /pblischak/fluidigm2purc) (https://www.docker.com; Merkel, 2014). Fluidigm2PURC is written in Python and has been successfully tested using Python versions 2.7 and 3.6. Documentation for the software can be found on ReadTheDocs (http://fluidigm2purc.readthedocs.io). The code for each step of our example analysis is available in Appendix S1 (section S2). Raw sequence data from the PIS_3 and PIS_4 loci for the six sampled Thalictrum species, as well as all output FASTA files from the Fluidigm2PURC and dbcAmplicons pipelines, are available from the Dryad Digital Repository (https://doi. org/10.5061/dryad.89k5k30; Blischak et al., 2018).

SUPPORTING INFORMATION
Additional Supporting Information (Appendix S1) may be found online in the supporting information tab for this article. For Thalictrum pubescens, haplotypes are presented at both the 12x and 22x level.