Integrating herbarium specimen observations into global phenology data systems

Premise of the Study: The Plant Phenology Ontology (PPO) was originally developed to integrate phenology observations of whole plants across different global observation networks. Here we describe a new release of the PPO and associated data pipelines that supports integration of phenology observations from herbarium specimens, which provide historical and modern phenology data.
Methods and Results: Critical changes to the PPO include key terms that describe how measurements from parts of plants, which are captured in most imaged herbarium specimens, relate to whole plants. We provide proof of concept for ingesting annotations from imaged herbarium sheets of Prunus serotina, the common black cherry. We then provide an example analysis of changes in flowering timing over the past 125 years, demonstrating the value of integrating herbarium and observational phenology data sets.
Conclusions: These conceptual and technical advances will support the addition of phenology data from herbaria, and could also be extended to facilitate the inclusion of data from photograph‐based citizen science platforms. With the incorporation of herbarium phenology data, new historical baseline data will strengthen the capability to monitor, model, and forecast plant phenology changes.

Plant phenology, the timing of plant life-cycle events such as flowering or leafing out, plays a fundamental role in the functioning of terrestrial ecosystems, including human agricultural systems (Reilly et al., 1996; Chmielewski et al., 2004; Visser and Both, 2005; Franks et al., 2007; Bertin, 2008; Willis et al., 2008; Miller-Rushing et al., 2010; Anderson et al., 2012; McKinney et al., 2012; Miller-Struttmann et al., 2015). Phenology shifts over time are often the most immediate and visible ecological response to environmental change and, as a result, can serve as a "canary in the coal mine" for more drastic ecosystem changes (Parmesan and Yohe, 2003; Menzel et al., 2006; Cleland et al., 2007; Intergovernmental Panel on Climate Change, 2007; Wolkovich et al., 2012; Chuine and Régnière, 2017). Given the need to understand how phenology is changing in response to human pressure, monitoring programs have been set up at regional and continental scales to provide the evidential basis for detecting change (Koch et al., 2010; Rosemartin et al., 2014; Elmendorf et al., 2016; Templ et al., 2018). However, as noted recently (Kissling et al., 2018; Stucky et al., 2018), such systems often have different reporting standards and procedures, which can make broader-scale interoperability a challenge. Additionally, there are many untapped sources of phenological data over longer time scales, such as herbarium specimens, that could be integrated with data from observation networks, but there are technical challenges to doing so properly.
Initial development of the Plant Phenology Ontology (PPO; Stucky et al., 2018) and associated informatics pipelines [...] Koch et al., 2010; Templ et al., 2018; http://www.pep725.eu/). Data in this knowledge base can be accessed via a web portal (http://www.plantphenology.org) and an R package (https://cran.r-project.org/web/packages/rppo/). Although this knowledge base greatly expands the spatial coverage of integrated plant phenological data for research, most phenology observation networks were established within the current century and thus lack historical reports of phenology for most plant taxa (Stucky et al., 2018). Herbarium specimens contain a wealth of historical phenological information that, when annotated and shared, could improve both the temporal and spatial coverage of available phenological data (Davis et al., 2015). With large-scale imaging of plant specimens as part of national digitization efforts such as iDigBio (Page et al., 2015; http://www.idigbio.org/), there is an unparalleled opportunity to assemble hundreds of millions of new phenology observations based on these imaged herbarium specimens. A key next step in phenological data integration efforts is to integrate phenological data from herbarium specimens with phenological data from field observations. In this paper, we provide a new framework for integrating herbarium and field phenology observations, and we show that such integration is technically feasible. First, we describe the changes to the PPO needed to integrate phenological data from herbarium records with those from in-situ observations of whole plants. Then, we describe the updated pipeline tools that take raw input phenology observations and convert them to a knowledge base of interoperable data available to any interested user. Finally, we provide an example analysis using integrated herbarium and observation data to demonstrate that the increased temporal coverage is useful for documenting phenology change dynamics. We close by briefly discussing potential applications of this work beyond herbarium records, including the burgeoning photographic evidence of phenology coming from incidentally collected citizen science efforts.

Modeling phenology observations from herbarium specimens in the PPO
The PPO provides the standardized terminology, definitions, and logical axioms that are needed for large-scale phenological data integration. Because observation network data are based on whole plants, and herbarium sheets often do not contain whole plants, our goal was to extend the PPO and the supporting data integration pipeline to enable accurate inferences about phenological data from herbarium specimens.
The first step in accomplishing this was to develop a way to model phenology observations of parts of plants in the logical framework of the PPO. Given its initial design for observation network data, the logical axioms in the PPO connected all phenology observations to instances of the class 'whole plant' (from the Plant Ontology; Cooper et al., 2013). To make inferences about herbarium data, we needed a way to relate parts of plants to whole plants, as well as a way to logically translate what a phenology observation of a part of a plant means in the context of a whole plant. To accomplish this, we created a new class in the PPO called 'portion of a plant', which is defined as a 'plant structure' (from the Plant Ontology) that 'is or was part of' a 'whole plant' (Cooper et al., 2013). (Note that 'is or was part of' is also a new object property in the PPO.) To facilitate translation of phenology observations of a 'portion of a plant' to information about the associated 'whole plant', we created a new object property called 'generated from' to describe the relationship between the original phenology observation data for a 'portion of a plant' and another set of phenological data for the 'whole plant' of which that 'portion of a plant' was a part. With these new entities, we extended the PPO to allow for phenology observations of instances of 'portion of a plant' as well as instances of 'whole plant'. This new model maintains the PPO logical backbone that relies on instances of whole plants to make inferences without losing accuracy when observing plant parts. The resulting full model for phenology observations of parts of plants is described in detail below (see Results). The addition of new terms and properties to the PPO had nontrivial cascading effects on the existing ontology structure; in all, we had to change axioms for well over 100 terms in the PPO.
After finalizing the new version of the PPO, we merged all changes into the main Plant Phenology Ontology repository on GitHub (https://github.com/PlantPhenoOntology/ppo) and created a new ontology release (https://github.com/PlantPhenoOntology/ppo/releases/tag/v2019-01-16).

Assembling a test data set, formatting and mapping the test data set to the PPO
To provide a herbarium phenology test case for the PPO and our integration pipeline, we generated first-order phenology scorings (Yost et al., 2018) for images of P. serotina Ehrh. We analyzed all images on iDigBio that were linked to digitized specimen records with georeferences. The institutions that house these specimens are listed in Appendix 1.
One of us (R.P.G.) with experience in annotating Prunus L. species scored each image for presence or absence of unopened flowers, opened flowers, senesced flowers, and fruits. During scoring, any potential species misidentifications were noted and eliminated from the final data set, because it can be challenging to distinguish P. serotina from other Prunus species, especially P. virginiana L. Scoring of opened flowers, unopened flowers, and fruits followed reporting standards from the NPN (Rosemartin et al., 2014). Transitional cases where early fruits are barely visible but flower material is still present were coded as senesced flowers. These were all later double-checked by R.P.G., after also scoring multiple other Prunus species, in order to verify accuracy. In total, 570 images were scored.

Changes to the supporting integration pipeline and web portal user interfaces
http://www.wileyonlinelibrary.com/journal/AppsPlantSci © 2019 Brenskelle et al.
The phenology data integration pipeline was originally developed for processing whole plant observation data acquired in the field. We extended the pipeline by implementing new rules, detailed in Table 1. The informatics pipeline takes incoming data as comma-separated value (CSV) files, converts the data to Resource Description Framework (RDF) triples, runs inferencing on the RDF triples, and writes the output data back to a CSV format. Because analyses of phenology observing process data are based on whole plants, and not portions of plants, for now the pipeline only produces output data about whole plants, not portions of plants. All pipeline code is available at: https://github.com/biocodellc/ontology-data-pipeline. The pipeline configurations used for this work can be found at: https://github.com/biocodellc/ppo-data-pipeline.
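The CSV-to-triples-to-CSV flow described above can be illustrated with a minimal sketch. It is not the pipeline's actual code: plain Python tuples stand in for RDF triples, a one-entry lookup table stands in for ontology-based inferencing, and the column names (record_id, has_trait, year), the record identifier, and the rule in IMPLIES are all hypothetical.

```python
import csv
import io

def csv_to_triples(csv_text):
    """Turn each non-empty CSV cell into a (subject, predicate, object) tuple."""
    triples = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = row["record_id"]  # hypothetical identifier column
        for column, value in row.items():
            if column != "record_id" and value:
                triples.add((subject, column, value))
    return triples

# Hypothetical rule table standing in for ontology reasoning:
# the trait on the left implies the broader trait on the right.
IMPLIES = {"open flowers present": "flowers present"}

def infer(triples):
    """Add one implied trait triple for every asserted trait that has a rule."""
    inferred = set(triples)
    for subject, predicate, obj in triples:
        if predicate == "has_trait" and obj in IMPLIES:
            inferred.add((subject, "has_trait", IMPLIES[obj]))
    return inferred

def triples_to_csv(triples):
    """Write the triples back out as a three-column CSV, sorted for stable output."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["subject", "predicate", "object"])
    writer.writerows(sorted(triples))
    return out.getvalue()

raw = "record_id,has_trait,year\nspecimen-1,open flowers present,1901\n"
result = triples_to_csv(infer(csv_to_triples(raw)))
print(result)
```

A real deployment would replace the tuple set with an RDF store and the lookup table with a reasoner driven by the PPO axioms; the sketch only shows the order of the four stages.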

Example data analysis
Prior to herbarium data being added to the Global Plant Phenology Portal (http://www.plantphenology.org), the earliest records of opened flowers available for our exemplar species, P. serotina, dated to 2007. After we added the herbarium records, we had a total of 969 observations of opened flowers for P. serotina dating back to 1875, with 203 herbarium specimen annotations and 766 observations from the NPN.
To analyze the effects of latitude and time on the earliest day of flowering, we fit a multiple linear regression model with day_of_earliest_flowering as the response and year, latitude, and the interaction between year and latitude as predictors (i.e., day_of_earliest_flowering = year + latitude + latitude*year). To check for biases in observation dates between the two data sources (NPN and herbarium data), we analyzed a linear regression model with day_of_earliest_flowering as the response and latitude and data_source as predictors, using only data from the years 2007-2018, the years for which both NPN and herbarium data were available. Prior to these statistical analyses, we spatially aggregated the data. To do this, we first aggregated all observations to 0.1-degree grid cells. Then, for each grid cell, the earliest flowering date reported within the grid cell for a given year was the value used for fitting the statistical models. To test for observation biases, these aggregation steps were done separately for each data source.
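The aggregation step lends itself to a short sketch. The code below is one possible reading of it, not the authors' analysis script: the field names (latitude, longitude, year, day_of_year), the floor-based assignment of coordinates to 0.1-degree cells, and the sample records are all assumptions made for illustration. The last function builds one row of the regression design matrix (intercept, year, latitude, year × latitude), which could then be passed to any ordinary least squares routine.

```python
import math

def grid_cell(latitude, longitude, size=0.1):
    """Index of the size-degree cell containing a point (floor-based binning).

    Rounding before flooring guards against floating-point noise such as
    35.3 / 0.1 evaluating to 352.99999999999994.
    """
    return (math.floor(round(latitude / size, 9)),
            math.floor(round(longitude / size, 9)))

def earliest_by_cell_year(observations, size=0.1):
    """Keep the earliest flowering day of year for each (grid cell, year) pair."""
    earliest = {}
    for obs in observations:
        key = (grid_cell(obs["latitude"], obs["longitude"], size), obs["year"])
        day = obs["day_of_year"]
        if key not in earliest or day < earliest[key]:
            earliest[key] = day
    return earliest

def design_row(year, latitude):
    """One design-matrix row: intercept, year, latitude, year-by-latitude interaction."""
    return [1.0, float(year), float(latitude), float(year) * float(latitude)]

# Made-up records for illustration only.
observations = [
    {"latitude": 35.27, "longitude": -80.84, "year": 2010, "day_of_year": 110},
    {"latitude": 35.21, "longitude": -80.89, "year": 2010, "day_of_year": 102},
    {"latitude": 35.27, "longitude": -80.84, "year": 2011, "day_of_year": 120},
    {"latitude": 40.71, "longitude": -74.01, "year": 2010, "day_of_year": 130},
]
aggregated = earliest_by_cell_year(observations)
```

To reproduce the bias check, the same function would be run once per data source before the two aggregated sets are combined.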

Updated phenology observation model
The PPO's new model of a phenology observation of a herbarium sheet or any other 'portion of a plant' (e.g., a citizen science photograph of a branch of a tree) is illustrated in Figure 1. In the PPO, 'portion of a plant' is a subclass of 'plant structure', similar to the existing term 'whole plant'. The overall process model starts with the input of a 'portion of a plant' into a 'phenology observing process' with an output of a measurement of a 'plant phenological trait'. In the example shown in Figure 1, the output of the observing process is a measurement of 10 to 20 'unfolded true leaves present'. In order to make further inferences about whole plants, the relationship 'is or was part of' is used to link a 'portion of a plant' to a 'whole plant'. The term 'is or was part of' accounts for the fact that a 'portion of a plant' may either be derived from a plant part (i.e., it was at some point part of a whole plant) or may still be part of a plant. Although herbarium specimen images are all of structures that were derived from a plant part, we use "is or was" to be more general and account for future work on images of intact plants that show only part of the plant.
Another key part of the new observation model is the use of 'generated from' to link the original 'measurement datum', which is a direct output from the observing process, to an IAO: 'data item' (from the Information Artifact Ontology [IAO]; Ceusters, 2012) about a whole plant. We use 'data item' because the data about the 'whole plant' are derived from the data about the 'portion of a plant' and thus are not directly measured (and, therefore, cannot be instances of 'measurement datum'). To connect a 'data item' to the observed phenological trait, we also minted a new property 'quality datum of', which also has an inverse property 'has quality datum' (see Fig. 1). The data about the 'whole plant' are generated by the integration pipeline, as discussed above.
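The chain of links in this model can be made concrete with a small sketch that stores the relationships as plain (subject, predicate, object) tuples. This is an illustration under stated assumptions, not the PPO's OWL serialization: every identifier (branch-1, plant-1, and so on) is hypothetical, "has input" and "has output" are placeholder names for whatever properties connect the observing process to its input and output, and the direction of 'generated from' (derived data item pointing back to the measurement datum) is our reading of the text. The 'quality datum of' link is omitted for brevity.

```python
# Hypothetical instance data; class and property labels follow the text where given.
TRIPLES = {
    ("branch-1", "rdf:type", "portion of a plant"),
    ("plant-1", "rdf:type", "whole plant"),
    ("branch-1", "is or was part of", "plant-1"),
    ("observing-1", "rdf:type", "phenology observing process"),
    ("observing-1", "has input", "branch-1"),        # placeholder property name
    ("observing-1", "has output", "measurement-1"),  # placeholder property name
    ("measurement-1", "rdf:type", "measurement datum"),
    ("data-item-1", "rdf:type", "data item"),
    ("data-item-1", "generated from", "measurement-1"),
}

def objects(subject, predicate, triples=TRIPLES):
    """All objects o such that (subject, predicate, o) is asserted."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

def subjects(predicate, obj, triples=TRIPLES):
    """All subjects s such that (s, predicate, obj) is asserted."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)

def whole_plant_data(portion, triples=TRIPLES):
    """Follow portion -> observing process -> measurement datum -> derived data item."""
    items = []
    for process in subjects("has input", portion, triples):
        for measurement in objects(process, "has output", triples):
            items.extend(subjects("generated from", measurement, triples))
    return sorted(items)

derived = whole_plant_data("branch-1")
parent = objects("branch-1", "is or was part of")
```

Here `derived` holds the data items about the whole plant, while `parent` identifies which whole plant the observed portion belongs or belonged to.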
We must note that not all herbarium specimens are a 'portion of a plant'. Herbarium sheets can and do contain whole plants, and although this is uncommon for Prunus and most other woody taxa, it is common for many herbaceous species. The PPO's new data model does not require that herbarium specimens be treated as a 'portion of a plant'. Rather, herbarium specimens can also be represented as instances of 'whole plant' when appropriate. Even if a herbarium specimen that is a 'whole plant' is mistakenly treated as 'portion of a plant', the axioms of the PPO and logic in the pipeline are such that no incorrect inferences will be obtained, although some inferences will be less informative than possible (e.g., observations of absences; see Table 1). There are also cases in which a single herbarium sheet contains multiple specimens; in these cases, a separate observation should be recorded for each specimen to ensure correct reasoning.

Exemplar data on the Global Plant Phenology Data Portal
The test data set assembled for P. serotina was run through the updated PPO data integration pipeline and ultimately added to the Global Plant Phenology knowledge base (http://www.plantphenology.org). The ingest toolkit and all individual steps are described more fully in the plant phenology data pipeline GitHub repository (https://github.com/biocodellc/ppo-data-pipeline). Mapping files required to perform integration and the ingested data file itself are available on GitHub (https://github.com/biocodellc/ppo-data-pipeline/tree/master/projects/herbarium), thus providing a general template for further ingestion of new herbarium data. All of these phenological data for P. serotina are now available on the online portal, as shown in Figure 2. The portal interface has been adjusted so that at the top of the page, users can see a breakdown of the data sources where these results came from. In the P. serotina example, there were 203 total herbarium records with open flowers and 766 open flower observations from the NPN. The page also presents users with two options for viewing these data: a map visualization (Fig. 2) or a standard table with text fields and values that provide standardized content with the same field headers and other elements for all data resources. All data can be downloaded for further analyses using the "Download" button. All herbarium annotations have Uniform Resource Identifier (URI) links back to the specimen records from which those annotations were made.

TABLE 1. Observations of a 'portion of a plant' generate an output trait called a 'data item', but these need to be translated into descriptions of phenological traits for the associated 'whole plant'. Below is the logical mapping used by the pipeline to make those translations. For example, if an observer reports a lower count of five flowers and upper count of 10 on a herbarium specimen image of a 'portion of a plant', the qualitative reporting is 'present' for 'portion of a plant'. The mapping for the whole plant output is a lower count of five and an undefined upper count, because it is impossible to know how many flowers were actually on the whole plant. (Table body not recovered; the surviving column header reads "'portion of a plant' observation data".)
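The worked example in the Table 1 caption (a count of 5 to 10 on a portion becomes "at least 5" on the whole plant) can be sketched as a small translation function. Only the 'present' case is taken from the caption. The handling of 'absent' is an assumption based on the remark that absences observed on a portion are less informative, and the TraitData type and its field names are hypothetical, not part of the pipeline.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TraitData:
    status: Optional[str]       # "present", "absent", or None when nothing can be inferred
    lower_count: Optional[int]  # None means undefined
    upper_count: Optional[int]  # None means undefined

def portion_to_whole(portion: TraitData) -> TraitData:
    """Translate trait data for a 'portion of a plant' into data for its 'whole plant'."""
    if portion.status == "present":
        # Structures seen on the portion certainly exist on the whole plant, so the
        # lower bound carries over; the upper bound is unknowable and is dropped.
        return TraitData("present", portion.lower_count, None)
    # ASSUMPTION: absence on a portion does not establish absence on the whole
    # plant, so no status or counts are asserted.
    return TraitData(None, None, None)

whole = portion_to_whole(TraitData("present", 5, 10))
print(whole)
```

The complete set of rules is the one in Table 1 and the pipeline configuration; this sketch covers only the case spelled out in the caption plus one labeled assumption.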

Results of Prunus serotina analysis
Our analysis of flowering times revealed that, on average, P. serotina in North America has steadily accelerated its flowering times since 1873 (Fig. 3), and this effect was statistically significant after controlling for the effects of latitude (overall model P < 0.001, F = 105.2 [3 and 366 df], adjusted R² = 0.459; see Table 2 for individual coefficient estimates and P values). The interaction between year and latitude was also significant (Table 2), which suggests that phenological shifts vary by geographic location. Our analysis of potential bias in observation dates between the two data sources did not indicate a significant difference in observation dates between NPN and herbarium-based observations when controlling for latitudinal effects (estimated mean difference between NPN and herbarium-based observations: 3.13 days, P = 0.695). However, the sample sizes for this analysis were extremely unbalanced (NPN: n = 178, herbarium: n = 16), so it is difficult to draw any definitive conclusions about data source biases.
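Written out, the fitted model and the meaning of the interaction term are as follows. The β symbols are notation introduced here for illustration; the estimates themselves are those reported in Table 2. The last line assumes that the 366 denominator degrees of freedom are the residual degrees of freedom of this four-parameter model, in which case the model was fit to 370 aggregated grid-cell-by-year values.

```latex
\begin{aligned}
\text{day\_of\_earliest\_flowering}_i
  &= \beta_0 + \beta_1\,\text{year}_i + \beta_2\,\text{latitude}_i
     + \beta_3\,(\text{year}_i \times \text{latitude}_i) + \varepsilon_i \\
\frac{\partial\,\text{day\_of\_earliest\_flowering}}{\partial\,\text{year}}
  &= \beta_1 + \beta_3\,\text{latitude} \\
n &= \text{df}_{\text{residual}} + p + 1 = 366 + 3 + 1 = 370
\end{aligned}
```

The second line is why a significant interaction implies location-dependent shifts: the change in flowering day per year is not a single constant but a linear function of latitude.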

Future improvements to the PPO related to portions of plants
We have provided a model and shown proof of concept for assembling data products from herbarium specimens or any other resource that contains parts of plants as opposed to whole plants. To move forward with integrating herbarium- and field-based phenological data, we took some shortcuts in ontology development that allowed us to proceed while we wait for completion of complementary work in external ontologies. Specifically, we created two object properties, 'is or was part of' and 'generated from', that are consistent with the logic of the Relations Ontology (RO) (http://purl.obolibrary.org/obo/ro.owl), but can be made more logically meaningful and compatible with other Open Biological and Biomedical Ontology (OBO) Foundry (http://www.obofoundry.org/) ontologies in future releases of the PPO. For "parthood," we required that some significant portion of the plant is not present or not visible (i.e., stronger than an irreflexive version of the RO: 'part of') in either the present or the past tense. There is currently no easy way to express this using extant terms in OBO ontologies, so our solution is to use the named class 'portion of a plant' defined by the 'is or was part of' relation and make the inferences of presence traits in the data ingest pipeline, as described above. Because similar kinds of traits exist more widely in biodiversity studies, we will work with the BioCollections Ontology (Walls et al., 2014) to develop a robust ontology design pattern for this kind of data.
The 'generated from' relation is a "shortcut" relation that implies an instance of a non-specified process that takes the observation of a 'portion of a plant' as input and generates data about an associated 'whole plant'. For the purposes of integrating this initial set of herbarium data, it was not necessary to define this process and specify the full chain of logic leading to the final data. However, implementing the full process model would provide more information and would be more robust to future data needs. Because this pattern of deriving data from an observation is a process that is used widely beyond phenology, rather than implementing it within the PPO, we will wait until the BioCollections Ontology develops the logic for this type of process and then reuse that pattern in the PPO.

Scaling up with more herbarium specimen data
Large-scale imaging of natural history collections and associated digitization of specimen label information is well underway with tens of millions of images already available, and with potentially hundreds of millions of images possible. The example analysis with P. serotina records showcases the value of integrating herbarium specimen data with more recent, field observation data. Prior to the addition of the 570 phenology observations from herbarium specimens, which include all of the data prior to 2007, an analysis of long-term temporal change in P. serotina phenology would have been challenging. Although this analysis is not meant to determine drivers of change, we did find a significant trend toward earlier flowering over time, when accounting for latitude. Given the millions of herbarium specimen images already available in online databases, there is enormous potential to develop a comprehensive phenology database with a rich historical record that would enable significant new research about phenology changes. Currently, obtaining phenological data from herbarium specimens (or images of specimens) remains a labor-intensive task, which is a significant barrier for developing large historical phenology data sets. However, crowdsourced citizen science platforms such as Notes from Nature (Hill et al., 2012; https://www.notesfromnature.org/) and CrowdCurio (Willis et al., 2017) offer at least a partial solution to this problem. In addition, computer vision techniques could allow for rapid, highly automated extraction of phenological data from specimen images (Lorieul et al., 2019; Stucky et al., unpublished data). Finally, not all herbarium specimens are parts of plants. Sometimes, whole plants are captured on sheets, and there are also cases of multiple plants on a single sheet. Recording this information could allow smarter inferences about how observations of specimens relate to whole plants. This is another area to explore, potentially with machine learning approaches.
We close by noting that herbarium specimens are not the only phenology-relevant resources that feature parts of plants. Novel citizen science platforms are generating a significant volume of plant images, as are other resources such as historical photos or videos that contain phenology-relevant data. For example, there are currently (as of 24 October 2018) 5860 photographs from 3000 observers of black cherry (P. serotina) plants on the citizen science platform iNaturalist (https://www.inaturalist.org). The vast majority of those records were generated in the past two years, suggesting strongly that future growth may be even faster. Full utilization of such resources for phenology represents an untapped potential, and there is no doubt that flowering state can be discerned from most of these opportunistic reports. The innovations here for integrating herbarium specimens will work just as well for phenology trait capture from citizen science photographs of parts of plants. It is via a combination of data sets that we will be best able to examine the longer-term picture of phenology change, and to put together the most comprehensive data possible to continue monitoring phenology. We argue that data integration from multiple resources using tools such as the PPO and ontology-informed data integration pipelines is a critical step to meet phenology monitoring and modeling goals.

ACKNOWLEDGMENTS
The authors thank Ellen Denny and Jake Weltzin from the National Phenology Network for their support. Grants from the U.S. Geological Survey under Grant/Cooperative Agreement No. G16AC00268 and from the National Science Foundation-Science Across Virtual Institutes (NSF-SAVI; Award Number 1321595) were critical in development of the Plant Phenology Ontology and associated data processing resources upon which this contribution is built. R.W. was supported by CyVerse (NSF DBI-0735191, DBI-1265383).

AUTHOR CONTRIBUTIONS
L.B., R.P.G., B.J.S., and R.W. conceived of this study. R.P.G. performed data collection for the proof of concept work and L.B., R.P.G., and B.J.S. performed the statistical analyses presented here. B.J.S., L.B., and R.W. made needed changes to the Plant Phenology Ontology and dealt with managing new releases. J.D., with help from L.B., adapted the mapping files and updated the pipeline to support herbarium data integration. L.B., R.P.G., B.J.S., R.W., and J.D. all helped write the paper.

DATA ACCESSIBILITY
The original input CSV file for the Prunus serotina example data set is archived on Zenodo (https://doi.org/10.5281/zenodo.1473702).