1
|
Miralles A, Bruy T, Wolcott K, Scherz MD, Begerow D, Beszteri B, Bonkowski M, Felden J, Gemeinholzer B, Glaw F, Glöckner FO, Hawlitschek O, Kostadinov I, Nattkemper TW, Printzen C, Renz J, Rybalka N, Stadler M, Weibulat T, Wilke T, Renner SS, Vences M. Repositories for Taxonomic Data: Where We Are and What is Missing. Syst Biol 2020; 69:1231-1253. [PMID: 32298457 PMCID: PMC7584136 DOI: 10.1093/sysbio/syaa026] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2019] [Revised: 02/20/2020] [Accepted: 03/24/2020] [Indexed: 12/05/2022] Open
Abstract
Natural history collections are leading successful large-scale projects of specimen digitization (images, metadata, DNA barcodes), thereby transforming taxonomy into a big data science. Yet, little effort has been directed towards safeguarding and subsequently mobilizing the considerable amount of original data generated during the process of naming 15,000-20,000 species every year. From the perspective of alpha-taxonomists, we provide a review of the properties and diversity of taxonomic data, assess their volume and use, and establish criteria for optimizing data repositories. We surveyed 4113 alpha-taxonomic studies in representative journals for 2002, 2010, and 2018, and found an increasing yet comparatively limited use of molecular data in species diagnosis and description. In 2018, of the 2661 papers published in specialized taxonomic journals, molecular data were widely used in mycology (94%), regularly in vertebrates (53%), but rarely in botany (15%) and entomology (10%). Images play an important role in taxonomic research on all taxa, with photographs used in >80% and drawings in 58% of the surveyed papers. The use of omics (high-throughput) approaches or 3D documentation is still rare. Improved archiving strategies for metabarcoding consensus reads, genome and transcriptome assemblies, and chemical and metabolomic data could help to mobilize the wealth of high-throughput data for alpha-taxonomy. Because long-term-ideally perpetual-data storage is of particular importance for taxonomy, energy footprint reduction via less storage-demanding formats is a priority if their information content suffices for the purpose of taxonomic studies. Whereas taxonomic assignments are quasifacts for most biological disciplines, they remain hypotheses pertaining to evolutionary relatedness of individuals for alpha-taxonomy. For this reason, an improved reuse of taxonomic data, including machine-learning-based species identification and delimitation pipelines, requires a cyberspecimen approach-linking data via unique specimen identifiers, and thereby making them findable, accessible, interoperable, and reusable for taxonomic research. This poses both qualitative challenges to adapt the existing infrastructure of data centers to a specimen-centered concept and quantitative challenges to host and connect an estimated $ \le $2 million images produced per year by alpha-taxonomic studies, plus many millions of images from digitization campaigns. Of the 30,000-40,000 taxonomists globally, many are thought to be nonprofessionals, and capturing the data for online storage and reuse therefore requires low-complexity submission workflows and cost-free repository use. Expert taxonomists are the main stakeholders able to identify and formalize the needs of the discipline; their expertise is needed to implement the envisioned virtual collections of cyberspecimens. [Big data; cyberspecimen; new species; omics; repositories; specimen identifier; taxonomy; taxonomic data.].
Collapse
Affiliation(s)
- Aurélien Miralles
- Departement Origins and Evolution, Institut Systématique, Evolution, Biodiversité (ISYEB), Muséum national d’Histoire naturelle, CNRS, Sorbonne Université, EPHE, 57 rue Cuvier, CP50, 75005 Paris, France
- Systematic Botany and Mycology, University of Munich (LMU), Menzingerstraße 67, 80638 Munich, Germany
| | - Teddy Bruy
- Departement Origins and Evolution, Institut Systématique, Evolution, Biodiversité (ISYEB), Muséum national d’Histoire naturelle, CNRS, Sorbonne Université, EPHE, 57 rue Cuvier, CP50, 75005 Paris, France
- Systematic Botany and Mycology, University of Munich (LMU), Menzingerstraße 67, 80638 Munich, Germany
| | - Katherine Wolcott
- Systematic Botany and Mycology, University of Munich (LMU), Menzingerstraße 67, 80638 Munich, Germany
- National Museum of Natural History, Smithsonian Institution, Washington, DC, USA
| | - Mark D Scherz
- Department of Herpetology, Zoologische Staatssammlung München (ZSM-SNSB), Münchhausenstraße 21, 81247 München, Germany
- Department of Biology, Universität Konstanz, Universitätstraße 10, 78464 Konstanz, Germany
| | - Dominik Begerow
- Department of Geobotany, Ruhr-University Bochum, Universitätsstraße 150, 44780 Bochum, Germany
| | - Bank Beszteri
- Department of Phycology, Faculty of Biology, University of Duisburg-Essen, Universitätsstraße 2, 45141 Essen, Germany
| | - Michael Bonkowski
- Department of Terrestrial Ecology, Center of Excellence in Plant Sciences (CEPLAS), Terrestrial Ecology, Institute of Zoology, University of Cologne, 50674 Köln, Germany
| | - Janine Felden
- MARUM - Center for Marine Environmental Sciences, University of Bremen, Leobenerstraße 8, 28359 Bremen, Germany
- Alfred Wegener Institute - Helmholtz Center for Polar- and Marine Research, Am Handelshafen 12, 27570 Bremerhaven, Germany
| | - Birgit Gemeinholzer
- Department of Systematic Botany, Justus Liebig University Gießen, Heinrich-Buff Ring 38, 35392 Giessen, Germany
| | - Frank Glaw
- Department of Herpetology, Zoologische Staatssammlung München (ZSM-SNSB), Münchhausenstraße 21, 81247 München, Germany
| | - Frank Oliver Glöckner
- Alfred Wegener Institute - Helmholtz Center for Polar- and Marine Research, Am Handelshafen 12, 27570 Bremerhaven, Germany
| | - Oliver Hawlitschek
- Department of Herpetology, Zoologische Staatssammlung München (ZSM-SNSB), Münchhausenstraße 21, 81247 München, Germany
- Department of Scientific Infrastructure, Centrum für Naturkunde (CeNak), Universität Hamburg, Martin-Luther-King-Platz 3, 20146 Hamburg, Germany
| | - Ivaylo Kostadinov
- GFBio - Gesellschaft für Biologische Daten e.V., c/o Research II, Campus Ring 1, 28759 Bremen, Germany
| | - Tim W Nattkemper
- Biodata Mining Group, Center of Biotechnology (CeBiTec), Bielefeld University, PO Box 100131, 33501 Bielefeld, Germany
| | - Christian Printzen
- Department of Botany and Molecular Evolution, Senckenberg Research Institute and Natural History Museum Frankfurt, Senckenberganlage 25, 60325 Frankfurt/Main, Germany
| | - Jasmin Renz
- Zooplankton Research Group, DZMB – Senckenberg am Meer, Martin-Luther-King Platz 3, 20146 Hamburg, Germany
| | - Nataliya Rybalka
- Department of Experimental Phycology and Culture Collection of Algae, University Göttingen, Nikolausberger-Weg 18, 37073 Göttingen, Germany
| | - Marc Stadler
- Department Microbial Drugs, Helmholtz Centre for Infection Research (HZI), and German Centre for Infection Research (DZIF), Partner Site Hannover-Braunschweig, Inhoffenstrasse 7, 38124 Braunschweig, Germany
| | - Tanja Weibulat
- GFBio - Gesellschaft für Biologische Daten e.V., c/o Research II, Campus Ring 1, 28759 Bremen, Germany
| | - Thomas Wilke
- Department of Animal Ecology and Systematics, Justus Liebig University Gießen, Heinrich-Buff Ring 26, 35392 Giessen, Germany
| | - Susanne S Renner
- Systematic Botany and Mycology, University of Munich (LMU), Menzingerstraße 67, 80638 Munich, Germany
| | - Miguel Vences
- Department of Evolutionary Biology, Zoological Institute, Technische Universität Braunschweig, Mendelssohnstraße 4, 38106 Braunschweig, Germany
| |
Collapse
|
2
|
Borsch T, Stevens AD, Häffner E, Güntsch A, Berendsohn WG, Appelhans M, Barilaro C, Beszteri B, Blattner F, Bossdorf O, Dalitz H, Dressler S, Duque-Thüs R, Esser HJ, Franzke A, Goetze D, Grein M, Grünert U, Hellwig F, Hentschel J, Hörandl E, Janßen T, Jürgens N, Kadereit G, Karisch T, Koch M, Müller F, Müller J, Ober D, Porembski S, Poschlod P, Printzen C, Röser M, Sack P, Schlüter P, Schmidt M, Schnittler M, Scholler M, Schultz M, Seeber E, Simmel J, Stiller M, Thiv M, Thüs H, Tkach N, Triebel D, Warnke U, Weibulat T, Wesche K, Yurkov A, Zizka G. A complete digitization of German herbaria is possible, sensible and should be started now. RIO 2020. [DOI: 10.3897/rio.6.e50675] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Plants, fungi and algae are important components of global biodiversity and are fundamental to all ecosystems. They are the basis for human well-being, providing food, materials and medicines. Specimens of all three groups of organisms are accommodated in herbaria, where they are commonly referred to as botanical specimens.
The large number of specimens in herbaria provides an ample, permanent and continuously improving knowledge base on these organisms and an indispensable source for the analysis of the distribution of species in space and time critical for current and future research relating to global biodiversity. In order to make full use of this resource, a research infrastructure has to be built that grants comprehensive and free access to the information in herbaria and botanical collections in general. This can be achieved through digitization of the botanical objects and associated data.
The botanical research community can count on a long-standing tradition of collaboration among institutions and individuals. It agreed on data standards and standard services even before the advent of computerization and information networking, an example being the Index Herbariorum as a global registry of herbaria helping towards the unique identification of specimens cited in the literature.
In the spirit of this collaborative history, 51 representatives from 30 institutions advocate to start the digitization of botanical collections with the overall wall-to-wall digitization of the flat objects stored in German herbaria. Germany has 70 herbaria holding almost 23 million specimens according to a national survey carried out in 2019. 87% of these specimens are not yet digitized. Experiences from other countries like France, the Netherlands, Finland, the US and Australia show that herbaria can be comprehensively and cost-efficiently digitized in a relatively short time due to established workflows and protocols for the high-throughput digitization of flat objects.
Most of the herbaria are part of a university (34), fewer belong to municipal museums (10) or state museums (8), six herbaria belong to institutions also supported by federal funds such as Leibniz institutes, and four belong to non-governmental organizations. A common data infrastructure must therefore integrate different kinds of institutions.
Making full use of the data gained by digitization requires the set-up of a digital infrastructure for storage, archiving, content indexing and networking as well as standardized access for the scientific use of digital objects. A standards-based portfolio of technical components has already been developed and successfully tested by the Biodiversity Informatics Community over the last two decades, comprising among others access protocols, collection databases, portals, tools for semantic enrichment and annotation, international networking, storage and archiving in accordance with international standards. This was achieved through the funding by national and international programs and initiatives, which also paved the road for the German contribution to the Global Biodiversity Information Facility (GBIF).
Herbaria constitute a large part of the German botanical collections that also comprise living collections in botanical gardens and seed banks, DNA- and tissue samples, specimens preserved in fluids or on microscope slides and more. Once the herbaria are digitized, these resources can be integrated, adding to the value of the overall research infrastructure. The community has agreed on tasks that are shared between the herbaria, as the German GBIF model already successfully demonstrates.
We have compiled nine scientific use cases of immediate societal relevance for an integrated infrastructure of botanical collections. They address accelerated biodiversity discovery and research, biomonitoring and conservation planning, biodiversity modelling, the generation of trait information, automated image recognition by artificial intelligence, automated pathogen detection, contextualization by interlinking objects, enabling provenance research, as well as education, outreach and citizen science.
We propose to start this initiative now in order to valorize German botanical collections as a vital part of a worldwide biodiversity data pool.
Collapse
|
3
|
Gemeinholzer B, Vences M, Beszteri B, Bruy T, Felden J, Kostadinov I, Miralles A, Nattkemper TW, Printzen C, Renz J, Rybalka N, Schuster T, Weibulat T, Wilke T, Renner SS. Data storage and data re-use in taxonomy—the need for improved storage and accessibility of heterogeneous data. ORG DIVERS EVOL 2020. [DOI: 10.1007/s13127-019-00428-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
AbstractThe ability to rapidly generate and share molecular, visual, and acoustic data, and to compare them with existing information, and thereby to detect and name biological entities is fundamentally changing our understanding of evolutionary relationships among organisms and is also impacting taxonomy. Harnessing taxonomic data for rapid, automated species identification by machine learning tools or DNA metabarcoding techniques has great potential but will require their review, accessible storage, comprehensive comparison, and integration with prior knowledge and information. Currently, data production, management, and sharing in taxonomic studies are not keeping pace with these needs. Indeed, a survey of recent taxonomic publications provides evidence that few species descriptions in zoology and botany incorporate DNA sequence data. The use of modern high-throughput (-omics) data is so far the exception in alpha-taxonomy, although they are easily stored in GenBank and similar databases. By contrast, for the more routinely used image data, the problem is that they are rarely made available in openly accessible repositories. Improved sharing and re-using of both types of data requires institutions that maintain long-term data storage and capacity with workable, user-friendly but highly automated pipelines. Top priority should be given to standardization and pipeline development for the easy submission and storage of machine-readable data (e.g., images, audio files, videos, tables of measurements). The taxonomic community in Germany and the German Federation for Biological Data are researching options for a higher level of automation, improved linking among data submission and storage platforms, and for making existing taxonomic information more readily accessible.
Collapse
|
4
|
Harjes J, Link A, Weibulat T, Triebel D, Rambold G. FAIR digital objects in environmental and life sciences should comprise workflow operation design data and method information for repeatability of study setups and reproducibility of results. Database (Oxford) 2020; 2020:5894776. [PMID: 32815545 PMCID: PMC7439577 DOI: 10.1093/database/baaa059] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/19/2020] [Revised: 07/01/2020] [Accepted: 07/07/2020] [Indexed: 12/23/2022]
Abstract
Repeatability of study setups and reproducibility of research results by underlying data are major requirements in science. Until now, abstract models for describing the structural logic of studies in environmental sciences are lacking and tools for data management are insufficient. Mandatory for repeatability and reproducibility is the use of sophisticated data management solutions going beyond data file sharing. Particularly, it implies maintenance of coherent data along workflows. Design data concern elements from elementary domains of operations being transformation, measurement and transaction. Operation design elements and method information are specified for each consecutive workflow segment from field to laboratory campaigns. The strict linkage of operation design element values, operation values and objects is essential. For enabling coherence of corresponding objects along consecutive workflow segments, the assignment of unique identifiers and the specification of their relations are mandatory. The abstract model presented here addresses these aspects, and the software DiversityDescriptions (DWB-DD) facilitates the management of thusly connected digital data objects and structures. DWB-DD allows for an individual specification of operation design elements and their linking to objects. Two workflow design use cases, one for DNA barcoding and another for cultivation of fungal isolates, are given. To publish those structured data, standard schema mapping and XML-provision of digital objects are essential. Schemas useful for this mapping include the Ecological Markup Language, the Schema for Meta-omics Data of Collection Objects and the Standard for Structured Descriptive Data. Data pipelines with DWB-DD include the mapping and conversion between schemas and functions for data publishing and archiving according to the Open Archival Information System standard. The setting allows for repeatability of study setups, reproducibility of study results and for supporting work groups to structure and maintain their data from the beginning of a study. The theory of ‘FAIR++’ digital objects is introduced.
Collapse
Affiliation(s)
- Janno Harjes
- University of Bayreuth, Universitätsstraße 30, 95440 Bayreuth, Germany
| | - Anton Link
- Staatliche Naturwissenschaftliche Sammlungen Bayerns, Menzinger Straße 67, 80638 München, Germany
| | - Tanja Weibulat
- Staatliche Naturwissenschaftliche Sammlungen Bayerns, Menzinger Straße 67, 80638 München, Germany.,German Federation for Biological Data e. V., Campus Ring 1, 28759 Bremen, Germany
| | - Dagmar Triebel
- Staatliche Naturwissenschaftliche Sammlungen Bayerns, Menzinger Straße 67, 80638 München, Germany.,German Federation for Biological Data e. V., Campus Ring 1, 28759 Bremen, Germany
| | - Gerhard Rambold
- University of Bayreuth, Universitätsstraße 30, 95440 Bayreuth, Germany
| |
Collapse
|