1
Noll NW, Scherber C, Schäffler L. taxalogue: a toolkit to create comprehensive CO1 reference databases. PeerJ 2023; 11:e16253. [PMID: 38077427 PMCID: PMC10702336 DOI: 10.7717/peerj.16253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2023] [Accepted: 09/18/2023] [Indexed: 12/18/2023]
Abstract
Background: Taxonomic identification through DNA barcodes gained considerable traction through the invention of next-generation sequencing and DNA metabarcoding. Metabarcoding allows for the simultaneous identification of thousands of organisms from bulk samples with high taxonomic resolution. However, reliable identifications can only be achieved with comprehensive and curated reference databases. Therefore, custom reference databases are often created to meet the needs of specific research questions. Due to taxonomic inconsistencies, formatting issues, and technical difficulties, building a custom reference database requires tremendous effort. Here, we present taxalogue, an easy-to-use software for creating comprehensive and customized reference databases that provide clean and taxonomically harmonized records. In combination with extensive geographical filtering options, taxalogue opens up new possibilities for generating and testing evolutionary hypotheses.
Methods: taxalogue collects DNA sequences from several online sources and combines them into a reference database. Taxonomic incongruencies between the different data sources can be harmonized according to available taxonomies. Dereplication and various filtering options are available regarding sequence quality or metadata information. taxalogue is implemented in the open-source Ruby programming language, and the source code is available at https://github.com/nwnoll/taxalogue. We benchmark four reference databases by sequence identity against eight queries from different localities and trapping devices. Subsamples from each reference database were used to compare how well another one is covered.
Results: taxalogue produces reference databases with the best coverage at high identities for most tested queries, enabling more accurate, reliable predictions with higher certainty than the other benchmarked reference databases. Additionally, the performance of taxalogue is more consistent while providing good coverage for a variety of habitats, regions, and sampling methods. taxalogue simplifies the creation of reference databases and makes the process reproducible and transparent. Multiple available output formats for commonly used downstream applications facilitate the easy adoption of taxalogue in many different software pipelines. The resulting reference databases improve the taxonomic classification accuracy through high coverage of the query sequences at high identities.
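As an illustration of the dereplication step mentioned in the abstract, the following is a minimal Python sketch, not taxalogue's own (Ruby) implementation. It assumes that two records count as replicates when their sequences are identical after upper-casing; the function name `dereplicate`, the record layout, and the sample identifiers and sequences are all hypothetical.

```python
def dereplicate(records):
    """Collapse records that share an identical sequence (case-insensitive).

    `records` is an iterable of (record_id, taxon, sequence) tuples.
    This layout is an assumption made for the sketch.
    """
    unique = {}
    for record_id, taxon, sequence in records:
        key = sequence.upper()
        if key not in unique:
            # Keep the first record seen for each distinct sequence.
            unique[key] = {"id": record_id, "taxon": taxon, "duplicates": []}
        else:
            # Remember which identifiers were collapsed into the kept record.
            unique[key]["duplicates"].append(record_id)
    return unique


# Hypothetical records from two sources; the sequences are toy examples.
records = [
    ("src1|A", "Apis mellifera", "ACGTACGT"),
    ("src2|B", "Apis mellifera", "acgtacgt"),
    ("src2|C", "Bombus terrestris", "ACGTTCGT"),
]
deduped = dereplicate(records)
```

A real pipeline would probably also need a rule for replicates that carry conflicting taxon labels; the sketch simply keeps the first label it sees.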
Affiliation(s)
- Niklas W. Noll
- Centre for Biodiversity Monitoring and Conservation Science, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, North Rhine-Westphalia, Germany
- Christoph Scherber
- Centre for Biodiversity Monitoring and Conservation Science, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, North Rhine-Westphalia, Germany
- Livia Schäffler
- Centre for Biodiversity Monitoring and Conservation Science, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, North Rhine-Westphalia, Germany
2
Seah BKB. Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers. Biodivers Data J 2023; 11:e114076. [PMID: 38312332 PMCID: PMC10838036 DOI: 10.3897/bdj.11.e114076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Accepted: 11/06/2023] [Indexed: 02/06/2024]
Abstract
Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (same taxon under different names). Therefore, most projects will require some curation to ensure that taxon identifiers are correctly linked. Unfortunately, formal guidance on such curation is uncommon and these steps are often ad hoc and poorly documented, which hinders transparency and reproducibility, yet the task requires specialist knowledge and cannot be easily automated without careful validation. Here, we present a case study on linking identifiers between the GBIF and NCBI taxonomies for a species checklist. This represents a common scenario: finding published sequence data (from NCBI) for species chosen by occurrence or geographical distribution (from GBIF). Wikidata, a publicly editable knowledge base of structured data, can serve as an additional information source for identifier linking. We suggest a software toolkit for taxon name-matching and data-cleaning, describe common issues encountered during curation and propose concrete steps to address them. For example, about 2.8% of the taxa in our dataset had wrong identifiers linked on Wikidata because of errors in name-matching caused by homonyms. By correcting such errors during data-cleaning, either directly (through editing Wikidata) or indirectly (by reporting errors in GBIF or NCBI), we crowdsource the curation and contribute to community resources, thereby improving the quality of downstream analyses.
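The homonym problem described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the toolkit the article proposes. It assumes each record carries a name and a kingdom, links a taxon only when exactly one same-kingdom candidate exists, and sends everything else to manual review. All identifiers are made up; the genus name Morus is used because it is shared by mulberries (plants) and gannets (birds).

```python
from collections import defaultdict


def link_by_name(checklist, target):
    """Link checklist taxa to target taxa by name, guarding against homonyms.

    Both inputs are lists of dicts with "id", "name" and "kingdom" keys
    (a hypothetical layout). Returns (links, needs_review).
    """
    index = defaultdict(list)
    for entry in target:
        index[entry["name"]].append(entry)

    links, needs_review = {}, []
    for taxon in checklist:
        candidates = index.get(taxon["name"], [])
        # A bare name match is not enough: require the kingdom to agree too.
        same_kingdom = [c for c in candidates if c["kingdom"] == taxon["kingdom"]]
        if len(same_kingdom) == 1:
            links[taxon["id"]] = same_kingdom[0]["id"]
        else:
            # No match, or still ambiguous: leave it for a human curator.
            needs_review.append(taxon["id"])
    return links, needs_review


# Made-up identifiers for illustration.
checklist = [
    {"id": "A1", "name": "Morus", "kingdom": "Plantae"},
    {"id": "A2", "name": "Apis mellifera", "kingdom": "Animalia"},
    {"id": "A3", "name": "Exemplum fictum", "kingdom": "Animalia"},
]
target = [
    {"id": "B10", "name": "Morus", "kingdom": "Animalia"},
    {"id": "B11", "name": "Morus", "kingdom": "Plantae"},
    {"id": "B12", "name": "Apis mellifera", "kingdom": "Animalia"},
]
links, needs_review = link_by_name(checklist, target)
```

In practice the two taxonomies may label higher ranks differently, so the kingdom comparison would itself need a mapping; the sketch ignores this.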
Affiliation(s)
- Brandon Kwee Boon Seah
- Thünen Institute for Biodiversity, Braunschweig, Germany
3
Page R. Ten years and a million links: building a global taxonomic library connecting persistent identifiers for names, publications and people. Biodivers Data J 2023; 11:e107914. [PMID: 37745899 PMCID: PMC10514697 DOI: 10.3897/bdj.11.e107914] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/13/2023] [Accepted: 09/01/2023] [Indexed: 09/26/2023]
Abstract
A major gap in the biodiversity knowledge graph is a connection between taxonomic names and the taxonomic literature. While both names and publications often have persistent identifiers (PIDs), such as Life Science Identifiers (LSIDs) or Digital Object Identifiers (DOIs), LSIDs for names are rarely linked to DOIs for publications. This article describes efforts to make those connections across three large taxonomic databases: Index Fungorum, International Plant Names Index (IPNI) and the Index of Organism Names (ION). Over a million names have been matched to DOIs or other persistent identifiers for taxonomic publications. This represents approximately 36% of names for which publication data are available. The mappings between LSIDs and publication PIDs are made available through ChecklistBank. Applications of this mapping are discussed, including a web app to locate the citation of a taxonomic name and a knowledge graph that uses data on researcher ORCID ids to connect taxonomic names and publications to authors of those names.
Affiliation(s)
- Roderic Page
- University of Glasgow, Glasgow, United Kingdom
4
Patterson D. The scope and scale of the life sciences (‘Nature’s envelope’). Research Ideas and Outcomes 2022. [DOI: 10.3897/rio.8.e96132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
Abstract
The extension of biology with a more data-centric component offers new opportunities for discovery. To enable investigations that rely on third-party data, the infrastructure that retains data and allows their re-use should, arguably, enable transactions that relate to any and all biological processes. The assembly of such a service-oriented and enabling infrastructure is challenging. Part of the challenge is to factor in the scope and scale of biological processes. From this foundation can emerge an estimate of the number of discipline-specific centres which will gather data in their given area of interest and prepare them for a path that will lead to trusted, persistent data repositories which will make fit-for-purpose data available for re-use. A simple model is presented for the scope and scale of life sciences. It can accommodate all known processes conducted by or caused by any and all organisms. It is depicted on a grid, the axes of which are (x) the durations of the processes and (y) the sizes of participants involved. Both axes are presented in log10 scales, and the grid is divided into decadal blocks with tenfold increments of time and size. Processes range in duration from 10⁻¹⁷ seconds to 3.5 billion years or more, and the sizes of participants range from 10⁻¹⁵ to 1.3 × 10⁷ metres. Examples are given to illustrate the diversity of biological processes and their often inexact character. About half of the blocks within the grid do not contain known processes. The blocks that include biological processes amount to ‘Nature’s envelope’, a valuable rhetorical device onto which subdisciplines and existing initiatives may be mapped, and from which can be derived some key requirements for a comprehensive data infrastructure.
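To get a feel for the size of the grid, the ranges quoted in the abstract can be converted into counts of decadal blocks. The Python sketch below is a back-of-the-envelope calculation, not a figure taken from the article. It assumes a Julian year of 365.25 days, treats "3.5 billion years or more" as exactly 3.5 billion years, and counts every block from the one starting at the lower bound up to the one containing the upper bound. The helper name `decadal_blocks` is hypothetical.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year (an assumption)


def decadal_blocks(lower_exponent, upper_value):
    """Count tenfold blocks from 10**lower_exponent up to the block holding upper_value."""
    upper_exponent = math.floor(math.log10(upper_value))
    return upper_exponent - lower_exponent + 1


# Durations: 1e-17 s up to 3.5 billion years, expressed in seconds.
duration_blocks = decadal_blocks(-17, 3.5e9 * SECONDS_PER_YEAR)

# Sizes: 1e-15 m up to 1.3e7 m.
size_blocks = decadal_blocks(-15, 1.3e7)

total_blocks = duration_blocks * size_blocks
print(duration_blocks, size_blocks, total_blocks)
```

Under these assumptions the grid spans 35 blocks of duration by 23 blocks of size, or 805 blocks in total. The article's own grid may draw its boundaries differently.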
5
Seebens H, Kaplan E. DASCO: A workflow to downscale alien species checklists using occurrence records and to re-allocate species distributions across realms. NeoBiota 2022. [DOI: 10.3897/neobiota.74.81082] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]
Abstract
Information about occurrences of alien species is often provided in so-called checklists, which are lists of reported alien species in a region. In many cases, available checklists cover whole countries, which is too coarse for many analyses and limits capabilities of assessing status and trends of biological invasions. Information about point-wise occurrences is available in large quantities at online facilities such as GBIF and OBIS, which, however, do not provide information about the invasion status of individual populations. To close this gap, we here provide a semi-automated workflow called DASCO to downscale regional checklists using occurrence records obtained from GBIF and OBIS. Within the workflow, coordinate-based occurrence records for species listed in the provided regional checklists are obtained from GBIF and OBIS, and the status of being an alien population is assigned using the information in the provided checklists. In this way, information in checklists is made available at the local scale, which can then be re-allocated to any other spatial categorisation as provided by the user. In addition, habitats of species are determined to distinguish between marine, brackish, terrestrial, and freshwater species, which allows splitting the provided checklists to the respective realms and ecoregions. By using checklists of global databases, we showcase the usage of the DASCO workflow and reveal > 35 million occurrence records of alien populations in terrestrial and marine regions worldwide, which were back-transformed to terrestrial and marine regions for comparison. DASCO has the potential to be used as a basis for the widely applied species distribution models or assessments of status and trends of biological invasions at large geographic scales. The workflow is implemented in R and in full compliance with the FAIR data principles of open science.
6
Peyton J, Hadjistylli M, Tziortzis I, Erotokritou E, Demetriou M, Samuel Y, Anastasi V, Fyttis G, Hadjioannou L, Ieronymidou C, Kassinis N, Kleitou P, Kletou D, Mandoulaki A, Michailidis N, Papatheodoulou A, Payiattas G, Sparrow D, Sparrow R, Turvey K, Tzirkalli E, Varnava AI, Pescott OL. Using expert-elicitation to deliver biodiversity monitoring priorities on a Mediterranean island. PLoS One 2022; 17:e0256777. [PMID: 35324899 PMCID: PMC8947143 DOI: 10.1371/journal.pone.0256777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2021] [Accepted: 02/24/2022] [Indexed: 11/24/2022]
Abstract
Biodiversity monitoring plays an essential role in tracking changes in ecosystems, species distributions and abundances across the globe. Data collected through both structured and unstructured biodiversity recording can inform conservation measures designed to reduce, prevent, and reverse declines in valued biodiversity of many types. However, given that resources for biodiversity monitoring are limited, it is important that funding bodies prioritise investments relative to the requirements in any given region. We addressed this prioritisation requirement for a biodiverse Mediterranean island (Cyprus) using a three-stage process of expert-elicitation. This resulted in a structured list of twenty biodiversity monitoring needs; specifically, a hierarchy of three groups of these needs was created using a consensus approach. The most highly prioritised biodiversity monitoring needs were those related to the development of robust survey methodologies, and those ensuring that sufficiently skilled citizens are available to contribute. We discuss ways that the results of our expert-elicitation process could be used to support current and future biodiversity monitoring in Cyprus.
Affiliation(s)
- J. Peyton
- UK Centre for Ecology & Hydrology, Wallingford, United Kingdom
- M. Hadjistylli
- Department of Agriculture, Ministry of Agriculture, Rural Development and Environment, Lefkosia, Cyprus
- I. Tziortzis
- Water Development Department, Ministry of Agriculture, Rural Development and Environment, Lefkosia, Cyprus
- E. Erotokritou
- Department of Environment, Ministry of Agriculture, Rural Development and Environment, Lefkosia, Cyprus
- M. Demetriou
- Department of Biological Sciences, University of Cyprus, Lefkosia, Cyprus
- Y. Samuel
- Department of Biological Sciences, University of Cyprus, Lefkosia, Cyprus
- Oceanography Centre, University of Cyprus, Lefkosia, Cyprus
- V. Anastasi
- Terra Cypria - The Cyprus Conservation Foundation, Lefkosia, Cyprus
- BirdLife Cyprus, Nicosia, Cyprus
- G. Fyttis
- Department of Biological Sciences, University of Cyprus, Lefkosia, Cyprus
- I.A.CO Environmental & Water Consultants Ltd., Lefkosia, Cyprus
- L. Hadjioannou
- Enalia Physis Environmental Research Centre, Lefkosia, Cyprus
- CMMI – Cyprus Marine and Maritime Institute, Larnaca, Cyprus
- N. Kassinis
- Game and Fauna Service, Ministry of Interior, Lefkosia, Cyprus
- P. Kleitou
- Marine & Environmental Research (MER) Lab, Lemesos, Cyprus
- School of Biological and Marine Sciences, University of Plymouth, Plymouth, United Kingdom
- D. Kletou
- Marine & Environmental Research (MER) Lab, Lemesos, Cyprus
- Department of Maritime Transport and Commerce, Frederick University, Lemesos, Cyprus
- A. Mandoulaki
- Department of Agricultural Sciences, Biotechnology and Food Science, Cyprus University of Technology, Lemesos, Cyprus
- N. Michailidis
- Department of Fisheries and Marine Research, Ministry of Agriculture, Rural Development and Environment, Lefkosia, Cyprus
- G. Payiattas
- Department of Fisheries and Marine Research, Ministry of Agriculture, Rural Development and Environment, Lefkosia, Cyprus
- D. Sparrow
- Cyprus Dragonfly Study Group, Pafos, Cyprus
- R. Sparrow
- Cyprus Dragonfly Study Group, Pafos, Cyprus
- K. Turvey
- UK Centre for Ecology & Hydrology, Wallingford, United Kingdom
- E. Tzirkalli
- School of Pure and Applied Sciences, Open University of Cyprus, Nicosia, Cyprus
- Department of Biological Applications and Technology, University of Ioannina, Ioannina, Greece
- A. I. Varnava
- Department of Agricultural Sciences, Biotechnology and Food Science, Cyprus University of Technology, Lemesos, Cyprus
- O. L. Pescott
- UK Centre for Ecology & Hydrology, Wallingford, United Kingdom
7
Mandeville CP, Koch W, Nilsen EB, Finstad AG. Open Data Practices among Users of Primary Biodiversity Data. Bioscience 2021; 71:1128-1147. [PMID: 34733117 PMCID: PMC8560312 DOI: 10.1093/biosci/biab072] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022]
Abstract
Presence-only biodiversity data are increasingly relied on in biodiversity, ecology, and conservation research, driven by growing digital infrastructures that support open data sharing and reuse. Recent reviews of open biodiversity data have clearly documented the value of data sharing, but the extent to which the biodiversity research community has adopted open data practices remains unclear. We address this question by reviewing applications of presence-only primary biodiversity data, drawn from a variety of sources beyond open databases, in the indexed literature. We characterize how frequently researchers access open data relative to data from other sources, how often they share newly generated or collated data, and trends in metadata documentation and data citation. Our results indicate that biodiversity research commonly relies on presence-only data that are not openly available and neglects to make such data available. Improved data sharing and documentation will increase the value, reusability, and reproducibility of biodiversity research.
Affiliation(s)
- Caitlin P Mandeville
- Department of Natural History, Norwegian University of Science and Technology, Trondheim, Norway
- Wouter Koch
- Department of Natural History, Norwegian University of Science and Technology, Trondheim, Norway
- Erlend B Nilsen
- Faculty of Biosciences and Aquaculture, Nord University, Steinkjer, Norway
- Anders G Finstad
- Department of Natural History, Norwegian University of Science and Technology, Trondheim, Norway
8
Sterner B, Upham N, Gupta P, Powell C, Franz N. Wanted: Standards for FAIR taxonomic concept representations and relationships. Biodiversity Information Science and Standards 2021; 5. [PMID: 35462676 PMCID: PMC9028594 DOI: 10.3897/biss.5.75587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
Abstract
Making the most of biodiversity data requires linking observations of biological species from multiple sources both efficiently and accurately (Bisby 2000, Franz et al. 2016). Aggregating occurrence records using taxonomic names and synonyms is computationally efficient but known to experience significant limitations on accuracy when the assumption of one-to-one relationships between names and biological entities breaks down (Remsen 2016, Franz and Sterner 2018). Taxonomic treatments and checklists provide authoritative information about the correct usage of names for species, including operational representations of the meanings of those names in the form of range maps, reference genetic sequences, or diagnostic traits. They increasingly provide taxonomic intelligence in the form of precise description of the semantic relationships between different published names in the literature. Making this authoritative information Findable, Accessible, Interoperable, and Reusable (FAIR; Wilkinson et al. 2016) would be a transformative advance for biodiversity data sharing and help drive adoption and novel extensions of existing standards such as the Taxonomic Concept Schema and the OpenBiodiv Ontology (Kennedy et al. 2006, Senderov et al. 2018). We call for the greater, global Biodiversity Information Standards (TDWG) and taxonomy community to commit to extending and expanding on how FAIR applies to biodiversity data and include practical targets and criteria for the publication and digitization of taxonomic concept representations and alignments in taxonomic treatments, checklists, and backbones.
9
Thessen AE, Bogdan P, Patterson DJ, Casey TM, Hinojo-Hinojo C, de Lange O, Haendel MA. From Reductionism to Reintegration: Solving society's most pressing problems requires building bridges between data types across the life sciences. PLoS Biol 2021; 19:e3001129. [PMID: 33770077 PMCID: PMC7997011 DOI: 10.1371/journal.pbio.3001129] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/18/2022]
Abstract
Decades of reductionist approaches in biology have achieved spectacular progress, but the proliferation of subdisciplines, each with its own technical and social practices regarding data, impedes the growth of the multidisciplinary and interdisciplinary approaches now needed to address pressing societal challenges. Data integration is key to a reintegrated biology able to address global issues such as climate change, biodiversity loss, and sustainable ecosystem management. We identify major challenges to data integration and present a vision for a "Data as a Service"-oriented architecture to promote reuse of data for discovery. The proposed architecture includes standards development, new tools and services, and strategies for career-development and sustainability.
Affiliation(s)
- Anne E. Thessen
- Department of Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, United States of America
- Paul Bogdan
- Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, United States of America
- Theresa M. Casey
- Department of Animal Sciences, Purdue University, West Lafayette, Indiana, United States of America
- César Hinojo-Hinojo
- Department of Earth System Science, University of California, Irvine, California, United States of America
- Orlando de Lange
- Department of Electrical Engineering, University of Washington, Seattle, Washington, United States of America
- Melissa A. Haendel
- Department of Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, United States of America
10
Sterner BW, Gilbert EE, Franz NM. Decentralized but Globally Coordinated Biodiversity Data. Front Big Data 2021; 3:519133. [PMID: 33693407 PMCID: PMC7931950 DOI: 10.3389/fdata.2020.519133] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/10/2019] [Accepted: 08/31/2020] [Indexed: 11/22/2022]
Abstract
Centralized biodiversity data aggregation is too often failing societal needs due to pervasive and systemic data quality deficiencies. We argue for a novel approach that embodies the spirit of the Web (“small pieces loosely joined”) through the decentralized coordination of data across scientific languages and communities. The upfront cost of decentralization can be offset by the long-term benefit of achieving sustained expert engagement, higher-quality data products, and ultimately more societal impact for biodiversity data. Our decentralized approach encourages the emergence and evolution of multiple self-identifying communities of practice that are regionally, taxonomically, or institutionally localized. Each community is empowered to control the social and informational design and versioning of their local data infrastructures and signals. With no single aggregator to exert centralized control over biodiversity data, decentralization generates loosely connected networks of mid-level aggregators. Global coordination is nevertheless feasible through automatable data sharing agreements that enable efficient propagation and translation of biodiversity data across communities. The decentralized model also poses novel integration challenges, among which the explicit and continuous articulation of conflicting systematic classifications and phylogenies remain the most challenging. We discuss the development of available solutions, challenges, and outline next steps: the global effort of coordination should focus on developing shared languages for data signal translation, as opposed to homogenizing the data signal itself.
Affiliation(s)
- Beckett W Sterner
- School of Life Sciences, Arizona State University, Tempe, AZ, United States
- Edward E Gilbert
- School of Life Sciences, Arizona State University, Tempe, AZ, United States
- Nico M Franz
- School of Life Sciences, Arizona State University, Tempe, AZ, United States
11
Norman KEA, Chamberlain S, Boettiger C. taxadb: A high‐performance local taxonomic database interface. Methods Ecol Evol 2020. [DOI: 10.1111/2041-210x.13440] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/28/2022]
Affiliation(s)
- Kari E. A. Norman
- Department of Environmental Science, Policy, and Management, University of California Berkeley, Berkeley, CA, USA
- Scott Chamberlain
- The rOpenSci Project, University of California Berkeley, Berkeley, CA, USA
- Carl Boettiger
- Department of Environmental Science, Policy, and Management, University of California Berkeley, Berkeley, CA, USA
12
Seebens H, Clarke DA, Groom Q, Wilson JRU, García-Berthou E, Kühn I, Roigé M, Pagad S, Essl F, Vicente J, Winter M, McGeoch M. A workflow for standardising and integrating alien species distribution data. NeoBiota 2020. [DOI: 10.3897/neobiota.59.53578] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 11/12/2022]
Abstract
Biodiversity data are being collected at unprecedented rates. Such data often have significant value for purposes beyond the initial reason for which they were collected, particularly when they are combined and collated with other data sources. In the field of invasion ecology, however, integrating data represents a major challenge due to the notorious lack of standardisation of terminologies and categorisations, and the application of deviating concepts of biological invasions. Here, we introduce the SInAS workflow, short for Standardising and Integrating Alien Species data. The SInAS workflow standardises terminologies following Darwin Core, location names using a proposed translation table, taxon names based on the GBIF backbone taxonomy, and dates of first records based on a set of predefined rules. The output of the SInAS workflow provides various entry points that can be used both to improve coherence among the databases and to check and correct the original data. The workflow is flexible and can be easily adapted and extended to the needs of different users. We illustrate the workflow using a case-study integrating five widely used global databases of information on biological invasions. The comparison of the standardised databases revealed a surprisingly low degree of overlap, which indicates that the amount of data may currently not be fully exploited in the original databases. We highly recommend the use and development of publicly available workflows to ensure that the integration of databases is reproducible and transparent. Workflows, such as SInAS, ultimately increase trust in data, study results, and conclusions.
13
Moudrý V, Devillers R. Quality and usability challenges of global marine biodiversity databases: An example for marine mammal data. Ecol Inform 2020. [DOI: 10.1016/j.ecoinf.2020.101051] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 11/16/2022]
14
Sterner B, Witteveen J, Franz N. Coordinating dissent as an alternative to consensus classification: insights from systematics for bio-ontologies. History and Philosophy of the Life Sciences 2020; 42:8. [PMID: 32030540 DOI: 10.1007/s40656-020-0300-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/10/2019] [Accepted: 01/17/2020] [Indexed: 06/10/2023]
Abstract
The collection and classification of data into meaningful categories is a key step in the process of knowledge making. In the life sciences, the design of data discovery and integration tools has relied on the premise that a formal classificatory system for expressing a body of data should be grounded in consensus definitions for classifications. On this approach, exemplified by the realist program of the Open Biomedical Ontologies Foundry, progress is maximized by grounding the representation and aggregation of data on settled knowledge. We argue that historical practices in systematic biology provide an important and overlooked alternative approach to classifying and disseminating data, based on a principle of coordinative rather than definitional consensus. Systematists have developed a robust system for referring to taxonomic entities that can deliver high quality data discovery and integration without invoking consensus about reality or "settled" science.
Affiliation(s)
- Beckett Sterner
- School of Life Sciences, Arizona State University, Tempe, USA
- Joeri Witteveen
- Department of Science Education, Section for History and Philosophy of Science, University of Copenhagen, Copenhagen, Denmark
- Nico Franz
- School of Life Sciences, Arizona State University, Tempe, USA
15
Singer RA, Ellis S, Page LM. Awareness and use of biodiversity collections by fish biologists. Journal of Fish Biology 2020; 96:297-306. [PMID: 31621077 DOI: 10.1111/jfb.14167] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/08/2019] [Accepted: 10/08/2019] [Indexed: 06/10/2023]
Abstract
A survey of 280 fish biologists from a diverse pool of disciplines was conducted to assess how biodiversity collections are used and how collections can better collect, curate and share the data they have. The responses are collated to report how fish biologists use collections, what data they find the most useful, what factors influence the decision to use collections, how they access the data and why some fish biologists decide not to use biodiversity collections. The results could be used to formulate sustainability plans for collections administrators and staff who curate fish biodiversity collections, while also highlighting the diversity of data and uses to researchers.
Affiliation(s)
- Randal A Singer
- University of Michigan Museum of Zoology, University of Michigan, Ann Arbor, Michigan, USA
- Shari Ellis
- Florida Museum of Natural History, University of Florida, Gainesville, Florida, USA
- Lawrence M Page
- Florida Museum of Natural History, University of Florida, Gainesville, Florida, USA
16
|
Kopperud BT, Lidgard S, Liow LH. Text-mined fossil biodiversity dynamics using machine learning. Proc Biol Sci 2019; 286:20190022. [PMID: 31014224 PMCID: PMC6501925 DOI: 10.1098/rspb.2019.0022] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/04/2019] [Accepted: 04/02/2019] [Indexed: 01/08/2023] Open
Abstract
Documented occurrences of fossil taxa are the empirical foundation for understanding large-scale biodiversity changes and evolutionary dynamics in deep time. The fossil record contains vast amounts of understudied taxa. Yet the compilation of huge volumes of data remains a labour-intensive impediment to a more complete understanding of Earth's biodiversity history. Even so, many occurrence records of species and genera in these taxa can be uncovered in the palaeontological literature. Here, we extract observations of fossils and their inferred ages from unstructured text in books and scientific articles using machine-learning approaches. We use Bryozoa, a group of marine invertebrates with a rich fossil record, as a case study. Building on recent advances in computational linguistics, we develop a pipeline to recognize taxonomic names and geologic time intervals in published literature and use supervised learning to machine-read whether the species in question occurred in a given age interval. Intermediate machine error rates appear comparable to human error rates in a simple trial, and resulting genus richness curves capture the main features of published fossil diversity studies of bryozoans. We believe our automated pipeline, that greatly reduced the time required to compile our dataset, can help others compile similar data for other taxa.
Affiliation(s)
- Bjørn Tore Kopperud
  - Natural History Museum, University of Oslo, PO Box 1172, Blindern, 0318 Oslo, Norway
- Scott Lidgard
  - Integrative Research Center, Field Museum, 1400 South Lake Shore Drive, Chicago IL, 60605, USA
- Lee Hsiang Liow
  - Natural History Museum, University of Oslo, PO Box 1172, Blindern, 0318 Oslo, Norway
  - Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, PO Box 1066, Blindern, 0316 Oslo, Norway
|
17
|
Chang J, Rabosky DL, Smith SA, Alfaro ME. An R package and online resource for macroevolutionary studies using the ray-finned fish tree of life. Methods Ecol Evol 2019. [DOI: 10.1111/2041-210x.13182] [Citation(s) in RCA: 56] [Impact Index Per Article: 11.2] [Indexed: 11/27/2022]
Affiliation(s)
- Jonathan Chang
  - School of Biological Sciences Monash University Clayton VIC Australia
- Daniel L. Rabosky
  - Museum of Zoology Department of Ecology and Evolutionary Biology University of Michigan Ann Arbor MI
- Stephen A. Smith
  - Museum of Zoology Department of Ecology and Evolutionary Biology University of Michigan Ann Arbor MI
- Michael E. Alfaro
  - Department of Ecology and Evolutionary Biology University of California Los Angeles CA
|
18
|
Abstract
Enormous quantities of biodiversity data are being made available online, but much of this data remains isolated in silos. One approach to breaking these silos is to map local, often database-specific identifiers to shared global identifiers. This mapping can then be used to construct a knowledge graph, where entities such as taxa, publications, people, places, specimens, sequences, and institutions are all part of a single, shared knowledge space. Motivated by the 2018 GBIF Ebbe Nielsen Challenge, I explore the feasibility of constructing a "biodiversity knowledge graph" for the Australian fauna. The data cleaning and reconciliation steps involved in constructing the knowledge graph are described in detail. Examples are given of its application to understanding changes in patterns of taxonomic publication over time. A web interface to the knowledge graph (called "Ozymandias") is available at https://ozymandias-demo.herokuapp.com.
|
19
|
Hobern D, Baptiste B, Copas K, Guralnick R, Hahn A, van Huis E, Kim ES, McGeoch M, Naicker I, Navarro L, Noesgaard D, Price M, Rodrigues A, Schigel D, Sheffield CA, Wieczorek J. Connecting data and expertise: a new alliance for biodiversity knowledge. Biodivers Data J 2019; 7:e33679. [PMID: 30886531 PMCID: PMC6420472 DOI: 10.3897/bdj.7.e33679] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 02/05/2019] [Accepted: 03/04/2019] [Indexed: 11/12/2022] Open
Abstract
There has been major progress over the last two decades in digitising historical knowledge of biodiversity and in making biodiversity data freely and openly accessible. Interlocking efforts bring together international partnerships and networks, national, regional and institutional projects and investments and countless individual contributors, spanning diverse biological and environmental research domains, government agencies and non-governmental organisations, citizen science and commercial enterprise. However, current efforts remain inefficient and inadequate to address the global need for accurate data on the world's species and on changing patterns and trends in biodiversity. Significant challenges include imbalances in regional engagement in biodiversity informatics activity, uneven progress in data mobilisation and sharing, the lack of stable persistent identifiers for data records, redundant and incompatible processes for cleaning and interpreting data and the absence of functional mechanisms for knowledgeable experts to curate and improve data. Recognising the need for greater alignment between efforts at all scales, the Global Biodiversity Information Facility (GBIF) convened the second Global Biodiversity Informatics Conference (GBIC2) in July 2018 to propose a coordination mechanism for developing shared roadmaps for biodiversity informatics. GBIC2 attendees reached consensus on the need for a global alliance for biodiversity knowledge, learning from examples such as the Global Alliance for Genomics and Health (GA4GH) and the open software communities under the Apache Software Foundation. These initiatives provide models for multiple stakeholders with decentralised funding and independent governance to combine resources and develop sustainable solutions that address common needs. This paper summarises the GBIC2 discussions and presents a set of 23 complementary ambitions to be addressed by the global community in the context of the proposed alliance. 
The authors call on all who are responsible for describing and monitoring natural systems, all who depend on biodiversity data for research, policy or sustainable environmental management and all who are involved in developing biodiversity informatics solutions to register interest at https://biodiversityinformatics.org/ and to participate in the next steps to establishing a collaborative alliance. The supplementary materials include brochures in a number of languages (English, Arabic, Spanish, Basque, French, Japanese, Dutch, Portuguese, Russian, Traditional Chinese and Simplified Chinese). These summarise the need for an alliance for biodiversity knowledge and call for collaboration in its establishment.
Affiliation(s)
- Donald Hobern
  - Global Biodiversity Information Facility Secretariat, Copenhagen, Denmark
- Brigitte Baptiste
  - Instituto de Investigación de Recursos Biológicos Alexander von Humboldt, Bogotá, Colombia
- Kyle Copas
  - Global Biodiversity Information Facility Secretariat, Copenhagen, Denmark
- Robert Guralnick
  - Vertnet, Florida, United States of America
  - University of Colorado, Boulder; University of Colorado Museum of Natural History, Boulder, United States of America
  - Univ. of Florida, Gainesville, United States of America
- Andrea Hahn
  - Global Biodiversity Information Facility Secretariat, Copenhagen, Denmark
- Edwin van Huis
  - Naturalis, Amsterdam, Netherlands
- Eun-Shik Kim
  - Kookmin University, Seoul, South Korea
- Melodie McGeoch
  - Monash University, Clayton, Australia
- Isayvani Naicker
  - African Academy of Sciences, Nairobi, Kenya
- Laetitia Navarro
  - German Centre for Integrative Biodiversity Research, Leipzig, Germany
- Daniel Noesgaard
  - Global Biodiversity Information Facility Secretariat, Copenhagen, Denmark
- Michelle Price
  - Conservatoire et Jardin botaniques de la Ville de Genève, Geneva, Switzerland
- Andrew Rodrigues
  - Global Biodiversity Information Facility Secretariat, Copenhagen, Denmark
- Dmitry Schigel
  - Global Biodiversity Information Facility Secretariat, Copenhagen, Denmark
- Carolyn A Sheffield
  - Smithsonian Libraries/Biodiversity Heritage Library, Washington, DC, United States of America
- John Wieczorek
  - VertNet, Bariloche, Argentina
  - Museum of Vertebrate Zoology, University of California, Berkeley, United States of America
|
20
|
Franz NM, Musher LJ, Brown JW, Yu S, Ludäscher B. Verbalizing phylogenomic conflict: Representation of node congruence across competing reconstructions of the neoavian explosion. PLoS Comput Biol 2019; 15:e1006493. [PMID: 30768597 PMCID: PMC6395011 DOI: 10.1371/journal.pcbi.1006493] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/13/2017] [Revised: 02/28/2019] [Accepted: 09/10/2018] [Indexed: 11/24/2022] Open
Abstract
Phylogenomic research is accelerating the publication of landmark studies that aim to resolve deep divergences of major organismal groups. Meanwhile, systems for identifying and integrating the products of phylogenomic inference, such as newly supported clade concepts, have not kept pace. However, the ability to verbalize node concept congruence and conflict across multiple, in effect simultaneously endorsed phylogenomic hypotheses, is a prerequisite for building synthetic data environments for biological systematics and other domains impacted by these conflicting inferences. Here we develop a novel solution to the conflict verbalization challenge, based on a logic representation and reasoning approach that utilizes the language of Region Connection Calculus (RCC-5) to produce consistent alignments of node concepts endorsed by incongruent phylogenomic studies. The approach employs clade concept labels to individuate concepts used by each source, even if these carry identical names. Indirect RCC-5 modeling of intensional (property-based) node concept definitions, facilitated by the local relaxation of coverage constraints, allows parent concepts to attain congruence in spite of their differentially sampled children. To demonstrate the feasibility of this approach, we align two recent phylogenomic reconstructions of higher-level avian groups that entail strong conflict in the "neoavian explosion" region. According to our representations, this conflict is constituted by 26 instances of input "whole concept" overlap. These instances are further resolvable in the output labeling schemes and visualizations as "split concepts", which provide the labels and relations needed to build truly synthetic phylogenomic data environments.
Because the RCC-5 alignments fundamentally reflect the trained, logic-enabled judgments of systematic experts, future designs for such environments need to promote a culture where experts routinely assess the intensionalities of node concepts published by our peers, even and especially when we are not in agreement with each other.
Affiliation(s)
- Nico M. Franz
  - School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
- Lukas J. Musher
  - Richard Gilder Graduate School and Department of Ornithology, American Museum of Natural History, New York, New York, United States of America
- Joseph W. Brown
  - Department of Animal and Plant Sciences, University of Sheffield, Sheffield, United Kingdom
- Shizhuo Yu
  - Department of Computer Science, University of California at Davis, Davis, California, United States of America
- Bertram Ludäscher
  - School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
|
21
|
Seltmann K, Lafia S, Paul D, James S, Bloom D, Rios N, Ellis S, Farrell U, Utrup J, Yost M, Davis E, Emery R, Motz G, Kimmig J, Shirey V, Sandall E, Park D, Tyrrell C, Thackurdeen RS, Collins M, O'Leary V, Prestridge H, Evelyn C, Nyberg B. Georeferencing for Research Use (GRU): An integrated geospatial training paradigm for biocollections researchers and data providers. Research Ideas and Outcomes 2018. [DOI: 10.3897/rio.4.e32449] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022] Open
Abstract
Georeferencing is the process of aligning a text description of a geographic location with a spatial location based on a geographic coordinate system. Training aids are commonly created around the georeferencing process to disseminate community standards and ideas, guide accurate georeferencing, inform users about new tools, and help users evaluate existing geospatial data. The Georeferencing for Research Use (GRU) workshop was implemented as a training aid that focused on the creation and research use of geospatial coordinates, and included both data researchers and data providers, to facilitate communication between the groups. The workshop included 23 participants with a wide background of expertise ranging from students (undergraduate and graduate), professors, researchers and educators, scientific data managers, natural history collections personnel, and spatial analyst specialists. The conversations and survey results from this workshop demonstrate that it is important to provide opportunities for biocollections data providers to interact directly with the researchers using the data they produce and vice versa.
|
22
|
Johnston MA, Aalbu RL, Franz NM. An updated checklist of the Tenebrionidae sec. Bousquet et al. 2018 of the Algodones Dunes of California, with comments on checklist data practices. Biodivers Data J 2018:e24927. [PMID: 29942173 PMCID: PMC6013544 DOI: 10.3897/bdj.6.e24927] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 03/09/2018] [Accepted: 06/11/2018] [Indexed: 11/12/2022] Open
Abstract
Generating regional checklists for insects is frequently based on combining data sources ranging from literature and expert assertions that merely imply the existence of an occurrence to aggregated, standard-compliant data of uniquely identified specimens. The increasing diversity of data sources also means that checklist authors are faced with new responsibilities, effectively acting as filterers to select and utilize an expert-validated subset of all available data. Authors are also faced with the technical obstacle to bring more occurrences into Darwin Core-based data aggregation, even if the corresponding specimens belong to external institutions. We illustrate these issues based on a partial update of the Kimsey et al. 2017 checklist of darkling beetles - Tenebrionidae sec. Bousquet et al. 2018 - inhabiting the Algodones Dunes of California. Our update entails 54 species-level concepts for this group and region, of which 31 concepts were found to be represented in three specimen-data aggregator portals, based on our interpretations of the aggregators' data. We reassess the distributions and biogeographic affinities of these species, focusing on taxa that are precinctive (highly geographically restricted) to the Lower Colorado River Valley in the context of recent dune formation from the Colorado River. Throughout, we apply taxonomic concept labels (taxonomic name according to source) to contextualize preferred name usages, but also show that the identification data of aggregated occurrences are very rarely well-contextualized or annotated. Doing so is a pre-requisite for publishing open, dynamic checklist versions that finely accredit incremental expert efforts spent to improve the quality of checklists and aggregated occurrence data.
Affiliation(s)
- M Andrew Johnston
  - Biodiversity Knowledge Integration Center, Arizona State University, Tempe, AZ, United States of America
- Rolf L Aalbu
  - California Academy of Sciences, San Francisco, CA, United States of America
- Nico M Franz
  - Biodiversity Knowledge Integration Center, Arizona State University, Tempe, AZ, United States of America
|
23
|
Mesibov R. An audit of some processing effects in aggregated occurrence records. Zookeys 2018:129-146. [PMID: 29713234 PMCID: PMC5923217 DOI: 10.3897/zookeys.751.24791] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 03/04/2018] [Accepted: 04/15/2018] [Indexed: 11/12/2022] Open
Abstract
A total of ca 800,000 occurrence records from the Australian Museum (AM), Museums Victoria (MV) and the New Zealand Arthropod Collection (NZAC) were audited for changes in selected Darwin Core fields after processing by the Atlas of Living Australia (ALA; for AM and MV records) and the Global Biodiversity Information Facility (GBIF; for AM, MV and NZAC records). Formal taxon names in the genus- and species-groups were changed in 13-21% of AM and MV records, depending on dataset and aggregator. There was little agreement between the two aggregators on processed names, with names changed in two to three times as many records by one aggregator alone compared to records with names changed by both aggregators. The type status of specimen records did not change with name changes, resulting in confusion as to the name with which a type was associated. Data losses of up to 100% were found after processing in some fields, apparently due to programming errors. The taxonomic usefulness of occurrence records could be improved if aggregators included both original and the processed taxonomic data items for each record. It is recommended that end-users check original and processed records for data loss and name replacements after processing by aggregators.
|