1
Noll NW, Scherber C, Schäffler L. taxalogue: a toolkit to create comprehensive CO1 reference databases. PeerJ 2023; 11:e16253. DOI: 10.7717/peerj.16253. PMID: 38077427; PMCID: PMC10702336.
Abstract
Background: Taxonomic identification through DNA barcodes gained considerable traction with the advent of next-generation sequencing and DNA metabarcoding. Metabarcoding allows for the simultaneous identification of thousands of organisms from bulk samples with high taxonomic resolution. However, reliable identifications can only be achieved with comprehensive and curated reference databases, so custom reference databases are often created to meet the needs of specific research questions. Due to taxonomic inconsistencies, formatting issues, and technical difficulties, building a custom reference database requires tremendous effort. Here, we present taxalogue, an easy-to-use software package for creating comprehensive and customized reference databases that provide clean and taxonomically harmonized records. In combination with extensive geographical filtering options, taxalogue opens up new possibilities for generating and testing evolutionary hypotheses.
Methods: taxalogue collects DNA sequences from several online sources and combines them into a reference database. Taxonomic incongruencies between the different data sources can be harmonized according to available taxonomies. Dereplication and various filtering options are available based on sequence quality or metadata. taxalogue is implemented in the open-source Ruby programming language, and the source code is available at https://github.com/nwnoll/taxalogue. We benchmark four reference databases by sequence identity against eight queries from different localities and trapping devices. Subsamples from each reference database were used to assess how well it is covered by the others.
Results: taxalogue produces reference databases with the best coverage at high identities for most tested queries, enabling more accurate and reliable predictions with higher certainty than the other benchmarked reference databases. Additionally, taxalogue performs more consistently while providing good coverage across a variety of habitats, regions, and sampling methods. taxalogue simplifies the creation of reference databases and makes the process reproducible and transparent. Multiple output formats for commonly used downstream applications facilitate the adoption of taxalogue in many different software pipelines. The resulting reference databases improve taxonomic classification accuracy through high coverage of the query sequences at high identities.
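The dereplication and quality-filtering steps described in the Methods can be sketched as follows. This is a minimal illustration of the general technique, not taxalogue's actual Ruby API; the function name, thresholds, and records are invented for the example.

```python
# Sketch of dereplication plus basic quality filtering for barcode records.
# Thresholds (min_len, max_ambiguous_frac) are illustrative assumptions.

def filter_and_dereplicate(records, min_len=500, max_ambiguous_frac=0.01):
    """Keep one record per unique sequence that passes basic quality checks."""
    seen = set()
    kept = []
    for taxon, seq in records:
        seq = seq.upper()
        ambiguous = sum(1 for base in seq if base not in "ACGT")
        if len(seq) < min_len or ambiguous / len(seq) > max_ambiguous_frac:
            continue  # drop short or low-quality sequences
        if seq in seen:
            continue  # dereplicate exact duplicates
        seen.add(seq)
        kept.append((taxon, seq))
    return kept

records = [
    ("Apis mellifera", "ACGT" * 150),     # 600 bp, clean
    ("Apis mellifera", "ACGT" * 150),     # exact duplicate
    ("Bombus terrestris", "ACGN" * 150),  # too many ambiguous bases
    ("Vespa crabro", "ACGT" * 50),        # too short
]
print(filter_and_dereplicate(records))
```

Only the first record survives: the duplicate, the ambiguous sequence, and the short fragment are all filtered out before the database is assembled.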
Affiliation(s)
- Niklas W. Noll
- Centre for Biodiversity Monitoring and Conservation Science, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, North Rhine-Westphalia, Germany
- Christoph Scherber
- Centre for Biodiversity Monitoring and Conservation Science, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, North Rhine-Westphalia, Germany
- Livia Schäffler
- Centre for Biodiversity Monitoring and Conservation Science, Leibniz Institute for the Analysis of Biodiversity Change, Bonn, North Rhine-Westphalia, Germany
2
Hidayat DS, Sensuse DI, Elisabeth D, Hasani LM. Conceptual model of knowledge management system for scholarly publication cycle in academic institution. VINE Journal of Information and Knowledge Management Systems 2022. DOI: 10.1108/vjikms-08-2021-0163.
Abstract
Purpose
Research on knowledge-based systems for scientific publications is growing rapidly. However, most of these studies do not explicitly discuss the knowledge management (KM) components involved in implementing a knowledge management system (KMS). As a result, academic institutions face challenges in developing KMS to support the scholarly publication cycle (SPC). Therefore, this study aims to develop a new KMS conceptual model, identify critical components and provide research gap opportunities for future KM studies on SPC.
Design/methodology/approach
This study used a systematic literature review (SLR) method following the procedure of Kitchenham et al. The SLR results were then compiled into a conceptual model design based on a framework of KM foundations and KM solutions. Finally, the model design was validated through interviews with experts in related fields.
Findings
The KMS for SPC focuses on the discovery, sharing and application of knowledge. Most KMS use recommender-system technology with content-based filtering and collaborative filtering personalization approaches. The data used in KMS for SPC are both structured and unstructured. Metadata and article abstracts are considered sufficiently representative of the entire article content to support search and provide recommendations. The KMS model for SPC has layers of KM infrastructure, processes, systems, strategies, outputs and outcomes.
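The content-based filtering approach mentioned above can be sketched in a few lines. This is a simplification under stated assumptions: token-overlap (Jaccard) similarity stands in for the richer text models real recommender systems use, and the titles and abstracts are invented.

```python
# Sketch of content-based filtering over article abstracts:
# recommend articles whose text most resembles a query abstract.

def jaccard(a, b):
    """Token-set overlap between two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def recommend(query_abstract, corpus, top_n=2):
    """Rank corpus articles by textual similarity to the query abstract."""
    ranked = sorted(corpus.items(),
                    key=lambda kv: jaccard(query_abstract, kv[1]),
                    reverse=True)
    return [title for title, _ in ranked[:top_n]]

corpus = {
    "KM for publication workflows": "knowledge management for scholarly publication workflows",
    "Deep learning for vision": "convolutional networks for image recognition",
    "Recommender systems survey": "content based filtering and collaborative filtering for recommendation",
}
print(recommend("content based filtering for scholarly recommendation", corpus))
```

In a production KMS the similarity function would operate on metadata and abstract embeddings rather than raw token overlap, but the ranking structure is the same.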
Research limitations/implications
This study has limitations in its treatment of tacit knowledge, which is essential for scientific publication performance in the SPC. Tacit knowledge includes experience in searching, writing, submitting, publishing and disseminating scientific publications, and it plays a vital role in the development of knowledge sharing systems (KSS) and KCS. Therefore, KSS and KCS for SPC remain challenging topics for future research. KMS opportunities that might be developed further include lessons-learned databases and interactive forums that capture tacit knowledge about the SPC. Future work could identify other types of KMS in academia and focus more on the SPC.
Originality/value
This study proposes a novel, comprehensive KMS model to support scientific publication performance. The model provides a critical path as a KMS implementation solution for the SPC and recommends components appropriate to SPC requirements (KM processes, technology, methods/techniques and data). This study also identifies novel research gaps as KMS research opportunities for the SPC.
3
Agosti D, Benichou L, Addink W, Arvanitidis C, Catapano T, Cochrane G, Dillen M, Döring M, Georgiev T, Gérard I, Groom Q, Kishor P, Kroh A, Kvaček J, Mergen P, Mietchen D, Pauperio J, Sautter G, Penev L. Recommendations for use of annotations and persistent identifiers in taxonomy and biodiversity publishing. Research Ideas and Outcomes 2022. DOI: 10.3897/rio.8.e97374.
Abstract
The paper summarises many years of discussions and experience of biodiversity publishers, organisations, research projects and individual researchers, and proposes recommendations for implementation of persistent identifiers for article metadata, structural elements (sections, subsections, figures, tables, references, supplementary materials and others) and data specific to biodiversity (taxonomic treatments, treatment citations, taxon names, material citations, gene sequences, specimens, scientific collections) in taxonomy and biodiversity publishing. The paper proposes best practices on how identifiers should be used in the different cases and on how they can be minted, cited, and expressed in the backend article XML to facilitate conversion to and further re-use of the article content as FAIR data. The paper also discusses several specific routes for post-publication re-use of semantically enhanced content through large biodiversity data aggregators such as the Global Biodiversity Information Facility (GBIF), the International Nucleotide Sequence Database Collaboration (INSDC) and others, and proposes specifications of both identifiers and XML tags to be used for that purpose. A summary table provides an account and overview of the recommendations. The guidelines are supported with examples from the existing publishing practices.
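The idea of expressing persistent identifiers in backend article XML can be sketched as follows. The element and attribute names below are illustrative assumptions, not the actual TaxPub/JATS vocabulary the recommendations target; only the general pattern (identifiers attached to the article and to sub-article data elements) reflects the paper.

```python
# Sketch: persistent identifiers attached to an article and to sub-article
# elements (a taxonomic treatment and a material citation), serialized as XML.
import xml.etree.ElementTree as ET

article = ET.Element("article", {"doi": "10.3897/example.1.e12345"})
treatment = ET.SubElement(article, "taxon-treatment",
                          {"id": "https://treatment.example.org/ABCD1234"})
name = ET.SubElement(treatment, "taxon-name")
name.text = "Apis mellifera Linnaeus, 1758"
ET.SubElement(treatment, "material-citation",
              {"specimen-id": "https://specimen.example.org/XYZ987"})

xml_string = ET.tostring(article, encoding="unicode")
print(xml_string)
```

Because each identifier is a resolvable attribute rather than free text, downstream aggregators can extract and link the treatment and specimen without parsing prose.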
4
Patel A, Jain S, Debnath NC, Lama V. InBiodiv-O. International Journal of Information System Modeling and Design 2022. DOI: 10.4018/ijismd.315021.
Abstract
To present biodiversity information, a semantic model is required that connects all kinds of data about living creatures and their habitats. The model must encode human knowledge in a form that machines can understand. Ontologies offer the richest machine-interpretable semantics and are extensively used in the biodiversity domain. Various ontologies have been developed for the biodiversity domain; however, they cannot adequately describe Indian biodiversity information, even though India is one of the world's megadiverse countries. To semantically analyze Indian biodiversity information, it is crucial to build an ontology that describes all the terms of this domain. Since the curation of an ontology depends on the domain in which it is used, no ideal methodology has yet been defined. The aim of this article is to develop an ontology that semantically encodes all the terms of Indian biodiversity information in all its dimensions, based on the proposed methodology. Evaluation shows that the proposed ontology is well built for the specified domain.
Affiliation(s)
- Sarika Jain
- National Institute of Technology, Kurukshetra, India
5
Stocker M, Heger T, Schweidtmann A, Ćwiek-Kupczyńska H, Penev L, Dojchinovski M, Willighagen E, Vidal ME, Turki H, Balliet D, Tiddi I, Kuhn T, Mietchen D, Karras O, Vogt L, Hellmann S, Jeschke J, Krajewski P, Auer S. SKG4EOSC - Scholarly Knowledge Graphs for EOSC: Establishing a backbone of knowledge graphs for FAIR Scholarly Information in EOSC. Research Ideas and Outcomes 2022. DOI: 10.3897/rio.8.e83789.
Abstract
In the age of advanced information systems powering fast-paced knowledge economies that face global societal challenges, it is no longer adequate to express scholarly information - an essential resource for modern economies - primarily as article narratives in document form. Despite being a well-established tradition in scholarly communication, PDF-based text publishing is hindering scientific progress because it buries scholarly information in non-machine-readable formats. The key objective of SKG4EOSC is to improve science productivity through the development and implementation of services for text and data conversion, and for the production, curation, and re-use of FAIR scholarly information. This will be achieved by (1) establishing the Open Research Knowledge Graph (ORKG, orkg.org), a service operated by the SKG4EOSC coordinator, as a hub for access to FAIR scholarly information in the EOSC; (2) lifting numerous and heterogeneous domain-specific research infrastructures into the EOSC through the ORKG Hub's harmonized access facilities; and (3) leveraging the Hub to support cross-disciplinary research and policy decisions addressing societal challenges. SKG4EOSC will pilot the devised approaches and technologies in four research domains: the biodiversity crisis, precision oncology, circular processes, and human cooperation. With the aim of improving machine-based use of scholarly information, SKG4EOSC addresses an important current and future need of researchers. It extends the application of the FAIR data principles to scholarly communication practices, providing more comprehensive coverage of the entire research lifecycle. Through explicit, machine-actionable provenance links between FAIR scholarly information, primary data and contextual entities, it will substantially contribute to reproducibility, validation and trust in science. The resulting advanced machine support will catalyse new discoveries in basic research and solutions in key application areas.
6
Penev L, Koureas D, Groom Q, Lanfear J, Agosti D, Casino A, Miller J, Arvanitidis C, Cochrane G, Hobern D, Banki O, Addink W, Kõljalg U, Copas K, Mergen P, Güntsch A, Benichou L, Benito Gonzalez Lopez J, Ruch P, Martin C, Barov B, Hristova K. Biodiversity Community Integrated Knowledge Library (BiCIKL). Research Ideas and Outcomes 2022. DOI: 10.3897/rio.8.e81136.
Abstract
BiCIKL is a European Union Horizon 2020 project that will initiate and build a new European community of key research infrastructures, establishing open science practices in the biodiversity domain through provision of access to data and associated tools and services at each stage of, and along, the entire research cycle. BiCIKL will provide new methods and workflows for integrated access to harvesting, liberating, linking, accessing and re-using sub-article-level data (specimens, material citations, samples, sequences, taxonomic names, taxonomic treatments, figures, tables) extracted from literature. BiCIKL will provide, for the first time, access and tools for seamless linking and usage tracking of data along the chain: specimens > sequences > species > analytics > publications > biodiversity knowledge graph > re-use.
7
People, Projects, Organizations, and Products: Designing a Knowledge Graph to Support Multi-Stakeholder Environmental Planning and Design. ISPRS International Journal of Geo-Information 2021. DOI: 10.3390/ijgi10120823.
Abstract
As the need for more broad-scale solutions to environmental problems is increasingly recognized, traditional hierarchical, government-led models of coordination are being supplemented by or transformed into more collaborative inter-organizational networks (i.e., collaboratives, coalitions, partnerships). As diffuse networks, such regional environmental planning and design (REPD) efforts often face challenges in sharing and using spatial and other types of information. Recent advances in semantic knowledge management technologies, such as knowledge graphs, have the potential to address these challenges. In this paper, we first describe the information needs of three multi-stakeholder REPD initiatives in the western USA using a list of 80 need-to-know questions and concerns. The top needs expressed were for help in tracking the participants, institutions, and information products relevant to each REPD's focus. To address these needs, we developed a prototype knowledge graph based on RDF and GeoSPARQL standards. This semantic approach provided a more flexible data structure than traditional relational databases, as well as functionality to query information across different providers; however, the lack of semantic data expertise, the complexity of existing software solutions, and limited online hosting options are significant barriers to adoption. These same barriers are more acute for geospatial data, which also faces the added challenge of maintaining and synchronizing both semantic and traditional geospatial datastores.
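The knowledge-graph approach described above, entities such as people, projects, and organizations linked by typed relations, can be sketched without a triple store. The URIs, predicate names, and entities below are invented for illustration; a tiny pattern-matching function stands in for SPARQL.

```python
# Sketch: RDF-style triples for people, organizations, and products,
# queried with a minimal (subject, predicate, object) pattern matcher.

triples = [
    ("ex:alice", "ex:memberOf", "ex:watershed_coalition"),
    ("ex:bob", "ex:memberOf", "ex:watershed_coalition"),
    ("ex:coalition_report", "ex:producedBy", "ex:watershed_coalition"),
    ("ex:watershed_coalition", "ex:focusRegion", "ex:upper_basin"),
]

def objects(subject, predicate):
    """All objects matching a (subject, predicate, ?o) pattern."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def subjects(predicate, obj):
    """All subjects matching a (?s, predicate, object) pattern."""
    return [s for s, p, o in triples if p == predicate and o == obj]

# Who participates in the coalition, and what has it produced?
print(subjects("ex:memberOf", "ex:watershed_coalition"))
print(subjects("ex:producedBy", "ex:watershed_coalition"))
```

The flexibility the paper highlights comes from this shape: adding a new relation type is just another triple, with no schema migration as in a relational database.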
8
Pyle RL, Barik SK, Christidis L, Conix S, Costello MJ, van Dijk PP, Garnett ST, Hobern D, Kirk PM, Lien AM, Orrell TM, Remsen D, Thomson SA, Wambiji N, Zachos FE, Zhang ZQ, Thiele KR. Towards a global list of accepted species V. The devil is in the detail. Org Divers Evol 2021. DOI: 10.1007/s13127-021-00504-0.
9
Dimitrova M, Senderov VE, Georgiev T, Zhelezov G, Penev L. Infrastructure and Population of the OpenBiodiv Biodiversity Knowledge Graph. Biodivers Data J 2021; 9:e67671. DOI: 10.3897/bdj.9.e67671. PMID: 34690512; PMCID: PMC8486731.
Abstract
Background: OpenBiodiv is a biodiversity knowledge graph containing a synthetic linked open dataset, OpenBiodiv-LOD, which combines knowledge extracted from academic literature with the taxonomic backbone used by the Global Biodiversity Information Facility. The linked open data is modelled according to the OpenBiodiv-O ontology, integrating semantic resource types from recognised biodiversity and publishing ontologies with OpenBiodiv-O resource types introduced to capture the semantics of resources not modelled before.
New information: We introduce the new release of the OpenBiodiv-LOD, attained through information extraction and modelling of additional biodiversity entities. It was achieved through further development of OpenBiodiv-O, the data storage infrastructure, and the workflow and accompanying R software packages used for the transformation of academic literature into Resource Description Framework (RDF). We discuss how to utilise the LOD in biodiversity informatics and give examples by providing solutions to several competency questions. We investigate performance issues that arise due to the large number of inferred statements in the graph and conclude that OWL-full inference is impractical for the project and that unnecessary inference should be avoided.
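The inference-cost problem noted above can be illustrated with a toy example: materializing the transitive closure of a hierarchy turns a linear chain of asserted subclass edges into a quadratic number of inferred statements. The hierarchy below is invented; the point is only the growth in statement count, which is why blanket OWL-full inference becomes impractical at graph scale.

```python
# Sketch: naive materialization of transitive (sub, super) inferences,
# showing how inferred statements outnumber asserted edges.

def transitive_closure(edges):
    """Materialize all inferred (sub, super) pairs from direct edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# A taxonomic chain: species < genus < family < order < class
chain = [("species", "genus"), ("genus", "family"),
         ("family", "order"), ("order", "class")]
inferred = transitive_closure(chain)
print(len(chain), "asserted edges ->", len(inferred), "materialized statements")
```

Four asserted edges already materialize ten statements; over a taxonomy of millions of names, this multiplication is what makes unrestricted inference prohibitive.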
Affiliation(s)
- Mariya Dimitrova
- Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria; Pensoft Publishers, Sofia, Bulgaria
- Viktor E Senderov
- Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, Sweden
- Teodor Georgiev
- Pensoft Publishers, Sofia, Bulgaria
- Georgi Zhelezov
- Pensoft Publishers, Sofia, Bulgaria
- Lyubomir Penev
- Pensoft Publishers, Sofia, Bulgaria; Institute of Biodiversity & Ecosystem Research, Bulgarian Academy of Sciences, Sofia, Bulgaria
10
Upham NS, Poelen JH, Paul D, Groom QJ, Simmons NB, Vanhove MPM, Bertolino S, Reeder DM, Bastos-Silveira C, Sen A, Sterner B, Franz NM, Guidoti M, Penev L, Agosti D. Liberating host-virus knowledge from biological dark data. Lancet Planet Health 2021; 5:e746-e750. DOI: 10.1016/s2542-5196(21)00196-0. PMID: 34562356; PMCID: PMC8457912.
Abstract
Connecting basic data about bats and other potential hosts of SARS-CoV-2 with their ecological context is crucial to the understanding of the emergence and spread of the virus. However, when lockdowns in many countries started in March, 2020, the world's bat experts were locked out of their research laboratories, which in turn impeded access to large volumes of offline ecological and taxonomic data. Pandemic lockdowns have brought to attention the long-standing problem of so-called biological dark data: data that are published, but disconnected from digital knowledge resources and thus unavailable for high-throughput analysis. Knowledge of host-to-virus ecological interactions will be biased until this challenge is addressed. In this Viewpoint, we outline two viable solutions: first, in the short term, to interconnect published data about host organisms, viruses, and other pathogens; and second, to shift the publishing framework beyond unstructured text (the so-called PDF prison) to labelled networks of digital knowledge. As the indexing system for biodiversity data, biological taxonomy is foundational to both solutions. Building digitally connected knowledge graphs of host-pathogen interactions will establish the agility needed to quickly identify reservoir hosts of novel zoonoses, allow for more robust predictions of emergence, and thereby strengthen human and planetary health systems.
Affiliation(s)
- Nathan S Upham
- School of Life Sciences, Arizona State University, Tempe, AZ, USA.
- Jorrit H Poelen
- Ronin Institute for Independent Scholarship, Montclair, NJ, USA; Cheadle Center for Biodiversity and Ecological Restoration, University of California Santa Barbara, Santa Barbara, CA, USA
- Deborah Paul
- Illinois Natural History Survey, University of Illinois Urbana-Champaign, Champaign, IL, USA
- Nancy B Simmons
- Department of Mammalogy, Division of Vertebrate Zoology, American Museum of Natural History, New York, NY, USA
- Maarten P M Vanhove
- Zoology, Biodiversity and Toxicology, Centre for Environmental Sciences, Hasselt University, Diepenbeek, Belgium
- Sandro Bertolino
- Department of Life Sciences and Systems Biology, University of Turin, Turin, Italy
- DeeAnn M Reeder
- Department of Biology, Bucknell University, Lewisburg, PA, USA
- Atriya Sen
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
- Beckett Sterner
- School of Life Sciences, Arizona State University, Tempe, AZ, USA
- Nico M Franz
- School of Life Sciences, Arizona State University, Tempe, AZ, USA
11
Sterner B, Upham N, Gupta P, Powell C, Franz N. Wanted: Standards for FAIR taxonomic concept representations and relationships. Biodiversity Information Science and Standards 2021; 5. DOI: 10.3897/biss.5.75587. PMID: 35462676; PMCID: PMC9028594.
Abstract
Making the most of biodiversity data requires linking observations of biological species from multiple sources both efficiently and accurately (Bisby 2000, Franz et al. 2016). Aggregating occurrence records using taxonomic names and synonyms is computationally efficient but known to suffer significant limitations on accuracy when the assumption of one-to-one relationships between names and biological entities breaks down (Remsen 2016, Franz and Sterner 2018). Taxonomic treatments and checklists provide authoritative information about the correct usage of names for species, including operational representations of the meanings of those names in the form of range maps, reference genetic sequences, or diagnostic traits. They increasingly provide taxonomic intelligence in the form of precise descriptions of the semantic relationships between different published names in the literature. Making this authoritative information Findable, Accessible, Interoperable, and Reusable (FAIR; Wilkinson et al. 2016) would be a transformative advance for biodiversity data sharing and would help drive adoption and novel extensions of existing standards such as the Taxonomic Concept Schema and the OpenBiodiv Ontology (Kennedy et al. 2006, Senderov et al. 2018). We call on the broader, global Biodiversity Information Standards (TDWG) and taxonomy community to commit to extending and expanding how FAIR applies to biodiversity data, and to include practical targets and criteria for the publication and digitization of taxonomic concept representations and alignments in taxonomic treatments, checklists, and backbones.
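The "taxonomic intelligence" described above, machine-readable relationships between name usages, can be sketched as a lookup table of concept alignments. The concept labels and the relation vocabulary below are invented for illustration (in the spirit of RCC-5 alignments); the key point is that only congruent concepts allow occurrence records to be pooled safely.

```python
# Sketch: semantic alignments between taxonomic concepts ("name sec. source"),
# used to decide whether occurrence records can be merged across sources.

alignments = {
    # (concept A, concept B): relationship of A's circumscription to B's
    ("Aus bus sec. Smith 1990", "Aus bus sec. Jones 2005"): "congruent",
    ("Aus cus sec. Smith 1990", "Aus cus sec. Jones 2005"): "includes",
    ("Aus dus sec. Smith 1990", "Aus dus sec. Jones 2005"): "overlaps",
}

def can_merge_occurrences(concept_a, concept_b):
    """Occurrence records are safely poolable only for congruent concepts."""
    return alignments.get((concept_a, concept_b)) == "congruent"

print(can_merge_occurrences("Aus bus sec. Smith 1990", "Aus bus sec. Jones 2005"))
print(can_merge_occurrences("Aus cus sec. Smith 1990", "Aus cus sec. Jones 2005"))
```

Name-based aggregation implicitly assumes every pair is "congruent"; making the alignment explicit is what the called-for standards would enable.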
12
Bourgoin T, Bailly N, Zaragueta R, Vignes-Lebbe R. Complete formalization of taxa with their names, contents and descriptions improves taxonomic databases and access to the taxonomic knowledge they support. Syst Biodivers 2021. DOI: 10.1080/14772000.2021.1915895.
Affiliation(s)
- Thierry Bourgoin
- Muséum national d'Histoire naturelle, Institut Systématique, Évolution, Biodiversité (ISYEB), UMR 7205 MNHN-CNRS-Sorbonne Université-EPHE-Université des Antilles, Paris, 75005 France
- Nicolas Bailly
- Beaty Biodiversity Museum - Department of Zoology, University of British Columbia, Vancouver, Canada
- René Zaragueta
- Sorbonne Université, Muséum national d'Histoire naturelle, CNRS, EPHE, Université des Antilles, Institut de Systématique Évolution Biodiversité (ISYEB), Paris, 75005 France
- Régine Vignes-Lebbe
- Sorbonne Université, Muséum national d'Histoire naturelle, CNRS, EPHE, Université des Antilles, Institut de Systématique Évolution Biodiversité (ISYEB), Paris, 75005 France
13
Incorporating RDA Outputs in the Design of a European Research Infrastructure for Natural Science Collections. Data Science Journal 2020. DOI: 10.5334/dsj-2020-050.
14
Abstract
Knowledge-based biomedical data science involves the design and implementation of computer systems that act as if they knew about biomedicine. Such systems depend on formally represented knowledge in computer systems, often in the form of knowledge graphs. Here we survey recent progress in systems that use formally represented knowledge to address data science problems in both clinical and biological domains, as well as progress on approaches for creating knowledge graphs. Major themes include the relationships between knowledge graphs and machine learning, the use of natural language processing to construct knowledge graphs, and the expansion of novel knowledge-based approaches to clinical and biological domains.
Affiliation(s)
- Tiffany J Callahan
- Computational Bioscience Program and Department of Pharmacology, University of Colorado Denver Anschutz Medical Campus, Aurora, Colorado 80045, USA
- Ignacio J Tripodi
- Department of Computer Science, University of Colorado, Boulder, Colorado 80309, USA
- Harrison Pielke-Lombardo
- Computational Bioscience Program and Department of Pharmacology, University of Colorado Denver Anschutz Medical Campus, Aurora, Colorado 80045, USA
- Lawrence E Hunter
- Computational Bioscience Program and Department of Pharmacology, University of Colorado Denver Anschutz Medical Campus, Aurora, Colorado 80045, USA
15
Semantic Publication of Agricultural Scientific Literature Using Property Graphs. Applied Sciences (Basel) 2020. DOI: 10.3390/app10030861.
Abstract
During the last decades, there have been significant changes in science that have driven a large increase in the number of articles published every year. This growth creates a new difficulty for scientists, who must make an extra effort to select the literature relevant to their activity. In this work, we present a pipeline for the generation of scientific literature knowledge graphs in the agriculture domain. The pipeline combines Semantic Web and natural language processing technologies, which make data understandable by computer agents, empowering the development of end-user applications for literature searches. This workflow consists of (1) RDF generation, including metadata and contents; (2) semantic annotation of the content; and (3) property graph population by adding domain knowledge from ontologies, in addition to the previously generated RDF data describing the articles. This pipeline was applied to a set of 127 agriculture articles, generating a knowledge graph implemented in Neo4j, publicly available on Docker. The potential of our model is illustrated through a series of queries and use cases, which not only include queries about authors or references but also deal with article similarity or clustering based on semantic annotation, which is facilitated by the inclusion of domain ontologies in the graph.
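The property-graph population step (3) and the annotation-based similarity queries can be sketched as follows. The node labels, relation name, term IRI, and article titles are invented for illustration; the paper's actual Neo4j schema is not reproduced here.

```python
# Sketch: a property graph as plain dicts, with articles linked to ontology
# terms, and a similarity query that follows shared annotations.

nodes = {
    "art1": {"label": "Article", "title": "Wheat drought tolerance"},
    "art2": {"label": "Article", "title": "Drought stress in maize"},
    "t_drought": {"label": "Term", "iri": "http://purl.example.org/AGRO_drought"},
}
edges = [
    ("art1", "ANNOTATED_WITH", "t_drought"),
    ("art2", "ANNOTATED_WITH", "t_drought"),
]

def similar_articles(article_id):
    """Articles sharing at least one annotation term with the given article."""
    terms = {t for a, rel, t in edges
             if a == article_id and rel == "ANNOTATED_WITH"}
    return sorted({a for a, rel, t in edges
                   if rel == "ANNOTATED_WITH" and t in terms and a != article_id})

print(similar_articles("art1"))
```

In Neo4j the same query would be a short Cypher pattern over the annotation relationship; the design choice is identical: similarity is computed over shared ontology terms rather than raw text.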
16
OpenBiodiv: A Knowledge Graph for Literature-Extracted Linked Open Data in Biodiversity Science. Publications 2019. DOI: 10.3390/publications7020038.
Abstract
Hundreds of years of biodiversity research have resulted in the accumulation of a substantial pool of communal knowledge; however, most of it is stored in silos isolated from each other, such as published articles or monographs. The need for a system to store and manage collective biodiversity knowledge in a community-agreed and interoperable open format has evolved into the concept of the Open Biodiversity Knowledge Management System (OBKMS). This paper presents OpenBiodiv: An OBKMS that utilizes semantic publishing workflows, text and data mining, common standards, ontology modelling and graph database technologies to establish a robust infrastructure for managing biodiversity knowledge. It is presented as a Linked Open Dataset generated from scientific literature. OpenBiodiv encompasses data extracted from more than 5000 scholarly articles published by Pensoft and many more taxonomic treatments extracted by Plazi from journals of other publishers. The data from both sources are converted to Resource Description Framework (RDF) and integrated in a graph database using the OpenBiodiv-O ontology and an RDF version of the Global Biodiversity Information Facility (GBIF) taxonomic backbone. Through the application of semantic technologies, the project showcases the value of open publishing of Findable, Accessible, Interoperable, Reusable (FAIR) data towards the establishment of open science practices in the biodiversity domain.
17
Kopperud BT, Lidgard S, Liow LH. Text-mined fossil biodiversity dynamics using machine learning. Proc Biol Sci 2019; 286:20190022. DOI: 10.1098/rspb.2019.0022. PMID: 31014224; PMCID: PMC6501925.
Abstract
Documented occurrences of fossil taxa are the empirical foundation for understanding large-scale biodiversity changes and evolutionary dynamics in deep time. The fossil record contains vast numbers of understudied taxa, yet the compilation of huge volumes of data remains a labour-intensive impediment to a more complete understanding of Earth's biodiversity history. Even so, many occurrence records of species and genera in these taxa can be uncovered in the palaeontological literature. Here, we extract observations of fossils and their inferred ages from unstructured text in books and scientific articles using machine-learning approaches. We use Bryozoa, a group of marine invertebrates with a rich fossil record, as a case study. Building on recent advances in computational linguistics, we develop a pipeline to recognize taxonomic names and geologic time intervals in published literature and use supervised learning to machine-read whether the species in question occurred in a given age interval. Intermediate machine error rates appear comparable to human error rates in a simple trial, and the resulting genus richness curves capture the main features of published fossil diversity studies of bryozoans. We believe our automated pipeline, which greatly reduced the time required to compile our dataset, can help others compile similar data for other taxa.
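The recognition step of such a pipeline can be sketched crudely with regular expressions. This is a simplification under stated assumptions: the paper uses trained models for name and interval recognition, whereas the binomial pattern and the hard-coded interval list below are toy stand-ins.

```python
# Sketch: recognize candidate genus names and geologic intervals in a
# sentence, then pair them as candidate fossil occurrences.
import re

INTERVALS = {"Miocene", "Eocene", "Pliocene", "Cretaceous"}
GENUS_PATTERN = re.compile(r"\b([A-Z][a-z]+)\s[a-z]+\b")  # crude binomial match

def extract_occurrences(sentence):
    genera = {m.group(1) for m in GENUS_PATTERN.finditer(sentence)
              if m.group(1) not in INTERVALS}
    intervals = {w for w in re.findall(r"[A-Z][a-z]+", sentence)
                 if w in INTERVALS}
    return sorted((g, i) for g in genera for i in intervals)

text = "Microporella ciliata is reported from Miocene deposits of Italy."
print(extract_occurrences(text))
```

The supervised-learning step would then classify each candidate (genus, interval) pair as a genuine occurrence statement or not, which is where most of the accuracy comes from.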
Affiliation(s)
- Bjørn Tore Kopperud
- Natural History Museum, University of Oslo, PO Box 1172, Blindern, 0318 Oslo, Norway
- Scott Lidgard
- Integrative Research Center, Field Museum, 1400 South Lake Shore Drive, Chicago IL, 60605, USA
- Lee Hsiang Liow
- Natural History Museum, University of Oslo, PO Box 1172, Blindern, 0318 Oslo, Norway
- Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, PO Box 1066, Blindern, 0316 Oslo, Norway
|
18
|
Abstract
Enormous quantities of biodiversity data are being made available online, but much of this data remains isolated in silos. One approach to breaking these silos is to map local, often database-specific identifiers to shared global identifiers. This mapping can then be used to construct a knowledge graph, where entities such as taxa, publications, people, places, specimens, sequences, and institutions are all part of a single, shared knowledge space. Motivated by the 2018 GBIF Ebbe Nielsen Challenge, I explore the feasibility of constructing a “biodiversity knowledge graph” for the Australian fauna. The data cleaning and reconciliation steps involved in constructing the knowledge graph are described in detail. Examples are given of its application to understanding changes in patterns of taxonomic publication over time. A web interface to the knowledge graph (called “Ozymandias”) is available at https://ozymandias-demo.herokuapp.com.
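The reconciliation step described here — mapping local, database-specific identifiers onto shared global ones so that records from different silos merge into a single node — reduces to a join on the mapped keys. All identifiers and record contents below are invented for illustration:

```python
# Hypothetical local records from two isolated databases, each using its
# own internal identifier for the same taxon.
afd_records = {"afd:1001": {"name": "Osphranter rufus", "rank": "species"}}
bank_records = {"bank:77": {"sequence_count": 12}}

# Hand-curated mapping from local identifiers to a shared global identifier
# (the GBIF-style key is made up for this sketch).
to_global = {"afd:1001": "gbif:2440056", "bank:77": "gbif:2440056"}

def merge_into_graph(*sources):
    """Fold every local record into a node keyed by its global identifier."""
    graph = {}
    for source in sources:
        for local_id, attrs in source.items():
            node = graph.setdefault(to_global[local_id], {})
            node.update(attrs)
    return graph

print(merge_into_graph(afd_records, bank_records))
```

Because both local identifiers resolve to the same global key, the taxonomic record and the sequence record end up as attributes of one shared node, which is exactly what lets the knowledge graph answer questions that span the original silos.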
|
19
|
Franz NM, Musher LJ, Brown JW, Yu S, Ludäscher B. Verbalizing phylogenomic conflict: Representation of node congruence across competing reconstructions of the neoavian explosion. PLoS Comput Biol 2019; 15:e1006493. [PMID: 30768597 PMCID: PMC6395011 DOI: 10.1371/journal.pcbi.1006493] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2017] [Revised: 02/28/2019] [Accepted: 09/10/2018] [Indexed: 11/24/2022] Open
Abstract
Phylogenomic research is accelerating the publication of landmark studies that aim to resolve deep divergences of major organismal groups. Meanwhile, systems for identifying and integrating the products of phylogenomic inference-such as newly supported clade concepts-have not kept pace. However, the ability to verbalize node concept congruence and conflict across multiple, in effect simultaneously endorsed phylogenomic hypotheses, is a prerequisite for building synthetic data environments for biological systematics and other domains impacted by these conflicting inferences. Here we develop a novel solution to the conflict verbalization challenge, based on a logic representation and reasoning approach that utilizes the language of Region Connection Calculus (RCC-5) to produce consistent alignments of node concepts endorsed by incongruent phylogenomic studies. The approach employs clade concept labels to individuate concepts used by each source, even if these carry identical names. Indirect RCC-5 modeling of intensional (property-based) node concept definitions, facilitated by the local relaxation of coverage constraints, allows parent concepts to attain congruence in spite of their differentially sampled children. To demonstrate the feasibility of this approach, we align two recent phylogenomic reconstructions of higher-level avian groups that entail strong conflict in the "neoavian explosion" region. According to our representations, this conflict is constituted by 26 instances of input "whole concept" overlap. These instances are further resolvable in the output labeling schemes and visualizations as "split concepts", which provide the labels and relations needed to build truly synthetic phylogenomic data environments. 
Because the RCC-5 alignments fundamentally reflect the trained, logic-enabled judgments of systematic experts, future designs for such environments need to promote a culture where experts routinely assess the intensionalities of node concepts published by our peers-even and especially when we are not in agreement with each other.
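The five RCC-5 relations used in this alignment approach have a simple set-theoretic reading when a node concept is modelled extensionally by the terminals it covers. The sketch below uses that simplification (the paper's actual reasoning is logic-based and handles intensional definitions, which plain sets do not), with invented terminal names:

```python
def rcc5(a, b):
    """Classify two node-concept extensions, modelled as sets of covered
    terminals, into one of the five RCC-5 relations."""
    if a == b:
        return "=="   # congruent
    if a < b:
        return "<"    # properly included in
    if a > b:
        return ">"    # properly includes
    if a & b:
        return "><"   # overlap ("whole concept" overlap in the abstract)
    return "!"        # disjoint

# Two hypothetical node concepts from competing reconstructions,
# individuated by which terminal taxa they cover.
neoaves_2014 = {"doves", "flamingos", "grebes"}
neoaves_2015 = {"doves", "flamingos", "swifts"}
print(rcc5(neoaves_2014, neoaves_2015))   # the overlap case
```

The 26 instances of "whole concept" overlap reported in the abstract are exactly the `><` outcomes: pairs of concepts that share some children but each also cover children the other excludes.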
Affiliation(s)
- Nico M. Franz
- School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
- Lukas J. Musher
- Richard Gilder Graduate School and Department of Ornithology, American Museum of Natural History, New York, New York, United States of America
- Joseph W. Brown
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield, United Kingdom
- Shizhuo Yu
- Department of Computer Science, University of California at Davis, Davis, California, United States of America
- Bertram Ludäscher
- School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, Illinois, United States of America
|
20
|
Muñoz G, Kissling WD, van Loon EE. Biodiversity Observations Miner: A web application to unlock primary biodiversity data from published literature. Biodivers Data J 2019:e28737. [PMID: 30692868 PMCID: PMC6344444 DOI: 10.3897/bdj.7.e28737] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2018] [Accepted: 12/19/2018] [Indexed: 11/28/2022] Open
Abstract
Background A considerable portion of primary biodiversity data is digitally locked inside published literature which is often stored as pdf files. Large-scale approaches to biodiversity science could benefit from retrieving this information and making it digitally accessible and machine-readable. Nonetheless, the amount and diversity of digitally published literature pose many challenges for knowledge discovery and retrieval. Text mining has been extensively used for data discovery tasks in large quantities of documents. However, text mining approaches for knowledge discovery and retrieval have been limited in biodiversity science compared to other disciplines. New information Here, we present a novel, open source text mining tool, the Biodiversity Observations Miner (BOM). This web application, written in R, allows the semi-automated discovery of punctual biodiversity observations (e.g. biotic interactions, functional or behavioural traits and natural history descriptions) associated with the scientific names present inside a corpus of scientific literature. Furthermore, BOM enables users to rapidly screen large quantities of literature based on word co-occurrences that match custom biodiversity dictionaries. This tool aims to increase the digital mobilisation of primary biodiversity data and is freely accessible via GitHub or through a web server.
Affiliation(s)
- Gabriel Muñoz
- NASUA, Biodiversity research and conservation section, Quito, Ecuador
- Faculty of Arts and Science, Department of Biology, Concordia University, Montreal, Canada
- W Daniel Kissling
- Faculty of Science, Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, Amsterdam, Netherlands
- E Emiel van Loon
- Faculty of Science, Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, Amsterdam, Netherlands
|
21
|
Page R. Liberating links between datasets using lightweight data publishing: an example using plant names and the taxonomic literature. Biodivers Data J 2018:e27539. [PMID: 30065607 PMCID: PMC6066477 DOI: 10.3897/bdj.6.e27539] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2018] [Accepted: 07/11/2018] [Indexed: 11/12/2022] Open
Abstract
Constructing a biodiversity knowledge graph will require making millions of cross links between biodiversity entities in different datasets. Researchers trying to bootstrap the growth of the biodiversity knowledge graph by constructing databases of links between these entities lack obvious ways to publish these sets of links. One appealing and lightweight approach is to create a "datasette", a database that is wrapped together with a simple web server that enables users to query the data. Datasettes can be packaged into Docker containers and hosted online with minimal effort. This approach is illustrated using a dataset of links between globally unique identifiers for plant taxonomic names and identifiers for the taxonomic articles that published those names.
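The "datasette" approach wraps a plain SQLite file of links behind a query interface, so the artifact being published is just a single-table database. Building one takes only the standard library; the LSID-style name identifiers and DOIs below are invented examples, not entries from the actual dataset:

```python
import sqlite3

# Build the kind of single-table links database a datasette would serve.
# The name identifiers and DOIs are made-up illustrations.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE links (name_id TEXT PRIMARY KEY, publication_doi TEXT)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [
        ("urn:lsid:ipni.org:names:0000001-1", "10.1234/example.001"),
        ("urn:lsid:ipni.org:names:0000002-1", "10.1234/example.002"),
    ],
)
conn.commit()

# A user of the wrapped database would issue exactly this kind of query:
# which article published this name?
row = conn.execute(
    "SELECT publication_doi FROM links WHERE name_id = ?",
    ("urn:lsid:ipni.org:names:0000001-1",),
).fetchone()
print(row[0])
```

Because the whole publication is one self-describing file plus a generic query wrapper, hosting it in a container requires no custom server code, which is what makes the approach lightweight.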
Affiliation(s)
- Roderic Page
- University of Glasgow, Glasgow, United Kingdom
|
22
|
Johnston MA, Aalbu RL, Franz NM. An updated checklist of the Tenebrionidae sec. Bousquet et al. 2018 of the Algodones Dunes of California, with comments on checklist data practices. Biodivers Data J 2018:e24927. [PMID: 29942173 PMCID: PMC6013544 DOI: 10.3897/bdj.6.e24927] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2018] [Accepted: 06/11/2018] [Indexed: 11/12/2022] Open
Abstract
Generating regional checklists for insects is frequently based on combining data sources ranging from literature and expert assertions that merely imply the existence of an occurrence to aggregated, standard-compliant data of uniquely identified specimens. The increasing diversity of data sources also means that checklist authors are faced with new responsibilities, effectively acting as filterers to select and utilize an expert-validated subset of all available data. Authors are also faced with the technical obstacle of bringing more occurrences into Darwin Core-based data aggregation, even if the corresponding specimens belong to external institutions. We illustrate these issues based on a partial update of the Kimsey et al. 2017 checklist of darkling beetles - Tenebrionidae sec. Bousquet et al. 2018 - inhabiting the Algodones Dunes of California. Our update entails 54 species-level concepts for this group and region, of which 31 concepts were found to be represented in three specimen-data aggregator portals, based on our interpretations of the aggregators' data. We reassess the distributions and biogeographic affinities of these species, focusing on taxa that are precinctive (highly geographically restricted) to the Lower Colorado River Valley in the context of recent dune formation from the Colorado River. Throughout, we apply taxonomic concept labels (taxonomic name according to source) to contextualize preferred name usages, but also show that the identification data of aggregated occurrences are very rarely well-contextualized or annotated. Doing so is a prerequisite for publishing open, dynamic checklist versions that finely accredit incremental expert efforts spent to improve the quality of checklists and aggregated occurrence data.
Affiliation(s)
- M Andrew Johnston
- Biodiversity Knowledge Integration Center, Arizona State University, Tempe, AZ, United States of America
- Rolf L Aalbu
- California Academy of Sciences, San Francisco, CA, United States of America
- Nico M Franz
- Biodiversity Knowledge Integration Center, Arizona State University, Tempe, AZ, United States of America
|