1
Girón JC, Tarasov S, González Montaña LA, Matentzoglu N, Smith AD, Koch M, Boudinot BE, Bouchard P, Burks R, Vogt L, Yoder M, Osumi-Sutherland D, Friedrich F, Beutel RG, Mikó I. Formalizing Invertebrate Morphological Data: A Descriptive Model for Cuticle-Based Skeleto-Muscular Systems, an Ontology for Insect Anatomy, and their Potential Applications in Biodiversity Research and Informatics. Syst Biol 2023; 72:1084-1100. [PMID: 37094905 DOI: 10.1093/sysbio/syad025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2022] [Revised: 04/17/2023] [Accepted: 04/21/2023] [Indexed: 04/26/2023] Open
Abstract
The spectacular radiation of insects has produced a stunning diversity of phenotypes. During the past 250 years, research on insect systematics has generated hundreds of terms for naming and comparing them. In its current form, this terminological diversity is presented in natural language and lacks formalization, which prohibits computer-assisted comparison using semantic web technologies. Here we propose a Model for Describing Cuticular Anatomical Structures (MoDCAS) which incorporates structural properties and positional relationships for standardized, consistent, and reproducible descriptions of arthropod phenotypes. We applied the MoDCAS framework in creating the ontology for the Anatomy of the Insect Skeleto-Muscular system (AISM). The AISM is the first general insect ontology that aims to cover all taxa by providing generalized, fully logical, and queryable, definitions for each term. It was built using the Ontology Development Kit (ODK), which maximizes interoperability with Uberon (Uberon multispecies anatomy ontology) and other basic ontologies, enhancing the integration of insect anatomy into the broader biological sciences. A template system for adding new terms, extending, and linking the AISM to additional anatomical, phenotypic, genetic, and chemical ontologies is also introduced. 
The AISM is proposed as the backbone for taxon-specific insect ontologies and has potential applications spanning systematic biology and biodiversity informatics, allowing users to: 1) use controlled vocabularies and create semiautomated computer-parsable insect morphological descriptions; 2) integrate insect morphology into broader fields of research, including ontology-informed phylogenetic methods, logical homology hypothesis testing, evo-devo studies, and genotype to phenotype mapping; and 3) automate the extraction of morphological data from the literature, enabling the generation of large-scale phenomic data, by facilitating the production and testing of informatic tools able to extract, link, annotate, and process morphological data. This descriptive model and its ontological applications will allow for clear and semantically interoperable integration of arthropod phenotypes in biodiversity studies.
Affiliation(s)
- Jennifer C Girón
  - Department of Entomology, Purdue University, West Lafayette, IN, USA
  - Natural Science Research Laboratory, Museum of Texas Tech University, Lubbock, TX, USA
- Sergei Tarasov
  - Finnish Museum of Natural History, University of Helsinki, Pohjoinen Rautatiekatu 13, FI-00014 Helsinki, Finland
- Aaron D Smith
  - Department of Entomology, Purdue University, West Lafayette, IN, USA
- Markus Koch
  - Institute of Evolutionary Biology and Ecology, University of Bonn, An der Immenburg 1, 53121 Bonn, Germany
- Brendon E Boudinot
  - Department of Entomology & Nematology, University of California, Davis, One Shields Ave, CA, USA
  - Institut für Zoologie und Evolutionsforschung, Friedrich-Schiller-Universität Jena, Erbertstraße 1, 07743 Jena, Germany
  - Department of Entomology, National Museum of Natural History, Smithsonian Institution, Washington DC, USA
- Patrice Bouchard
  - Biodiversity and Bioresources, Canadian National Collection of Insects, Arachnids and Nematodes, Agriculture and Agri-Food Canada, 960 Carling Avenue, Ottawa, Ontario, K1A 0C6, Canada
- Roger Burks
  - Entomology Department, University of California, Riverside, 900 University Ave. Riverside, CA, USA
- Lars Vogt
  - TIB Leibniz Information Centre for Science and Technology, Welfengarten 1B, 30167 Hannover, Germany
- Matthew Yoder
  - Illinois Natural History Survey, University of Illinois, Champaign, IL, USA
- Frank Friedrich
  - Institut für Zell- und Systembiologie der Tiere, Universität Hamburg, Martin-Luther-King-Platz 3, 20146, Hamburg, Germany
- Rolf G Beutel
  - Institut für Zoologie und Evolutionsforschung, Friedrich-Schiller-Universität Jena, Erbertstraße 1, 07743 Jena, Germany
- István Mikó
  - Department of Biological Sciences, University of New Hampshire, Durham, NH, USA
2
Agosti D, Benichou L, Addink W, Arvanitidis C, Catapano T, Cochrane G, Dillen M, Döring M, Georgiev T, Gérard I, Groom Q, Kishor P, Kroh A, Kvaček J, Mergen P, Mietchen D, Pauperio J, Sautter G, Penev L. Recommendations for use of annotations and persistent identifiers in taxonomy and biodiversity publishing. RESEARCH IDEAS AND OUTCOMES 2022. [DOI: 10.3897/rio.8.e97374] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open
Abstract
The paper summarises many years of discussions and experience of biodiversity publishers, organisations, research projects and individual researchers, and proposes recommendations for implementation of persistent identifiers for article metadata, structural elements (sections, subsections, figures, tables, references, supplementary materials and others) and data specific to biodiversity (taxonomic treatments, treatment citations, taxon names, material citations, gene sequences, specimens, scientific collections) in taxonomy and biodiversity publishing. The paper proposes best practices on how identifiers should be used in the different cases and on how they can be minted, cited, and expressed in the backend article XML to facilitate conversion to and further re-use of the article content as FAIR data. The paper also discusses several specific routes for post-publication re-use of semantically enhanced content through large biodiversity data aggregators such as the Global Biodiversity Information Facility (GBIF), the International Nucleotide Sequence Database Collaboration (INSDC) and others, and proposes specifications of both identifiers and XML tags to be used for that purpose. A summary table provides an account and overview of the recommendations. The guidelines are supported with examples from the existing publishing practices.
3
Dimitrova M, Senderov VE, Georgiev T, Zhelezov G, Penev L. Infrastructure and Population of the OpenBiodiv Biodiversity Knowledge Graph. Biodivers Data J 2021; 9:e67671. [PMID: 34690512 PMCID: PMC8486731 DOI: 10.3897/bdj.9.e67671] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/20/2021] [Accepted: 09/08/2021] [Indexed: 11/12/2022] Open
Abstract
Background: OpenBiodiv is a biodiversity knowledge graph containing a synthetic linked open dataset, OpenBiodiv-LOD, which combines knowledge extracted from academic literature with the taxonomic backbone used by the Global Biodiversity Information Facility. The linked open data is modelled according to the OpenBiodiv-O ontology integrating semantic resource types from recognised biodiversity and publishing ontologies with OpenBiodiv-O resource types, introduced to capture the semantics of resources not modelled before. New information: We introduce the new release of the OpenBiodiv-LOD attained through information extraction and modelling of additional biodiversity entities. It was achieved by further developments to OpenBiodiv-O, the data storage infrastructure and the workflow and accompanying R software packages used for transformation of academic literature into Resource Description Framework (RDF). We discuss how to utilise the LOD in biodiversity informatics and give examples by providing solutions to several competency questions. We investigate performance issues that arise due to the large amount of inferred statements in the graph and conclude that OWL-full inference is impractical for the project and that unnecessary inference should be avoided.
Affiliation(s)
- Mariya Dimitrova
  - Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria
  - Pensoft Publishers, Sofia, Bulgaria
- Viktor E Senderov
  - Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, Sweden
- Teodor Georgiev
  - Pensoft Publishers, Sofia, Bulgaria
- Georgi Zhelezov
  - Pensoft Publishers, Sofia, Bulgaria
- Lyubomir Penev
  - Pensoft Publishers, Sofia, Bulgaria
  - Institute of Biodiversity & Ecosystem Research, Bulgarian Academy of Sciences, Sofia, Bulgaria
4
Lücking A, Driller C, Stoeckel M, Abrami G, Pachzelt A, Mehler A. Multiple annotation for biodiversity: developing an annotation framework among biology, linguistics and text technology. LANG RESOUR EVAL 2021. [DOI: 10.1007/s10579-021-09553-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Biodiversity information is contained in countless digitized and unprocessed scholarly texts. Although automated extraction of these data has been gaining momentum for years, there are still innumerable text sources that are poorly accessible and require a more advanced range of methods to extract relevant information. To improve the access to semantic biodiversity information, we have launched the BIOfid project (www.biofid.de) and have developed a portal to access the semantics of German language biodiversity texts, mainly from the 19th and 20th century. However, to make such a portal work, a couple of methods had to be developed or adapted first. In particular, text-technological information extraction methods were needed, which extract the required information from the texts. Such methods draw on machine learning techniques, which in turn are trained by learning data. To this end, among others, we gathered the bio text corpus, which is a cooperatively built resource, developed by biologists, text technologists, and linguists. A special feature of bio is its multiple annotation approach, which takes into account both general and biology-specific classifications, and by this means goes beyond previous, typically taxon- or ontology-driven proper name detection. We describe the design decisions and the genuine Annotation Hub Framework underlying the bio annotations and present agreement results. The tools used to create the annotations are introduced, and the use of the data in the semantic portal is described. Finally, some general lessons, in particular with multiple annotation projects, are drawn.
5
Kõljalg U, Nilsson HR, Schigel D, Tedersoo L, Larsson KH, May TW, Taylor AFS, Jeppesen TS, Frøslev TG, Lindahl BD, Põldmaa K, Saar I, Suija A, Savchenko A, Yatsiuk I, Adojaan K, Ivanov F, Piirmann T, Pöhönen R, Zirk A, Abarenkov K. The Taxon Hypothesis Paradigm-On the Unambiguous Detection and Communication of Taxa. Microorganisms 2020; 8:E1910. [PMID: 33266327 PMCID: PMC7760934 DOI: 10.3390/microorganisms8121910] [Citation(s) in RCA: 93] [Impact Index Per Article: 23.3] [Received: 10/31/2020] [Accepted: 11/24/2020] [Indexed: 12/27/2022] Open
Abstract
Here, we describe the taxon hypothesis (TH) paradigm, which covers the construction, identification, and communication of taxa as datasets. Defining taxa as datasets of individuals and their traits will make taxon identification and most importantly communication of taxa precise and reproducible. This will allow datasets with standardized and atomized traits to be used digitally in identification pipelines and communicated through persistent identifiers. Such datasets are particularly useful in the context of formally undescribed or even physically undiscovered species if data such as sequences from samples of environmental DNA (eDNA) are available. Implementing the TH paradigm will to some extent remove the impediment to hastily discover and formally describe all extant species in that the TH paradigm allows discovery and communication of new species and other taxa also in the absence of formal descriptions. The TH datasets can be connected to a taxonomic backbone providing access to the vast information associated with the tree of life. In parallel to the description of the TH paradigm, we demonstrate how it is implemented in the UNITE digital taxon communication system. UNITE TH datasets include rich data on individuals and their rDNA ITS sequences. These datasets are equipped with digital object identifiers (DOI) that serve to fix their identity in our communication. All datasets are also connected to a GBIF taxonomic backbone. Researchers processing their eDNA samples using UNITE datasets will, thus, be able to publish their findings as taxon occurrences in the GBIF data portal. UNITE species hypothesis (species level THs) datasets are increasingly utilized in taxon identification pipelines and even formally undescribed species can be identified and communicated by using UNITE. The TH paradigm seeks to achieve unambiguous, unique, and traceable communication of taxa and their properties at any level of the tree of life. 
It offers a rapid way to discover and communicate undescribed species in identification pipelines and data portals before they are lost to the sixth mass extinction.
Affiliation(s)
- Urmas Kõljalg
  - Natural History Museum, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
  - Institute of Ecology and Earth Sciences, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
- Henrik R. Nilsson
  - Department of Biological and Environmental Sciences, Gothenburg Global Biodiversity Centre, University of Gothenburg, Box 461, 405 30 Göteborg, Sweden
- Dmitry Schigel
  - Global Biodiversity Information Facility, 2100 Copenhagen, Denmark
- Leho Tedersoo
  - Institute of Ecology and Earth Sciences, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
- Karl-Henrik Larsson
  - Department of Biological and Environmental Sciences, Gothenburg Global Biodiversity Centre, University of Gothenburg, Box 461, 405 30 Göteborg, Sweden
- Tom W. May
  - Royal Botanic Gardens Victoria, Birdwood Ave, Melbourne, Victoria 3004, Australia
- Andy F. S. Taylor
  - The James Hutton Institute, Craigiebuckler, Aberdeen AB15 8QH, UK
  - Institute of Biological and Environmental Sciences, University of Aberdeen, Cruickshank Building, St Machar Drive, Aberdeen AB24 3UU, UK
- Björn D. Lindahl
  - Systematic Biology, Evolutionary Biology Centre, Uppsala University, Norbyvägen 18D, 75236 Uppsala, Sweden
- Kadri Põldmaa
  - Natural History Museum, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
  - Institute of Ecology and Earth Sciences, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
- Irja Saar
  - Institute of Ecology and Earth Sciences, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
- Ave Suija
  - Natural History Museum, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
  - Institute of Ecology and Earth Sciences, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
- Anton Savchenko
  - Institute of Ecology and Earth Sciences, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
- Iryna Yatsiuk
  - Institute of Ecology and Earth Sciences, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
- Kristjan Adojaan
  - Natural History Museum, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
- Filipp Ivanov
  - Natural History Museum, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
- Timo Piirmann
  - Natural History Museum, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
- Raivo Pöhönen
  - Natural History Museum, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
- Allan Zirk
  - Natural History Museum, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
- Kessy Abarenkov
  - Natural History Museum, University of Tartu, 14a Ravila, 50411 Tartu, Estonia
6
Rivera-Quiroz FA, Petcharad B, Miller JA. Mining data from legacy taxonomic literature and application for sampling spiders of the Teutamus group (Araneae; Liocranidae) in Southeast Asia. Sci Rep 2020; 10:15787. [PMID: 32978432 PMCID: PMC7519673 DOI: 10.1038/s41598-020-72549-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/18/2020] [Accepted: 09/02/2020] [Indexed: 11/12/2022] Open
Abstract
Taxonomic literature contains information about virtually every known species on Earth. In many cases, all that is known about a taxon is contained in this kind of literature, particularly for the most diverse and understudied groups. Taxonomic publications in the aggregate have documented a vast amount of specimen data. Among other things, these data constitute evidence of the existence of a particular taxon within a spatial and temporal context. When knowledge about a particular taxonomic group is rudimentary, investigators motivated to contribute new knowledge can use legacy records to guide them in their search for new specimens in the field. However, these legacy data are in the form of unstructured text, making it difficult to extract and analyze without a human interpreter. Here, we used a combination of semi-automatic tools to extract and categorize specimen data from taxonomic literature of one family of ground spiders (Liocranidae). We tested the application of these data on fieldwork optimization, using the relative abundance of adult specimens reported in literature as a proxy to find the best times and places for collecting the species (Teutamus politus) and its relatives (Teutamus group, TG) within Southeast Asia. Based on these analyses we decided to collect in three provinces in Thailand during the months of June and August. With our approach, we were able to collect more specimens of T. politus (188 specimens, 95 adults) than all the previous records in literature combined (102 specimens). Our approach was also effective for sampling other representatives of the TG, yielding at least one representative of every TG genus previously reported for Thailand. In total, our samples contributed 231 specimens (134 adults) to the 351 specimens previously reported in the literature for this country.
Our results exemplify one application of mined literature data that allows investigators to more efficiently allocate effort and resources for the study of neglected, endangered, or interesting taxa and geographic areas. Furthermore, the integrative workflow demonstrated here shares specimen data with global online resources like Plazi and GBIF, meaning that others can freely reuse these data and contribute to them in the future. The contributions of the present study represent an increase of more than 35% on the taxonomic coverage of the TG in GBIF based on the number of species. Also, our extracted data represents 72% of the occurrences now available through GBIF for the TG and more than 85% of occurrences of T. politus. Taxonomic literature is a key source of undigitized biodiversity data for taxonomic groups that are underrepresented in the current biodiversity data sphere. Mobilizing these data is key to understanding and protecting some of the less well-known domains of biodiversity.
Affiliation(s)
- F Andres Rivera-Quiroz
  - Department of Terrestrial Zoology, Understanding Evolution group, Naturalis Biodiversity Center, Darwinweg 2, 2333CR, Leiden, The Netherlands
  - Institute of Biology Leiden (IBL), Leiden University, Sylviusweg 72, 2333BE, Leiden, The Netherlands
- Booppa Petcharad
  - Faculty of Science and Technology, Thammasat University, Rangsit, 12121, Pathum Thani, Thailand
- Jeremy A Miller
  - Department of Terrestrial Zoology, Understanding Evolution group, Naturalis Biodiversity Center, Darwinweg 2, 2333CR, Leiden, The Netherlands
  - Plazi, Zinggstrasse 16, CH 3007, Bern, Switzerland
7
OpenBiodiv: A Knowledge Graph for Literature-Extracted Linked Open Data in Biodiversity Science. PUBLICATIONS 2019. [DOI: 10.3390/publications7020038] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/16/2022] Open
Abstract
Hundreds of years of biodiversity research have resulted in the accumulation of a substantial pool of communal knowledge; however, most of it is stored in silos isolated from each other, such as published articles or monographs. The need for a system to store and manage collective biodiversity knowledge in a community-agreed and interoperable open format has evolved into the concept of the Open Biodiversity Knowledge Management System (OBKMS). This paper presents OpenBiodiv: An OBKMS that utilizes semantic publishing workflows, text and data mining, common standards, ontology modelling and graph database technologies to establish a robust infrastructure for managing biodiversity knowledge. It is presented as a Linked Open Dataset generated from scientific literature. OpenBiodiv encompasses data extracted from more than 5000 scholarly articles published by Pensoft and many more taxonomic treatments extracted by Plazi from journals of other publishers. The data from both sources are converted to Resource Description Framework (RDF) and integrated in a graph database using the OpenBiodiv-O ontology and an RDF version of the Global Biodiversity Information Facility (GBIF) taxonomic backbone. Through the application of semantic technologies, the project showcases the value of open publishing of Findable, Accessible, Interoperable, Reusable (FAIR) data towards the establishment of open science practices in the biodiversity domain.
8
Faulwetter S, Pafilis E, Fanini L, Bailly N, Agosti D, Arvanitidis C, Boicenco L, Catapano T, Claus S, Dekeyzer S, Georgiev T, Legaki A, Mavraki D, Oulas A, Papastefanou G, Penev L, Sautter G, Schigel D, Senderov V, Teaca A, Tsompanou M. EMODnet Workshop on mechanisms and guidelines to mobilise historical data into biogeographic databases. RESEARCH IDEAS AND OUTCOMES 2016. [DOI: 10.3897/rio.2.e10445] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 12/12/2022] Open
9
Faulwetter S, Pafilis E, Fanini L, Bailly N, Agosti D, Arvanitidis C, Boicenco L, Catapano T, Claus S, Dekeyzer S, Georgiev T, Legaki A, Mavraki D, Oulas A, Papastefanou G, Penev L, Sautter G, Schigel D, Senderov V, Teaca A, Tsompanou M. EMODnet Workshop on mechanisms and guidelines to mobilise historical data into biogeographic databases. RESEARCH IDEAS AND OUTCOMES 2016. [DOI: 10.3897/rio.2.e9774] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022] Open
10
Lyal CHC. Digitising legacy zoological taxonomic literature: Processes, products and using the output. Zookeys 2016:189-206. [PMID: 26877659 PMCID: PMC4741221 DOI: 10.3897/zookeys.550.9702] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 03/26/2015] [Accepted: 03/26/2015] [Indexed: 02/02/2023] Open
Abstract
By digitising legacy taxonomic literature using XML mark-up the contents become accessible to other taxonomic and nomenclatural information systems. Appropriate schemas need to be interoperable with other sectorial schemas, atomise to appropriate content elements and carry appropriate metadata to, for example, enable algorithmic assessment of availability of a name under the Code. Legacy (and new) literature delivered in this fashion will become part of a global taxonomic resource from which users can extract tailored content to meet their particular needs, be they nomenclatural, taxonomic, faunistic or other. To date, most digitisation of taxonomic literature has led to a more or less simple digital copy of a paper original – the output of the many efforts has effectively been an electronic copy of a traditional library. While this has increased accessibility of publications through internet access, the means by which many scientific papers are indexed and located is much the same as with traditional libraries. OCR and born-digital papers allow use of web search engines to locate instances of taxon names and other terms, but OCR efficiency in recognising taxonomic names is still relatively poor, people’s ability to use search engines effectively is mixed, and many papers cannot be searched directly. Instead of building digital analogues of traditional publications, we should consider what properties we require of future taxonomic information access. Ideally the content of each new digital publication should be accessible in the context of all previous published data, and the user able to retrieve nomenclatural, taxonomic and other data / information in the form required without having to scan all of the original papers and extract target content manually. 
This opens the door to dynamic linking of new content with extant systems: automatic population and updating of taxonomic catalogues, ZooBank and faunal lists, all descriptions of a taxon and its children instantly accessible with a single search, comparison of classifications used in different publications, and so on. A means to do this is through marking up content into XML, and the more atomised the mark-up the greater the possibilities for data retrieval and integration. Mark-up requires XML that accommodates the required content elements and is interoperable with other XML schemas, and there are now several written to do this, particularly TaxPub, taxonX and taXMLit, the last of these being the most atomised. We now need to automate this process as far as possible. Manual and automatic data and information retrieval is demonstrated by projects such as INOTAXA and Plazi. As we move to creating and using taxonomic products through the power of the internet, we need to ensure the output, while satisfying in its production the requirements of the Code, is fit for purpose in the future.
Affiliation(s)
- Christopher H C Lyal
  - Life Sciences Department, The Natural History Museum, Cromwell Road, London SW7 5BD, UK
11
Senderov V, Penev L. The Open Biodiversity Knowledge Management System in Scholarly Publishing. RESEARCH IDEAS AND OUTCOMES 2016. [DOI: 10.3897/rio.2.e7757] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Indexed: 01/03/2023] Open
12
Miller JA, Agosti D, Penev L, Sautter G, Georgiev T, Catapano T, Patterson D, King D, Pereira S, Vos RA, Sierra S. Integrating and visualizing primary data from prospective and legacy taxonomic literature. Biodivers Data J 2015; 3:e5063. [PMID: 26023286 PMCID: PMC4442254 DOI: 10.3897/bdj.3.e5063] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 04/09/2015] [Accepted: 05/06/2015] [Indexed: 11/24/2022] Open
Abstract
Specimen data in taxonomic literature are among the highest quality primary biodiversity data. Innovative cybertaxonomic journals are using workflows that maintain data structure and disseminate electronic content to aggregators and other users; such structure is lost in traditional taxonomic publishing. Legacy taxonomic literature is a vast repository of knowledge about biodiversity. Currently, access to that resource is cumbersome, especially for non-specialist data consumers. Markup is a mechanism that makes this content more accessible, and is especially suited to machine analysis. Fine-grained XML (Extensible Markup Language) markup was applied to all (37) open-access articles published in the journal Zootaxa containing treatments on spiders (Order: Araneae). The markup approach was optimized to extract primary specimen data from legacy publications. These data were combined with data from articles containing treatments on spiders published in Biodiversity Data Journal where XML structure is part of the routine publication process. A series of charts was developed to visualize the content of specimen data in XML-tagged taxonomic treatments, either singly or in aggregate. The data can be filtered by several fields (including journal, taxon, institutional collection, collecting country, collector, author, article and treatment) to query particular aspects of the data. We demonstrate here that XML markup using GoldenGATE can address the challenge presented by unstructured legacy data, can extract structured primary biodiversity data which can be aggregated with and jointly queried with data from other Darwin Core-compatible sources, and show how visualization of these data can communicate key information contained in biodiversity literature. We complement recent studies on aspects of biodiversity knowledge using XML structured data to explore 1) the time lag between species discovery and description, and 2) the prevalence of rarity in species descriptions.
Affiliation(s)
- Jeremy A. Miller
  - Naturalis Biodiversity Center, Leiden, Netherlands
  - www.Plazi.org, Bern, Switzerland
- David King
  - The Open University, Milton Keynes, United Kingdom
13
Erwin T, Stoev P, Georgiev T, Penev L. ZooKeys 500: traditions and innovations hand-in-hand servicing our taxonomic community. Zookeys 2015:1-8. [PMID: 25987868 PMCID: PMC4432237 DOI: 10.3897/zookeys.500.9844] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 04/21/2015] [Accepted: 04/22/2015] [Indexed: 11/22/2022] Open
Affiliation(s)
- Terry Erwin
- Hyper-diversity Group, Department of Entomology, MRC-187, National Museum of Natural History, Smithsonian Institution, Washington, P.O. Box 37012, DC 20013-7012, USA
- Pavel Stoev
- Pensoft Publishers, Sofia, Bulgaria ; National Museum of Natural History, Sofia, Bulgaria
- Lyubomir Penev
- Pensoft Publishers, Sofia, Bulgaria ; Institute of Biodiversity and Ecosystem Research, Bulgarian Academy of Sciences, Sofia, Bulgaria
|
14
|
Miller JA, Georgiev T, Stoev P, Sautter G, Penev L. Corrected data re-harvested: curating literature in the era of networked biodiversity informatics. Biodivers Data J 2015:e4552. [PMID: 25632264 PMCID: PMC4304254 DOI: 10.3897/bdj.3.e4552] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 01/20/2015] [Accepted: 01/21/2015] [Indexed: 11/12/2022] Open
Affiliation(s)
- Jeremy A Miller
- Naturalis Biodiversity Center, Leiden, Netherlands ; www.Plazi.org, Bern, Switzerland
- Pavel Stoev
- National Museum of Natural History and Pensoft Publishers, Sofia, Bulgaria
- Lyubomir Penev
- Institute of Biodiversity & Ecosystem Research, Bulgarian Academy of Sciences and Pensoft Publishers, Sofia, Bulgaria
|
15
|
Abstract
Type material is the taxonomic device that ties formal names to the physical specimens that serve as exemplars for the species. For the prokaryotes these are strains submitted to the culture collections; for the eukaryotes they are specimens submitted to museums or herbaria. The NCBI Taxonomy Database (http://www.ncbi.nlm.nih.gov/taxonomy) now includes annotation of type material that we use to flag sequences from type in GenBank and in Genomes. This has important implications for many NCBI resources, some of which are outlined below.
Affiliation(s)
- Scott Federhen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
16
|
Liew TS, Vermeulen JJ, Marzuki MEB, Schilthuizen M. A cybertaxonomic revision of the micro-landsnail genus Plectostoma Adam (Mollusca, Caenogastropoda, Diplommatinidae), from Peninsular Malaysia, Sumatra and Indochina. Zookeys 2014:1-107. [PMID: 24715783 PMCID: PMC3974427 DOI: 10.3897/zookeys.393.6717] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 11/29/2013] [Accepted: 02/27/2014] [Indexed: 11/12/2022] Open
Abstract
Plectostoma is a micro land snail restricted to limestone outcrops in Southeast Asia. Plectostoma was previously classified as a subgenus of Opisthostoma because of the deviation from regular coiling in many species in both taxa. This paper is the first of a two-part revision of the genus Plectostoma, and includes all non-Borneo species. In the present paper, we examined 214 collection samples of 31 species, and obtained 62 references, 290 pictures, and 155 3D-models of 29 Plectostoma species and 51 COI sequences of 19 species. To work with such a variety of taxonomic data, and then to represent it in an integrated, scalable and accessible manner, we adopted up-to-date cybertaxonomic tools. All the taxonomic information, such as references, classification, species descriptions, specimen images, genetic data, and distribution data, was tagged and linked with cyber tools and web servers (e.g. Lifedesks, Google Earth, and Barcoding of Life Database). We elevated Plectostoma from subgenus to genus level based on morphological, ecological and genetic evidence. We revised the existing 21 Plectostoma species and described 10 new species, namely, P. dindingensis sp. n., P. mengaburensis sp. n., P. whitteni sp. n., P. kayiani sp. n., P. davisoni sp. n., P. relauensis sp. n., P. kubuensis sp. n., P. tohchinyawi sp. n., P. tenggekensis sp. n., and P. ikanensis sp. n. All the synthesised, semantic-tagged, and linked taxonomic information is made freely and publicly available online.
Affiliation(s)
- Thor-Seng Liew
- Naturalis Biodiversity Center, P.O. Box 9517, 2300 RA Leiden, The Netherlands ; Institute Biology Leiden, Leiden University, P.O. Box 9516, 2300 RA Leiden, The Netherlands ; Institute for Tropical Biology and Conservation, Universiti Malaysia Sabah, Jalan UMS, 88400, Kota Kinabalu, Sabah, Malaysia ; Rimba, 4 Jalan 1/9D, 43650, Bandar Baru Bangi, Selangor, Malaysia
- Jaap Jan Vermeulen
- Naturalis Biodiversity Center, P.O. Box 9517, 2300 RA Leiden, The Netherlands ; jk.artandscience, Lauwerbes 8, 2318 AT, Leiden, The Netherlands
- Menno Schilthuizen
- Naturalis Biodiversity Center, P.O. Box 9517, 2300 RA Leiden, The Netherlands ; Institute Biology Leiden, Leiden University, P.O. Box 9516, 2300 RA Leiden, The Netherlands ; Institute for Tropical Biology and Conservation, Universiti Malaysia Sabah, Jalan UMS, 88400, Kota Kinabalu, Sabah, Malaysia
|
17
|
Henle K, Bell S, Brotons L, Clobert J, Evans D, Goerg C, Grodzinska-Jurcak M, Gruber B, Haila Y, Henry PY, Huth A, Julliard R, Keil P, Kleyer M, Kotze DJ, Kunin W, Lengyel S, Lin YP, Loyau A, Luck G, Magnuson W, Margules C, Matsinos Y, May P, Sousa-Pinto I, Possingham H, Potts S, Ring I, Pryke J, Samways M, Saunders D, Schmeller D, Simila J, Sommer S, Steffan-Dewenter I, Stoev P, Sykes M, Tóthmérész B, Yam R, Tzanopoulos J, Penev L. Nature Conservation – a new dimension in Open Access publishing bridging science and application. Nature Conservation 2012. [DOI: 10.3897/natureconservation.1.3081] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022]
|
18
|
Remsen D, Knapp S, Georgiev T, Stoev P, Penev L. From text to structured data: Converting a word-processed floristic checklist into Darwin Core Archive format. PhytoKeys 2012; 9:1-13. [PMID: 22371687 PMCID: PMC3281575 DOI: 10.3897/phytokeys.9.2770] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 01/10/2012] [Accepted: 01/27/2012] [Indexed: 05/24/2023]
Abstract
The paper describes a pilot project to convert a conventional floristic checklist, written in a standard word processing program, into structured data in the Darwin Core Archive format. After peer review and editorial acceptance, the final revised version of the checklist was converted into Darwin Core Archive by means of regular expressions and then published both in human-readable form, as a traditional botanical publication, and as Darwin Core Archive data files. The data were published and indexed through the Global Biodiversity Information Facility (GBIF) Integrated Publishing Toolkit (IPT), and significant portions of the text of the paper were used to describe the metadata on IPT. After publication, the data will become available through the GBIF infrastructure and can be re-used on their own or collated with other data.
Affiliation(s)
- David Remsen
- Global Biodiversity Information Facility, Copenhagen, Denmark
- Sandra Knapp
- Department of Botany, Natural History Museum, Cromwell Road, London SW7 5BD, UK
- Pavel Stoev
- National Museum of Natural History & Pensoft Publishers, Sofia, Bulgaria
- Lyubomir Penev
- Institute of Biodiversity and Ecosystem Research & Pensoft Publishers, Sofia, Bulgaria
|
19
|
Berendsohn WG, Güntsch A, Hoffmann N, Kohlbecker A, Luther K, Müller A. Biodiversity information platforms: From standards to interoperability. Zookeys 2011:71-87. [PMID: 22207807 PMCID: PMC3234432 DOI: 10.3897/zookeys.150.2166] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.9] [Received: 09/29/2011] [Accepted: 11/23/2011] [Indexed: 11/19/2022] Open
Abstract
One of the most serious bottlenecks in the scientific workflows of biodiversity sciences is the need to integrate data from different sources, software applications, and services for analysis, visualisation and publication. For more than a quarter of a century the TDWG Biodiversity Information Standards organisation has had a central role in defining and promoting data standards and protocols supporting interoperability between disparate and locally distributed systems. Although often not sufficiently recognized, TDWG standards are the foundation of many popular Biodiversity Informatics applications and infrastructures ranging from small desktop software solutions to large scale international data networks. However, individual scientists and groups of collaborating scientists have difficulties in fully exploiting the potential of standards that are often notoriously complex, lack non-technical documentation, and use different representations and underlying technologies. In the last few years, a series of initiatives such as Scratchpads, the EDIT Platform for Cybertaxonomy, and biowikifarm have started to implement and set up virtual work platforms for biodiversity sciences which shield their users from the complexity of the underlying standards. Apart from being practical work-horses for numerous working processes related to biodiversity sciences, they can be seen as information brokers mediating information between multiple data standards and protocols. The ViBRANT project will further strengthen the flexibility and power of virtual biodiversity working platforms by building software interfaces between them, thus facilitating essential information flows needed for comprehensive data exchange, data indexing, web-publication, and versioning. This work will make an important contribution to the shaping of an international, interoperable, and user-oriented biodiversity information infrastructure.
Affiliation(s)
- W G Berendsohn
- Department of Biodiversity Informatics and Laboratories, Botanic Garden and Botanical Museum Berlin-Dahlem, Freie Universität Berlin, Königin-Luise-Straße 6-8, 14195 Berlin, Germany
|
20
|
Smith VS, Penev L. Collaborative electronic infrastructures to accelerate taxonomic research. Zookeys 2011:1-3. [PMID: 22207803 PMCID: PMC3234428 DOI: 10.3897/zookeys.150.2458] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 11/27/2011] [Accepted: 11/28/2011] [Indexed: 11/30/2022] Open
|
21
|
Abstract
The Platygastroidea Planetary Biodiversity Inventory is a large-scale, multinational effort to significantly advance the taxonomy and systematics of one group of parasitoid wasps. Based on this effort, there are some clear steps that should be taken to increase the efficiency and throughput of the taxonomic process. Increased collaboration among taxonomic specialists can significantly shorten the timeline and add increased rigor to the development of hypotheses of characters and taxa. Species delimitations should make use of multiple data sources, thus providing more nearly independent tests of these hypotheses. Taxonomy should fully embrace electronic media and informatics tools. Particularly, this step requires the development and widespread implementation of community data standards. The barriers to progress in these areas are not technological, but are primarily social. The community needs to see clear evidence of the value added through these changes in procedures and insist upon their use as standard practice.
|