1
Cho MH, Cho KH, No KT. PhyloSophos: a high-throughput scientific name mapping algorithm augmented with explicit consideration of taxonomic science, and its application on natural product (NP) occurrence database processing. BMC Bioinformatics 2023; 24:475. [PMID: 38097955 PMCID: PMC10722791 DOI: 10.1186/s12859-023-05588-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Accepted: 11/29/2023] [Indexed: 12/17/2023] Open
Abstract
BACKGROUND: The standardization of biological data using unique identifiers is vital for seamless data integration, comprehensive interpretation, and reproducibility of research findings, contributing to advancements in bioinformatics and systems biology. Although scientific names are widely accepted as universal identifiers for biological species, they have inherent limitations, including a lack of stability, uniqueness, and convertibility. These limitations hinder their effective use as identifiers in databases, particularly natural product (NP) occurrence databases, and pose a substantial obstacle to using these valuable data in large-scale research applications.
RESULT: To address these challenges and facilitate high-throughput analysis of biological data involving scientific names, we developed PhyloSophos, a Python package that considers the properties of scientific names and taxonomic systems to accurately map name inputs to entries within a chosen reference database. Using NP occurrence databases as an example, we illustrate the importance of assessing multiple taxonomic databases and of taxonomic syntax-based pre-processing, with the ultimate goal of integrating heterogeneous information into a single, unified dataset.
CONCLUSIONS: We anticipate that PhyloSophos will significantly aid the systematic processing of poorly digitized and curated biological data, such as biodiversity information and ethnopharmacological resources, enabling full-scale bioinformatics analysis of these valuable data resources.
Affiliation(s)
- Min Hyung Cho
  - Bioinformatics and Molecular Design Research Center (BMDRC), 209, Veritas A Hall, Yonsei University, 85 Songdogwahak-ro, Yeonsu-gu, Incheon, 21983, Republic of Korea
- Kwang-Hwi Cho
  - School of Systems Biomedical Science, Soongsil University, Seoul, 06978, South Korea
- Kyoung Tai No
  - Bioinformatics and Molecular Design Research Center (BMDRC), 209, Veritas A Hall, Yonsei University, 85 Songdogwahak-ro, Yeonsu-gu, Incheon, 21983, Republic of Korea
  - Department of Integrative Biotechnology and Translational Medicine, 214, Veritas A Hall, Yonsei University, 85 Songdogwahak-ro, Yeonsu-gu, Incheon, 21983, Republic of Korea
2
Seah BKB. Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers. Biodivers Data J 2023; 11:e114076. [PMID: 38312332 PMCID: PMC10838036 DOI: 10.3897/bdj.11.e114076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Accepted: 11/06/2023] [Indexed: 02/06/2024] Open
Abstract
Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (same taxon under different names). Therefore, most projects will require some curation to ensure that taxon identifiers are correctly linked. Unfortunately, formal guidance on such curation is uncommon and these steps are often ad hoc and poorly documented, which hinders transparency and reproducibility, yet the task requires specialist knowledge and cannot be easily automated without careful validation. Here, we present a case study on linking identifiers between the GBIF and NCBI taxonomies for a species checklist. This represents a common scenario: finding published sequence data (from NCBI) for species chosen by occurrence or geographical distribution (from GBIF). Wikidata, a publicly editable knowledge base of structured data, can serve as an additional information source for identifier linking. We suggest a software toolkit for taxon name-matching and data-cleaning, describe common issues encountered during curation and propose concrete steps to address them. For example, about 2.8% of the taxa in our dataset had wrong identifiers linked on Wikidata because of errors in name-matching caused by homonyms. By correcting such errors during data-cleaning, either directly (through editing Wikidata) or indirectly (by reporting errors in GBIF or NCBI), we crowdsource the curation and contribute to community resources, thereby improving the quality of downstream analyses.
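The homonym problem described in this abstract can be illustrated with a minimal sketch. It is not the toolkit the author proposes: the record layout, the `link_by_name` helper, and all identifiers are invented for illustration. The only real-world fact used is that *Morus* names both a plant genus (mulberries) and a bird genus (gannets). Matching on name plus kingdom, and sending anything that is not a unique match to manual review, is one simple way to avoid linking the wrong identifier.

```python
from collections import defaultdict

def link_by_name(source, target):
    """Link records by name, using kingdom to separate homonyms.

    Each record is a dict with hypothetical keys: name, id, kingdom.
    Returns (linked, review): unique matches, and source IDs needing curation.
    """
    by_name = defaultdict(list)
    for rec in target:
        by_name[rec["name"]].append(rec)

    linked, review = {}, []
    for rec in source:
        candidates = by_name.get(rec["name"], [])
        same_kingdom = [c for c in candidates if c["kingdom"] == rec["kingdom"]]
        if len(same_kingdom) == 1:
            linked[rec["id"]] = same_kingdom[0]["id"]
        else:
            # No match, or still ambiguous: leave for a human curator.
            review.append(rec["id"])
    return linked, review

# Invented identifiers; only the Morus homonymy itself is real.
source = [
    {"name": "Morus", "id": "src:1", "kingdom": "Plantae"},
    {"name": "Unmatched name", "id": "src:2", "kingdom": "Plantae"},
]
target = [
    {"name": "Morus", "id": "tgt:plant", "kingdom": "Plantae"},
    {"name": "Morus", "id": "tgt:bird", "kingdom": "Animalia"},
]
linked, review = link_by_name(source, target)
print(linked, review)
```

A name-only join would have returned both target records for "Morus"; the kingdom check is the smallest piece of extra context that resolves this particular case, and real curation typically needs more (authorship, rank, synonymy).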
Affiliation(s)
- Brandon Kwee Boon Seah
  - Thünen Institute for Biodiversity, Braunschweig, Germany
3
Brown MJM, Walker BE, Black N, Govaerts RHA, Ondo I, Turner R, Nic Lughadha E. rWCVP: a companion R package for the World Checklist of Vascular Plants. The New Phytologist 2023; 240:1355-1365. [PMID: 37289204 DOI: 10.1111/nph.18919] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Accepted: 01/06/2023] [Indexed: 06/09/2023]
Abstract
The World Checklist of Vascular Plants (WCVP) is an extremely valuable resource that is being used to address many fundamental and applied questions in plant science, conservation, ecology and evolution. However, databases of this size require data manipulation skills that pose a barrier to many potential users. Here, we present rWCVP, an open-source R package that aims to facilitate the use of the WCVP by providing clear, intuitive functions to execute many common tasks. These functions include taxonomic name reconciliation, geospatial integration, mapping and generation of multiple different summaries of the WCVP in both data and report format. We have included extensive documentation and tutorials, providing step-by-step guides that are accessible even to users with minimal programming experience. rWCVP is available on CRAN and GitHub.
Affiliation(s)
- Ian Ondo
  - Royal Botanic Gardens, Kew, Richmond, TW9 3AB, UK
4
Patterson D. The scope and scale of the life sciences (‘Nature’s envelope’). Research Ideas and Outcomes 2022. [DOI: 10.3897/rio.8.e96132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022] Open
Abstract
The extension of biology with a more data-centric component offers new opportunities for discovery. To enable investigations that rely on third-party data, the infrastructure that retains data and allows their re-use should, arguably, enable transactions that relate to any and all biological processes. The assembly of such a service-oriented and enabling infrastructure is challenging. Part of the challenge is to factor in the scope and scale of biological processes. From this foundation can emerge an estimate of the number of discipline-specific centres which will gather data in their given area of interest and prepare them for a path that will lead to trusted, persistent data repositories which will make fit-for-purpose data available for re-use. A simple model is presented for the scope and scale of life sciences. It can accommodate all known processes conducted by or caused by any and all organisms. It is depicted on a grid, the axes of which are (x) the durations of the processes and (y) the sizes of participants involved. Both axes are presented in log10 scales, and the grid is divided into decadal blocks with tenfold increments of time and size. Processes range in duration from 10^-17 seconds to 3.5 billion years or more, and the sizes of participants range from 10^-15 to 1.3 × 10^7 metres. Examples are given to illustrate the diversity of biological processes and their often inexact character. About half of the blocks within the grid do not contain known processes. The blocks that include biological processes amount to ‘Nature’s envelope’, a valuable rhetorical device onto which subdisciplines and existing initiatives may be mapped, and from which can be derived some key requirements for a comprehensive data infrastructure.
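As a rough worked example of the grid described in this abstract, the sketch below counts how many decadal (power-of-ten) blocks each axis spans between the stated extremes. The helper names and the use of a 365.25-day year are assumptions made here for the arithmetic, and the paper's own grid may be drawn with different bounds.

```python
import math

# Assumption: a Julian year (365.25 days) for converting years to seconds.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def decade_index(value: float) -> int:
    """Return the power-of-ten block that contains a positive value."""
    return math.floor(math.log10(value))

def decades_spanned(min_exponent: int, max_value: float) -> int:
    """Count decadal blocks from 10**min_exponent up to the block holding max_value."""
    return decade_index(max_value) - min_exponent + 1

# Durations: 10^-17 seconds to 3.5 billion years.
duration_blocks = decades_spanned(-17, 3.5e9 * SECONDS_PER_YEAR)
# Sizes: 10^-15 metres to 1.3 x 10^7 metres.
size_blocks = decades_spanned(-15, 1.3e7)

print(duration_blocks, size_blocks, duration_blocks * size_blocks)
# prints: 35 23 805
```

Under these assumptions the grid has 35 × 23 = 805 blocks, which gives a sense of scale for the abstract's statement that about half of the blocks contain no known processes.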
5
Grenié M, Berti E, Carvajal‐Quintero J, Dädlow GML, Sagouis A, Winter M. Harmonizing taxon names in biodiversity data: a review of tools, databases, and best practices. Methods Ecol Evol 2022. [DOI: 10.1111/2041-210x.13802] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 11/30/2022]
Affiliation(s)
- Matthias Grenié
  - German Centre for Integrative Biodiversity Research (iDiv) Halle‐Jena‐Leipzig, Puschstraße 4, 04103 Leipzig, Germany
  - Leipzig University, Ritterstraße 26, 04109 Leipzig, Germany
- Emilio Berti
  - German Centre for Integrative Biodiversity Research (iDiv) Halle‐Jena‐Leipzig, Puschstraße 4, 04103 Leipzig, Germany
  - Friedrich‐Schiller University Jena, Jena, Germany
- Juan Carvajal‐Quintero
  - German Centre for Integrative Biodiversity Research (iDiv) Halle‐Jena‐Leipzig, Puschstraße 4, 04103 Leipzig, Germany
  - Leipzig University, Ritterstraße 26, 04109 Leipzig, Germany
- Gala Mona Louise Dädlow
  - German Centre for Integrative Biodiversity Research (iDiv) Halle‐Jena‐Leipzig, Puschstraße 4, 04103 Leipzig, Germany
  - Leipzig University, Ritterstraße 26, 04109 Leipzig, Germany
- Alban Sagouis
  - German Centre for Integrative Biodiversity Research (iDiv) Halle‐Jena‐Leipzig, Puschstraße 4, 04103 Leipzig, Germany
  - Department of Computer Science, Martin Luther University Halle‐Wittenberg, Halle, Germany
- Marten Winter
  - German Centre for Integrative Biodiversity Research (iDiv) Halle‐Jena‐Leipzig, Puschstraße 4, 04103 Leipzig, Germany
  - Leipzig University, Ritterstraße 26, 04109 Leipzig, Germany
6
Durso AM, Ruiz de Castañeda R, Montalcini C, Mondardini MR, Fernandez-Marques JL, Grey F, Müller MM, Uetz P, Marshall BM, Gray RJ, Smith CE, Becker D, Pingleton M, Louies J, Abegg AD, Akuboy J, Alcoba G, Daltry JC, Entiauspe-Neto OM, Freed P, de Freitas MA, Glaudas X, Huang S, Huang T, Kalki Y, Kojima Y, Laudisoit A, Limbu KP, Martínez-Fonseca JG, Mebert K, Rödel MO, Ruane S, Ruedi M, Schmitz A, Tatum SA, Tillack F, Visvanathan A, Wüster W, Bolon I. Citizen science and online data: Opportunities and challenges for snake ecology and action against snakebite. Toxicon X 2021; 9-10:100071. [PMID: 34278294 PMCID: PMC8264216 DOI: 10.1016/j.toxcx.2021.100071] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/31/2021] [Revised: 06/10/2021] [Accepted: 06/15/2021] [Indexed: 12/03/2022] Open
Abstract
The secretive behavior and life history of snakes make studying their biology, distribution, and the epidemiology of venomous snakebite challenging. Photographs are among the most useful, most versatile, and easiest to collect types of biological data, particularly those that are connected with geographic location and date-time metadata. Photos verify occurrence records, provide data on phenotypes and ecology, and are often used to illustrate new species descriptions, field guides and identification keys, as well as in training humans and computer vision algorithms to identify snakes. We scoured eleven online and two offline sources of snake photos in an attempt to collect as many photos of as many snake species as possible, and we attempt to explain some of the inter-species variation in photograph quantity among global regions and taxonomic groups, and with regard to medical importance, human population density, and range size. We collected a total of 725,565 photos, with between 1 and 48,696 photos for each of 3098 of the world's 3879 snake species (79.9%), leaving 781 "most wanted" species with no photos (20.1% of all currently-described species as of the December 2020 release of The Reptile Database). We provide a list of most wanted species sortable by family, continent, authority, and medical importance, and encourage snake photographers worldwide to submit photos and associated metadata, particularly of "missing" species, to the most permanent and useful online archives: The Reptile Database, iNaturalist, and HerpMapper.
Affiliation(s)
- Andrew M. Durso
  - Department of Biological Sciences, Florida Gulf Coast University, Ft. Myers, FL, USA
  - Institute of Global Health, Department of Community Health and Medicine, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Rafael Ruiz de Castañeda
  - Institute of Global Health, Department of Community Health and Medicine, Faculty of Medicine, University of Geneva, Geneva, Switzerland
  - World Health Organization, Geneva, Switzerland
- M. Rosa Mondardini
  - Citizen Science Center Zürich (ETH Zürich and University of Zürich), Zürich, Switzerland
- Peter Uetz
  - The Reptile Database, Richmond, VA, USA
  - Virginia Commonwealth University, Richmond, VA, USA
- Arthur D. Abegg
  - Instituto Butantan, São Paulo, São Paulo, Brazil
  - University of São Paulo, São Paulo, São Paulo, Brazil
- Jeannot Akuboy
  - University of Kisangani, Kisangani, Democratic Republic of the Congo
- Jennifer C. Daltry
  - Flora & Fauna International, Cambridge, England, UK
  - Global Wildlife Conservation, Austin, TX, USA
- Paul Freed
  - The Reptile Database, Richmond, VA, USA
  - Reptile Database, Scotts Mills, OR, USA
- Xavier Glaudas
  - University of the Witwatersrand, Johannesburg, South Africa
  - Bangor University, Bangor, Wales, UK
- Song Huang
  - Anhui Normal University, Wuhu, Anhui, China
- Yatin Kalki
  - Madras Crocodile Bank Trust, Mahabalipuram, Tamil Nadu, India
- Konrad Mebert
  - Global Biology, Birr, Switzerland
  - Institute of Development, Ecology, Conservation & Cooperation, Rome, Italy
- Mark-Oliver Rödel
  - Museum für Naturkunde - Leibniz Institute for Evolution and Biodiversity Science, Berlin, Germany
- Manuel Ruedi
  - Museum d'Histoire naturelle Geneve, Geneva, Switzerland
- Frank Tillack
  - Museum für Naturkunde - Leibniz Institute for Evolution and Biodiversity Science, Berlin, Germany
- Wolfgang Wüster
  - Molecular Ecology and Fisheries Genetics Laboratory, School of Natural Sciences, Bangor University, Bangor, Wales, UK
- Isabelle Bolon
  - Institute of Global Health, Department of Community Health and Medicine, Faculty of Medicine, University of Geneva, Geneva, Switzerland
7
Stribling JB, Leppo EW. Relationship of taxonomic error to frequency of observation. PLoS One 2020; 15:e0241933. [PMID: 33180842 PMCID: PMC7660486 DOI: 10.1371/journal.pone.0241933] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/01/2020] [Accepted: 10/22/2020] [Indexed: 11/18/2022] Open
Abstract
Biological nomenclature is the entry point to a wealth of information related to or associated with living entities. When it is applied accurately and consistently, communication between and among researchers and investigators is enhanced, leading to advancements in understanding and progress in research programs. Based on freshwater benthic macroinvertebrate taxonomic identifications, inter-laboratory comparisons of >900 samples taken from rivers, streams, and lakes across the U.S., including the Great Lakes, provided data on taxon-specific error rates. Using the error rates in combination with frequency of observation (FREQ; as a surrogate for rarity), six uncertainty/frequency classes (UFC) are proposed for approximately 1,000 taxa. The UFC, error rates, and FREQ are each potentially useful for additional analyses related to interpreting biological assessment results and/or stressor response relationships, as weighting factors for various aspects of ecological condition or biodiversity analyses, and for helping to set direction for taxonomic research and to refine identification tools.
Affiliation(s)
- James B. Stribling
  - Tetra Tech, Incorporated Center for Ecological Sciences, Owings Mills, Maryland, United States of America
- Erik W. Leppo
  - Tetra Tech, Incorporated Center for Ecological Sciences, Owings Mills, Maryland, United States of America
8
Campbell DL, Thessen AE, Ries L. A novel curation system to facilitate data integration across regional citizen science survey programs. PeerJ 2020; 8:e9219. [PMID: 32821528 PMCID: PMC7395600 DOI: 10.7717/peerj.9219] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/30/2019] [Accepted: 04/28/2020] [Indexed: 11/20/2022] Open
Abstract
Integrative modeling methods can now enable macrosystem-level understandings of biodiversity patterns, such as range changes resulting from shifts in climate or land use, by aggregating species-level data across multiple monitoring sources. This requires ensuring that taxon interpretations match up across different sources. While encouraging checklist standardization is certainly an option, coercing programs to change species lists they have used consistently for decades is rarely successful. Here we demonstrate a novel approach for tracking equivalent names and concepts, applied to a network of 10 regional programs that use the same protocols (so-called “Pollard walks”) to monitor butterflies across America north of Mexico. Our system involves, for each monitoring program, associating the taxonomic authority (in this case one of three North American butterfly fauna treatments: Pelham, 2014; North American Butterfly Association, Inc., 2016; Opler & Warren, 2003) that shares the most similar overall taxonomic interpretation to the program’s working species list. This allows us to define each term on each program’s list in the context of the appropriate authority’s species concept and curate the term alongside its authoritative concept. We then aligned the names representing equivalent taxonomic concepts among the three authorities. These stepping stones allow us to bridge a species concept from one program’s species list to the name of the equivalent in any other program, through the intermediary scaffolding of aligned authoritative taxon concepts. Using a software tool we developed to access our curation system, a user can link equivalent species concepts between data collecting agencies with no specialized knowledge of taxonomic complexities.
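The "stepping stone" idea in this abstract can be sketched as two lookups chained together: a program's name is read in the context of its associated authority, resolved to an aligned concept, and then resolved back to whichever name the target program uses for that concept. This is a minimal illustration, not the authors' software tool; every program, authority, name, and concept ID below is a hypothetical placeholder.

```python
# Hypothetical placeholders throughout: none of these are real programs,
# authorities, or taxon names.

# Each monitoring program is associated with the authority closest to its list.
program_authority = {"program_1": "authority_X", "program_2": "authority_Y"}

# Each program's working species list.
program_names = {"program_1": ["Genus alpha"], "program_2": ["Genus beta"]}

# Alignment across authorities: (authority, name) -> shared concept ID.
aligned_concept = {
    ("authority_X", "Genus alpha"): "concept_1",
    ("authority_Y", "Genus beta"): "concept_1",
}

def translate(name, source_program, target_program):
    """Return the target program's name for the same concept, or None."""
    concept = aligned_concept.get((program_authority[source_program], name))
    if concept is None:
        return None  # name not curated under the source authority
    target_authority = program_authority[target_program]
    for candidate in program_names[target_program]:
        if aligned_concept.get((target_authority, candidate)) == concept:
            return candidate
    return None  # target program does not record this concept

result = translate("Genus alpha", "program_1", "program_2")
print(result)
```

Keeping the alignment at the authority level means each program only needs to be curated once against its own authority, rather than pairwise against every other program.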
Affiliation(s)
- Dana L Campbell
  - Division of Biological Sciences, School of STEM, University of Washington, Bothell, WA, USA
- Anne E Thessen
  - The Ronin Institute for Independent Scholarship, Montclair, NJ, USA
  - Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA
- Leslie Ries
  - Department of Biology, Georgetown University, Washington, DC, USA
9
Walton S, Livermore L, Bánki O, Cubey R, Drinkwater R, Englund M, Goble C, Groom Q, Kermorvant C, Rey I, Santos C, Scott B, Williams A, Wu Z. Landscape Analysis for the Specimen Data Refinery. Research Ideas and Outcomes 2020. [DOI: 10.3897/rio.6.e57602] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 12/23/2022] Open
Abstract
This report reviews the current state-of-the-art applied approaches on automated tools, services and workflows for extracting information from images of natural history specimens and their labels. We consider the potential for repurposing existing tools, including workflow management systems; and areas where more development is required. This paper was written as part of the SYNTHESYS+ project for software development teams and informatics teams working on new software-based approaches to improve mass digitisation of natural history specimens.
10
Portik DM, Wiens JJ. SuperCRUNCH: A bioinformatics toolkit for creating and manipulating supermatrices and other large phylogenetic datasets. Methods Ecol Evol 2020. [DOI: 10.1111/2041-210x.13392] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 01/08/2023]
Affiliation(s)
- Daniel M. Portik
  - Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA
  - California Academy of Sciences, San Francisco, CA, USA
- John J. Wiens
  - Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, USA
11
Bioinformatics for Marine Products: An Overview of Resources, Bottlenecks, and Perspectives. Mar Drugs 2019; 17:md17100576. [PMID: 31614509 PMCID: PMC6835618 DOI: 10.3390/md17100576] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 08/20/2019] [Revised: 10/01/2019] [Accepted: 10/02/2019] [Indexed: 12/13/2022] Open
Abstract
The sea represents a major source of biodiversity. It exhibits many different ecosystems in a huge variety of environmental conditions where marine organisms have evolved with extensive diversification of structures and functions, making the marine environment a treasure trove of molecules with potential for biotechnological applications and innovation in many different areas. Rapid progress of the omics sciences has revealed novel opportunities to advance the knowledge of biological systems, paving the way for an unprecedented revolution in the field and expanding marine research from model organisms to an increasing number of marine species. Multi-level approaches based on molecular investigations at genomic, metagenomic, transcriptomic, metatranscriptomic, proteomic, and metabolomic levels are essential to discover marine resources and further explore key molecular processes involved in their production and action. As a consequence, omics approaches, accompanied by the associated bioinformatic resources and computational tools for molecular analyses and modeling, are boosting the rapid advancement of biotechnologies. In this review, we provide an overview of the most relevant bioinformatic resources and major approaches, highlighting perspectives and bottlenecks for an appropriate exploitation of these opportunities for biotechnology applications from marine resources.
12
Minelli A. The galaxy of the non-Linnaean nomenclature. History and Philosophy of the Life Sciences 2019; 41:31. [PMID: 31435827 DOI: 10.1007/s40656-019-0271-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 05/07/2019] [Accepted: 08/08/2019] [Indexed: 06/10/2023]
Abstract
Contrary to the traditional claim that needs for unambiguous communication about animal and plant species are best served by a single set of names (Linnaean nomenclature) ruled by international Codes, I suggest that a more diversified system is required, especially to cope with problems emerging from aggregation of biodiversity data in large databases. Departures from Linnaean nomenclature are sometimes intentional, but there are also other, less obvious but widespread forms of not Code-compliant grey nomenclature. A first problem is due to the circumstance that the Codes are intended to rule over the way names are applied to species and other taxonomic units, whereas users of taxonomy need names to be applied to specimens. For different reasons, it is often impossible to refer a specimen with certainty to a named species, and in those cases an open nomenclature is employed. Second, molecular taxonomy leads to the discovery of clusters of gene sequence diversity not necessarily equivalent to the species recognized and named by taxonomists. Those clusters are mostly indicated with informal names or formulas that challenge comparison between different publications or databases. In several instances, it is not even clear if a formula refers to an individual voucher specimen, or is a provisional species name. The use of non-Linnaean names and formulas must be revised and strengthened by fixing standard formats for the different kinds of objects or hypotheses and providing permanent association of 'grey names' with standardized source information such as author and year. In the context of a broad-scope revisitation of aims and scope of scientific nomenclature, it may be worth rethinking if natural objects like plant galls and lichens, although other than the 'single-entity' objects traditionally covered by biological classifications, may nevertheless deserve taxonomic names.
Affiliation(s)
- Alessandro Minelli
  - Department of Biology, University of Padova, Via Ugo Bassi 58 B, 35131, Padua, Italy
13
A dataset of egg size and shape from more than 6,700 insect species. Sci Data 2019; 6:104. [PMID: 31270334 PMCID: PMC6610123 DOI: 10.1038/s41597-019-0049-y] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 12/03/2018] [Accepted: 01/25/2019] [Indexed: 12/20/2022] Open
Abstract
Offspring size is a fundamental trait in disparate biological fields of study. This trait can be measured as the size of plant seeds, animal eggs, or live young, and it influences ecological interactions, organism fitness, maternal investment, and embryonic development. Although multiple evolutionary processes have been predicted to drive the evolution of offspring size, the phylogenetic distribution of this trait remains poorly understood, due to the difficulty of reliably collecting and comparing offspring size data from many species. Here we present a dataset of 10,449 morphological descriptions of insect eggs, with records for 6,706 unique insect species and representatives from every extant hexapod order. The dataset includes eggs whose volumes span more than eight orders of magnitude. We created this dataset by partially automating the extraction of egg traits from the primary literature. In the process, we overcame challenges associated with large-scale phenotyping by designing and employing custom bioinformatic solutions to common problems. We matched the taxa in this dataset to the currently accepted scientific names in taxonomic and genetic databases, which will facilitate the use of these data for testing pressing evolutionary hypotheses in offspring size evolution.
Design Type(s) | software development objective • morphology-based phylogenetic analysis objective • species comparison design
Measurement Type(s) | morphology
Technology Type(s) | digital curation
Factor Type(s) | shape • size
Sample Characteristic(s) | Hexapoda • egg
Machine-accessible metadata file describing the reported data (ISA-Tab format)
14
Insect egg size and shape evolve with ecology but not developmental rate. Nature 2019; 571:58-62. [PMID: 31270484 DOI: 10.1038/s41586-019-1302-4] [Citation(s) in RCA: 56] [Impact Index Per Article: 11.2] [Received: 11/28/2018] [Accepted: 05/14/2019] [Indexed: 12/25/2022]
Abstract
Over the course of evolution, organism size has diversified markedly. Changes in size are thought to have occurred because of developmental, morphological and/or ecological pressures. To perform phylogenetic tests of the potential effects of these pressures, here we generated a dataset of more than ten thousand descriptions of insect eggs, and combined these with genetic and life-history datasets. We show that, across eight orders of magnitude of variation in egg volume, the relationship between size and shape itself evolves, such that previously predicted global patterns of scaling do not adequately explain the diversity in egg shapes. We show that egg size is not correlated with developmental rate and that, for many insects, egg size is not correlated with adult body size. Instead, we find that the evolution of parasitoidism and aquatic oviposition help to explain the diversification in the size and shape of insect eggs. Our study suggests that where eggs are laid, rather than universal allometric constants, underlies the evolution of insect egg size and shape.
15
Stucky BJ, Balhoff JP, Barve N, Barve V, Brenskelle L, Brush MH, Dahlem GA, Gilbert JDJ, Kawahara AY, Keller O, Lucky A, Mayhew PJ, Plotkin D, Seltmann KC, Talamas E, Vaidya G, Walls R, Yoder M, Zhang G, Guralnick R. Developing a vocabulary and ontology for modeling insect natural history data: example data, use cases, and competency questions. Biodivers Data J 2019; 7:e33303. [PMID: 30918448 PMCID: PMC6426826 DOI: 10.3897/bdj.7.e33303] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/24/2019] [Accepted: 02/28/2019] [Indexed: 11/12/2022] Open
Abstract
Insects are possibly the most taxonomically and ecologically diverse class of multicellular organisms on Earth. Consequently, they provide nearly unlimited opportunities to develop and test ecological and evolutionary hypotheses. Currently, however, large-scale studies of insect ecology, behavior, and trait evolution are impeded by the difficulty in obtaining and analyzing data derived from natural history observations of insects. These data are typically highly heterogeneous and widely scattered among many sources, which makes developing robust information systems to aggregate and disseminate them a significant challenge. As a step towards this goal, we report initial results of a new effort to develop a standardized vocabulary and ontology for insect natural history data. In particular, we describe a new database of representative insect natural history data derived from multiple sources (but focused on data from specimens in biological collections), an analysis of the abstract conceptual areas required for a comprehensive ontology of insect natural history data, and a database of use cases and competency questions to guide the development of data systems for insect natural history data. We also discuss data modeling and technology-related challenges that must be overcome to implement robust integration of insect natural history data.
Collapse
Affiliation(s)
- Brian J. Stucky
  - Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
- James P. Balhoff
  - Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC, United States of America
- Narayani Barve
  - Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
- Vijay Barve
  - Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
- Laura Brenskelle
  - Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
- Matthew H. Brush
  - Oregon Health and Science University, Portland, OR, United States of America
- Gregory A Dahlem
  - Department of Biological Sciences, Northern Kentucky University, Highland Heights, KY, United States of America
- James D. J. Gilbert
  - Department of Biological and Marine Sciences, University of Hull, Hull, United Kingdom
- Akito Y. Kawahara
  - Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
  - Entomology and Nematology Department, University of Florida, Gainesville, FL, United States of America
- Oliver Keller
  - Entomology and Nematology Department, University of Florida, Gainesville, FL, United States of America
- Andrea Lucky
  - Entomology and Nematology Department, University of Florida, Gainesville, FL, United States of America
- Peter J. Mayhew
  - Department of Biology, University of York, York, United Kingdom
- David Plotkin
  - Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
- Elijah Talamas
  - Florida Department of Agriculture and Consumer Services, Gainesville, FL, United States of America
- Gaurav Vaidya
  - Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
- Ramona Walls
  - Bio5 and CyVerse, University of Arizona, Tucson, AZ, United States of America
- Matt Yoder
  - Species File Group, Illinois Natural History Survey, University of Illinois, Champaign, IL, United States of America
- Guanyang Zhang
  - Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
- Rob Guralnick
  - Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
|
16
|
Jackson LM, Fernando PC, Hanscom JS, Balhoff JP, Mabee PM. Automated Integration of Trees and Traits: A Case Study Using Paired Fin Loss Across Teleost Fishes. Syst Biol 2018; 67:559-575. [PMID: 29325126 PMCID: PMC6005059 DOI: 10.1093/sysbio/syx098] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/19/2017] [Revised: 12/15/2017] [Accepted: 12/21/2017] [Indexed: 11/24/2022] Open
Abstract
Data synthesis required for large-scale macroevolutionary studies is challenging with the current tools available for integration. Using a classic question regarding the frequency of paired fin loss in teleost fishes as a case study, we sought to create automated methods to facilitate the integration of broad-scale trait data with a sizable species-level phylogeny. Similar to the evolutionary pattern previously described for limbs, pelvic and pectoral fin reduction and loss are thought to have occurred independently multiple times in the evolution of fishes. We developed a bioinformatics pipeline to identify the presence and absence of pectoral and pelvic fins in 12,582 species. To do this, we integrated a synthetic morphological supermatrix of phenotypic data for the pectoral and pelvic fins of teleost fishes from the Phenoscape Knowledgebase (two presence/absence characters for 3047 taxa) with a species-level tree for teleost fishes from the Open Tree of Life project (38,419 species). The integration method detailed herein combines ontological inference with phylogenetic propagation to reduce overall data loss. Using inference enabled by ontology-based annotations, missing data were reduced from 98.0% to 85.9%, and further reduced to 34.8% by phylogenetic data propagation. These methods allowed us to extend the data to an additional 11,293 species for a total of 12,582 species with trait data. The pectoral fin appears to have been independently lost in a minimum of 19 lineages and the pelvic fin in 48. Though interpretation is limited by lack of phylogenetic resolution at the species level, it appears that following loss, pectoral and pelvic fins were regained several (3) and many (14) times, respectively. Focused investigation into putative regains of the pectoral fin, all within one clade (Anguilliformes), showed that the pectoral fin was regained at least twice following loss. Overall, this study points to specific teleost clades where strategic phylogenetic resolution and genetic investigation will be necessary to understand the pattern and frequency of pectoral fin reversals.
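The phylogenetic propagation step described in this abstract can be illustrated with a minimal sketch. It assumes that a state scored for a higher taxon is passed down to every unscored descendant, which is one plausible reading of the abstract and not the authors' published pipeline. The tree, the taxon labels (`clade_A`, `sp1`, ...), the states, and the function names are all hypothetical, and the ontological inference step is not modelled.

```python
from typing import Dict, List, Optional


def propagate_states(parent: Dict[str, Optional[str]],
                     scored: Dict[str, str]) -> Dict[str, str]:
    """Give each node its own score or, failing that, its nearest scored ancestor's."""
    resolved: Dict[str, str] = {}

    def lookup(node: str) -> Optional[str]:
        if node in resolved:
            return resolved[node]
        if node in scored:
            state: Optional[str] = scored[node]
        else:
            up = parent.get(node)
            state = lookup(up) if up is not None else None
        if state is not None:
            resolved[node] = state
        return state

    for node in parent:
        lookup(node)
    return resolved


def missing_fraction(species: List[str], states: Dict[str, str]) -> float:
    """Fraction of species that have no trait state."""
    return sum(1 for s in species if s not in states) / len(species)


# Hypothetical toy tree (child -> parent); the root has parent None.
PARENT = {
    "root": None,
    "clade_A": "root",
    "sp1": "clade_A",
    "sp2": "clade_A",
    "sp3": "root",
    "sp4": "root",
}
SPECIES = ["sp1", "sp2", "sp3", "sp4"]
# Hypothetical scores: one made at clade level, one at species level.
SCORED = {"clade_A": "absent", "sp3": "present"}

RESOLVED = propagate_states(PARENT, SCORED)
print(missing_fraction(SPECIES, SCORED))    # 0.75
print(missing_fraction(SPECIES, RESOLVED))  # 0.25
```

In this toy case `sp4` stays unscored because no ancestor carries a state, which mirrors the residual missing data the abstract reports after propagation.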
Affiliation(s)
- Laura M Jackson
  - Department of Biology, University of South Dakota, 414 East Clark St., Vermillion, SD 57069, USA
- Pasan C Fernando
  - Department of Biology, University of South Dakota, 414 East Clark St., Vermillion, SD 57069, USA
- Josh S Hanscom
  - Department of Biology, University of South Dakota, 414 East Clark St., Vermillion, SD 57069, USA
- James P Balhoff
  - Renaissance Computing Institute, University of North Carolina, 100 Europa Drive Suite 540, Chapel Hill, NC 27517, USA
- Paula M Mabee
  - Department of Biology, University of South Dakota, 414 East Clark St., Vermillion, SD 57069, USA
|
17
|
Franz NM, Zhang C, Lee J. A logic approach to modelling nomenclatural change. Cladistics 2018; 34:336-357. [PMID: 34645079 DOI: 10.1111/cla.12201] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Accepted: 03/10/2017] [Indexed: 11/27/2022] Open
Abstract
We utilize an Answer Set Programming (ASP) approach to show that the principles of nomenclature are tractable in computational logic. To this end we design a hypothetical, 20 nomenclatural taxon use case, with starting conditions that embody several overarching principles of the International Code of Zoological Nomenclature, including Binomial Nomenclature, Priority, Coordination, Homonymy, Typification and the structural requirement of Gender Agreement. The use case ending conditions are triggered by the reinterpretation of the diagnostic features of one of 12 type specimens anchoring the corresponding species-level epithets. Permutations of this child-to-parent reassignment action lead to 36 alternative scenarios, where each scenario requires a set of 1-14 logically contingent nomenclatural emendations. We show that an ASP transition system approach can correctly infer the Code-mandated changes for each scenario, and visually output the ending conditions. The results provide a foundation for further developing logic-based nomenclatural change optimization and validation services, which could be applied in global nomenclatural registries. More generally, logic explorations of nomenclatural and taxonomic change scenarios provide a novel means of assessing design biases inherent in the principles of nomenclature, and can therefore inform the design of future, big data-compatible identifier systems that recognize and mitigate these constraints.
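To give a feel for how one of the listed principles can be expressed as a rule, the sketch below encodes a simplified Principle of Priority: among competing names for the same taxon, the earliest published one is treated as valid. It is written in Python, not in Answer Set Programming, so it is a stand-in for the idea and not the authors' ASP encoding. The `Name` class, the epithets, and the years are hypothetical, and the other principles (Homonymy, Coordination, Gender Agreement) are not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Name:
    genus: str
    epithet: str
    year: int  # year of publication


def valid_name(competing: List[Name]) -> Name:
    """Simplified Priority: the oldest name wins (ties keep the first listed)."""
    if not competing:
        raise ValueError("at least one name is required")
    return min(competing, key=lambda n: n.year)


# Hypothetical case: two species-level names are judged to apply to one taxon.
SYNONYMS = [
    Name("Aus", "beta", 1901),
    Name("Aus", "alpha", 1850),
]
print(valid_name(SYNONYMS))  # Name(genus='Aus', epithet='alpha', year=1850)
```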
Affiliation(s)
- Nico M Franz
  - School of Life Sciences, Arizona State University, PO Box 874501, Tempe, AZ, 85287-4501, USA
- Chao Zhang
  - School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, PO Box 878809, Tempe, AZ, 85287-8809, USA
- Joohyung Lee
  - School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, PO Box 878809, Tempe, AZ, 85287-8809, USA
|
18
|
|
19
|
Mozzherin DY, Myltsev AA, Patterson DJ. "gnparser": a powerful parser for scientific names based on Parsing Expression Grammar. BMC Bioinformatics 2017; 18:279. [PMID: 28549446 PMCID: PMC5446698 DOI: 10.1186/s12859-017-1663-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 10/21/2016] [Accepted: 04/28/2017] [Indexed: 11/16/2022] Open
Abstract
Background: Scientific names in biology act as universal links. They allow us to cross-reference information about organisms globally. However, variations in the spelling of scientific names greatly diminish their ability to interconnect data. Such variations may include abbreviations, annotations, misspellings, etc. Authorship is a part of a scientific name and may also differ significantly. To match all possible variations of a name we need to divide them into their elements and classify each element according to its role. We refer to this as ‘parsing’ the name. Parsing categorizes a name’s elements into those that are stable and those that are prone to change. Names are matched first by combining them according to their stable elements. Matches are then refined by examining their varying elements. This two-stage process dramatically improves the number and quality of matches. It is especially useful for automatic data exchange within the context of “Big Data” in biology. Results: We introduce Global Names Parser (gnparser). It is a Java tool written in the Scala language (a language for the Java Virtual Machine) to parse scientific names. It is based on a Parsing Expression Grammar. The parser can be applied to scientific names of any complexity. It assigns a semantic meaning (such as genus name, species epithet, rank, year of publication, authorship, annotations, etc.) to all elements of a name. It is able to work with nested structures as in the names of hybrids. gnparser performs with ≈99% accuracy and processes 30 million name-strings/hour per CPU thread. The gnparser library is compatible with Scala, Java, R, Jython, and JRuby. The parser can be used as a command line application, as a socket server, a web-app or as a RESTful HTTP-service. It is released under an open-source MIT license. Conclusions: Global Names Parser (gnparser) is a fast, high-precision tool for biodiversity informaticians and biologists working with large numbers of scientific names. It can replace expensive and error-prone manual parsing and standardization of scientific names in many situations, and can quickly enhance the interoperability of distributed biological information.
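The two-stage matching idea in this abstract (group names by their stable elements, then refine by the varying ones) can be sketched as follows. This is a deliberately naive illustration that uses a regular expression in place of a Parsing Expression Grammar; it does not use or reproduce gnparser's API. The function names are hypothetical, the input is assumed to be a non-empty name-string, and the splitter would mishandle rank markers such as `var.` and authors with lowercase particles.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def split_name(name_string: str) -> Tuple[str, str]:
    """Split into (canonical form, authorship) with a naive token rule."""
    tokens = name_string.split()
    i = 1
    # Treat consecutive all-lowercase tokens after the genus as epithets.
    while i < len(tokens) and re.fullmatch(r"[a-z-]+", tokens[i]):
        i += 1
    return " ".join(tokens[:i]), " ".join(tokens[i:])


def normalise_authorship(authorship: str) -> str:
    """Lowercase and collapse punctuation so 'Linnaeus, 1758' equals 'Linnaeus 1758'."""
    return re.sub(r"[^a-z0-9]+", " ", authorship.lower()).strip()


def match_names(names: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """Stage 1: group by canonical form. Stage 2: subgroup by authorship."""
    groups: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for name in names:
        canonical, authorship = split_name(name)
        groups[canonical][normalise_authorship(authorship)].append(name)
    return {canonical: dict(sub) for canonical, sub in groups.items()}


NAMES = [
    "Homo sapiens Linnaeus, 1758",
    "Homo sapiens Linnaeus 1758",
    "Homo sapiens",
    "Apis mellifera Linnaeus, 1758",
]
for canonical, by_author in match_names(NAMES).items():
    print(canonical, by_author)
```

Here the three `Homo sapiens` strings fall into one stage-1 group, and stage 2 separates the string without authorship from the two that differ only in punctuation.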
Affiliation(s)
- Dmitry Y Mozzherin
  - University of Illinois, Illinois Natural History Survey, Species File Group, 1816 South Oak St., Champaign, 61820, IL, USA
|
20
|
Rees JA, Cranston K. Automated assembly of a reference taxonomy for phylogenetic data synthesis. Biodivers Data J 2017:e12581. [PMID: 28765728 PMCID: PMC5515096 DOI: 10.3897/bdj.5.e12581] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Received: 03/07/2017] [Accepted: 05/12/2017] [Indexed: 12/24/2022] Open
|
21
|
Abstract
The increasing growth of literature in biodiversity presents challenges to users who need to discover pertinent information in an efficient and timely manner. In response, text mining techniques offer solutions by facilitating the automated discovery of knowledge from large textual data. An important step in text mining is the recognition of concepts via their linguistic realisation, i.e., terms. However, a given concept may be referred to in text using various synonyms or term variants, making search systems likely to overlook documents mentioning less-known variants that are nonetheless relevant to a query term. Domain-specific terminological resources, which include term variants, synonyms and related terms, are thus important in supporting semantic search over large textual archives. This article describes the use of text mining methods for the automatic construction of a large-scale biodiversity term inventory. The inventory consists of names of species, amongst which naming variations are prevalent. We apply a number of distributional semantic techniques to all of the titles in the Biodiversity Heritage Library, to compute semantic similarity between species names and support the automated construction of the resource. With the construction of our biodiversity term inventory, we demonstrate that distributional semantic models are able to identify semantically similar names that are not yet recorded in existing taxonomies. Such methods can thus be used to update existing taxonomies semi-automatically by deriving semantically related taxonomic names from a text corpus and allowing expert curators to validate them. We also evaluate our inventory as a means to improve search by facilitating automatic query expansion. Specifically, we developed a visual search interface that suggests semantically related species names, which are available in our inventory but not always in other repositories, to incorporate into the search query. An assessment of the interface by domain experts reveals that our query expansion based on related names is useful for increasing the number of relevant documents retrieved. Its exploitation can benefit both users and developers of search engines and text mining applications.
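The abstract does not say which distributional semantic techniques were used, so the sketch below shows only the general idea with the simplest count-based variant: each name is represented by the words that occur near it, and two names are compared by the cosine of their context vectors. The corpus, the name tokens (`name_a`, `name_b`, `name_c`), the window size, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def context_vector(target: str, sentences: List[str], window: int = 2) -> Counter:
    """Count the words found within `window` tokens of each occurrence of `target`."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical corpus; each species name is pre-joined into a single token.
CORPUS = [
    "the larvae of name_a feed on oak leaves",
    "the larvae of name_b feed on oak bark",
    "name_c was collected from deep marine sediment",
]
VEC_A = context_vector("name_a", CORPUS)
VEC_B = context_vector("name_b", CORPUS)
VEC_C = context_vector("name_c", CORPUS)

print(cosine(VEC_A, VEC_B) > cosine(VEC_A, VEC_C))  # True
```

Names that share contexts score high and become candidates for query expansion, while names with disjoint contexts score zero; a curator would still need to validate any suggested pairing.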
|
22
|
Dietrich CH, Dmitriev DA. Insect phylogenetics in the digital age. Current Opinion in Insect Science 2016; 18:48-52. [PMID: 27939710 DOI: 10.1016/j.cois.2016.09.008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/15/2016] [Accepted: 09/21/2016] [Indexed: 06/06/2023]
Abstract
Insect systematists have long used digital data management tools to facilitate phylogenetic research. Web-based platforms developed over the past several years support creation of comprehensive, openly accessible data repositories and analytical tools that support large-scale collaboration, accelerating efforts to document Earth's biota and reconstruct the Tree of Life. New digital tools have the potential to further enhance insect phylogenetics by providing efficient workflows for capturing and analyzing phylogenetically relevant data. Recent initiatives streamline various steps in phylogenetic studies and provide community access to supercomputing resources. In the near future, automated, web-based systems will enable researchers to complete a phylogenetic study from start to finish using resources linked together within a single portal and incorporate results into a global synthesis.
Affiliation(s)
- Christopher H Dietrich
  - Illinois Natural History Survey, Prairie Research Institute, University of Illinois, 1816 S Oak St., Champaign, IL 61820, USA
- Dmitry A Dmitriev
  - Illinois Natural History Survey, Prairie Research Institute, University of Illinois, 1816 S Oak St., Champaign, IL 61820, USA
|
23
|
Franz N, Gilbert E, Ludäscher B, Weakley A. Controlling the taxonomic variable: Taxonomic concept resolution for a southeastern United States herbarium portal. Research Ideas and Outcomes 2016. [DOI: 10.3897/rio.2.e10610] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Indexed: 11/12/2022] Open
Abstract
Overview. Taxonomic names are imperfect identifiers of specific and sometimes conflicting taxonomic perspectives in aggregated biodiversity data environments. The inherent ambiguities of names can be mitigated using syntactic and semantic conventions developed under the taxonomic concept approach. These include: (1) representation of taxonomic concept labels (TCLs: name sec. source) to precisely identify name usages and meanings, (2) use of parent/child relationships to assemble separate taxonomic perspectives, and (3) expert provision of Region Connection Calculus articulations (RCC–5: congruence, [inverse] inclusion, overlap, exclusion) that specify how data identified to different-sourced TCLs can be integrated. Application of these conventions greatly increases trust in biodiversity data networks, most of which promote unitary taxonomic 'syntheses' that obscure the actual diversity of expert-held views. Better design solutions allow users to control the taxonomic variable and thereby assess the robustness of their biological inferences under different perspectives. A unique constellation of prior efforts – including the powerful Symbiota collections software platform, the Euler/X multi-taxonomy alignment toolkit, and the "Weakley Flora" which entails 7,000 concepts and more than 75,000 RCC–5 articulations – provides the opportunity to build a first full-scale concept resolution service for SERNEC, the SouthEast Regional Network of Expertise and Collections, currently with 60 member herbaria and 2 million occurrence records.
Intellectual merit. We have developed a multi-dimensional, step-wise plan to transition SERNEC's data culture from name- to concept-based practices. (1) We will engage SERNEC experts through annual, regional workshops and follow-up interactions that will foster buy-in and ultimately the completion of 12 community-identified use cases. (2) We will leverage RCC–5 data from the Weakley Flora and further development of the Euler/X logic reasoning toolkit to provide comprehensive genus- to variety-level concept alignments for at least 10 major flora treatments with highest relevance to SERNEC. The visualizations and estimated > 1 billion inferred concept-to-concept relations will effectively drive specimen data integration in the transformed portal. (3) We will expand Symbiota's taxonomy and occurrence schemas and related user interfaces to support the new concept data, including novel batch and map-based specimen determination modules, with easy output options in Darwin Core Archive format. (4) Through combinations of the new technology, enlisted taxonomic expertise, and SERNEC's large image resources, we will upgrade at least 80% of all SERNEC specimen identifications from names to the narrowest suitable TCLs, or add "uncertainty" flags to specimens needing further study. (5) We will utilize the novel tools and data to demonstrate how controlling for the taxonomic variable in 12 use cases variously drives the outcomes of evolutionary, ecological, and conservation-based research hypotheses.
Broader impacts. Our project is focused on just one herbarium network, but the potential impact is as wide as Darwin Core or even comparative biology. We believe that trust in networked biodiversity data depends on open and dynamic system designs, allowing expert access and resolution of multiple conflicting views that reflect the complex realities of ongoing taxonomic research. Taking well over 1 million SERNEC records from name- to TCL-resolution will show that "big" specimen data can pass the credibility threshold needed to validate the substantive data mobilization investment. We will mentor one postdoctoral researcher (UNC), two Ph.D. students (ASU, UIUC), and at least 15 undergraduate students (ASU). Each of our workshops will capacitate 10-15 SERNEC experts, who in turn can recruit colleagues and students at their home collections. We will incorporate the project theme and use cases into undergraduate courses taught at six institutions and reaching an estimated 300-500 students annually (10-40% minority students). At each institution, project members will make a systematic effort to recruit new students from underrepresented groups. Our group's leadership of Symbiota (with close ties to iDigBio), SERNEC, and local biodiversity projects and centers will further promote the new data culture. We will create a feature story "Where do plant species occur?" for ASU's popular "Ask A Biologist" website, and a series of undergraduate student-led "How-To" videos that illustrate the use case workflows, including the creation of multi-taxonomy alignments.
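The five RCC–5 relations named in the Overview (congruence, inclusion, inverse inclusion, overlap, exclusion) can be made concrete with a small set-based sketch. It assumes each taxonomic concept is represented by the set of specimens it circumscribes; in the project described, articulations are provided by experts and reasoned over with Euler/X, so this only illustrates what the relations mean. The specimen identifiers, the concept variables, and the convention that "inclusion" means the first concept contains the second are all assumptions made for the example.

```python
from typing import FrozenSet


def rcc5(a: FrozenSet[str], b: FrozenSet[str]) -> str:
    """Return the RCC-5 relation of concept `a` to concept `b`."""
    if not a or not b:
        raise ValueError("RCC-5 regions must be non-empty")
    if a == b:
        return "congruence"
    if a > b:  # a properly contains b
        return "inclusion"
    if a < b:  # a is properly contained in b
        return "inverse inclusion"
    if a & b:  # shared members, but neither contains the other
        return "overlap"
    return "exclusion"


# Hypothetical concepts, each labelled "name sec. source" in the TCL style.
BROAD = frozenset({"spec1", "spec2", "spec3"})   # e.g. "Aus bus sec. Source1"
NARROW = frozenset({"spec1", "spec2"})           # e.g. "Aus bus sec. Source2"
SHIFTED = frozenset({"spec3", "spec4"})
OTHER = frozenset({"spec9"})

print(rcc5(BROAD, NARROW))    # inclusion
print(rcc5(NARROW, BROAD))    # inverse inclusion
print(rcc5(BROAD, SHIFTED))   # overlap
print(rcc5(NARROW, OTHER))    # exclusion
print(rcc5(BROAD, BROAD))     # congruence
```

The first two calls show why names alone are ambiguous: the same name applied under two sources can label concepts of different extent, and only the articulation says how their records may be combined.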
|