1
Page R. Ten years and a million links: building a global taxonomic library connecting persistent identifiers for names, publications and people. Biodivers Data J 2023; 11:e107914. [PMID: 37745899 PMCID: PMC10514697 DOI: 10.3897/bdj.11.e107914] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/13/2023] [Accepted: 09/01/2023] [Indexed: 09/26/2023] Open
Abstract
A major gap in the biodiversity knowledge graph is a connection between taxonomic names and the taxonomic literature. While both names and publications often have persistent identifiers (PIDs), such as Life Science Identifiers (LSIDs) or Digital Object Identifiers (DOIs), LSIDs for names are rarely linked to DOIs for publications. This article describes efforts to make those connections across three large taxonomic databases: Index Fungorum, International Plant Names Index (IPNI) and the Index of Organism Names (ION). Over a million names have been matched to DOIs or other persistent identifiers for taxonomic publications. This represents approximately 36% of names for which publication data are available. The mappings between LSIDs and publication PIDs are made available through ChecklistBank. Applications of this mapping are discussed, including a web app to locate the citation of a taxonomic name and a knowledge graph that uses data on researcher ORCID ids to connect taxonomic names and publications to authors of those names.
Affiliation(s)
- Roderic Page
- University of Glasgow, Glasgow, United Kingdom
2
Conti M, Nimis PL, Martellos S. Match Algorithms for Scientific Names in FlorItaly, the Portal to the Flora of Italy. Plants 2021; 10:974. [PMID: 34068389 PMCID: PMC8153551 DOI: 10.3390/plants10050974] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/30/2021] [Revised: 05/06/2021] [Accepted: 05/08/2021] [Indexed: 11/21/2022]
Abstract
Scientific names are not part of everyday language in any modern country, and their input as strings in a query system can be easily associated with typographical errors. While globally unique identifiers univocally address a taxon name, they can hardly be used for querying a database manually. Thus, matching algorithms are often used to overcome misspelled names in query systems in several data repositories worldwide. In order to improve users’ experience in the use of FlorItaly, the Portal to the Flora of Italy, a near match algorithm to resolve misspelled scientific names has been integrated in the query systems. In addition, a novel tool in FlorItaly, capable of rapidly aligning any list of names to the nomenclatural backbone provided by the national checklists, has been developed. This manuscript aims at describing the potential of these new tools.
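The abstract does not say which near-match algorithm FlorItaly uses. As a rough illustration only, the sketch below resolves a misspelled name against a checklist using the similarity ratio from Python's standard `difflib` module. The checklist contents, the `near_match` function name and the 0.8 cutoff are assumptions made for this example, not details from the paper.

```python
from difflib import get_close_matches

# Hypothetical mini-checklist; a real nomenclatural backbone holds far more names.
CHECKLIST = ["Quercus ilex", "Quercus robur", "Fagus sylvatica", "Pinus pinea"]


def near_match(query, names=CHECKLIST, cutoff=0.8):
    """Return the checklist name closest to `query`, or None if none is similar enough."""
    # Compare case-insensitively, then map back to the canonical spelling.
    lookup = {name.lower(): name for name in names}
    hits = get_close_matches(query.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


if __name__ == "__main__":
    for typed in ["Quercus ilx", "fagus silvatica", "Zea mays"]:
        print(typed, "->", near_match(typed))
```

Aligning a whole list of names to a backbone, as the abstract describes, could be approximated by calling `near_match` once per name and flagging every `None` for manual review.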
3
Deisboeck TS, Zhang L, Martin S. Advancing Cancer Systems Biology: Introducing the Center for the Development of a Virtual Tumor, CViT. Cancer Inform 2017. [DOI: 10.1177/117693510700500001] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 11/17/2022] Open
Abstract
Integrative cancer biology research relies on a variety of data-driven computational modeling and simulation methods and techniques geared towards gaining new insights into the complexity of biological processes that are of critical importance for cancer research. These include the dynamics of gene-protein interaction networks, the percolation of sub-cellular perturbations across scales and the impact they may have on tumorigenesis in both experiments and clinics. Such innovative ‘systems’ research will greatly benefit from enabling Information Technology that is currently under development, including an online collaborative environment, a Semantic Web based computing platform that hosts data and model repositories as well as high-performance computing access. Here, we present one of the National Cancer Institute's recently established Integrative Cancer Biology Programs, i.e. the Center for the Development of a Virtual Tumor, CViT, which is charged with building a cancer modeling community, developing the aforementioned enabling technologies and fostering multi-scale cancer modeling and simulation.
Affiliation(s)
- Thomas S. Deisboeck
- Complex Biosystems Modeling Laboratory, Harvard-MIT (HST) Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, MA 02129
- Le Zhang
- Complex Biosystems Modeling Laboratory, Harvard-MIT (HST) Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, MA 02129
- Sean Martin
- IBM Advanced Internet Technology, Cambridge, MA 02142
4
Abstract
Taxonomic databases are perpetuating approaches to citing literature that may have been appropriate before the Internet, often being little more than digitised 5 × 3 index cards. Typically the original taxonomic literature is either not cited, or is represented in the form of a (typically abbreviated) text string. Hence much of the "deep data" of taxonomy, such as the original descriptions, revisions, and nomenclatural actions are largely hidden from all but the most resourceful users. At the same time there are burgeoning efforts to digitise the scientific literature, and much of this newly available content has been assigned globally unique identifiers such as Digital Object Identifiers (DOIs), which are also the identifier of choice for most modern publications. This represents an opportunity for taxonomic databases to engage with digitisation efforts. Mapping the taxonomic literature on to globally unique identifiers can be time consuming, but need be done only once. Furthermore, if we reuse existing identifiers, rather than mint our own, we can start to build the links between the diverse data that are needed to support the kinds of inference which biodiversity informatics aspires to support. Until this practice becomes widespread, the taxonomic literature will remain balkanized, and much of the knowledge that it contains will linger in obscurity.
Affiliation(s)
- Roderic D M Page
- Institute of Biodiversity, Animal Health, and Comparative Medicine, College of Medical, Veterinary, and Life Sciences, University of Glasgow, Glasgow G12 8QQ, UK
5
Stucky BJ, Deck J, Conlin T, Ziemba L, Cellinese N, Guralnick R. The BiSciCol Triplifier: bringing biodiversity data to the Semantic Web. BMC Bioinformatics 2014; 15:257. [PMID: 25073721 PMCID: PMC4124153 DOI: 10.1186/1471-2105-15-257] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 06/12/2014] [Accepted: 07/22/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Recent years have brought great progress in efforts to digitize the world's biodiversity data, but integrating data from many different providers, and across research domains, remains challenging. Semantic Web technologies have been widely recognized by biodiversity scientists for their potential to help solve this problem, yet these technologies have so far seen little use for biodiversity data. Such slow uptake has been due, in part, to the relative complexity of Semantic Web technologies along with a lack of domain-specific software tools to help non-experts publish their data to the Semantic Web. RESULTS The BiSciCol Triplifier is new software that greatly simplifies the process of converting biodiversity data in standard, tabular formats, such as Darwin Core-Archives, into Semantic Web-ready Resource Description Framework (RDF) representations. The Triplifier uses a vocabulary based on the popular Darwin Core standard, includes both Web-based and command-line interfaces, and is fully open-source software. CONCLUSIONS Unlike most other RDF conversion tools, the Triplifier does not require detailed familiarity with core Semantic Web technologies, and it is tailored to a widely popular biodiversity data format and vocabulary standard. As a result, the Triplifier can often fully automate the conversion of biodiversity data to RDF, thereby making the Semantic Web much more accessible to biodiversity scientists who might otherwise have relatively little knowledge of Semantic Web technologies. Easy availability of biodiversity data as RDF will allow researchers to combine data from disparate sources and analyze them with powerful linked data querying tools. 
However, before software like the Triplifier, and Semantic Web technologies in general, can reach their full potential for biodiversity science, the biodiversity informatics community must address several critical challenges, such as the widespread failure to use robust, globally unique identifiers for biodiversity data.
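To make the idea of converting tabular Darwin Core data to RDF concrete, here is a minimal sketch that turns each non-empty cell of a CSV table into one N-Triples statement. It is not the Triplifier's implementation. The subject base URI (`http://example.org/occurrence/`), the function names and the sample rows are hypothetical; only the Darwin Core term namespace is a real, published URI.

```python
import csv
import io

DWC = "http://rs.tdwg.org/dwc/terms/"    # Darwin Core term namespace
BASE = "http://example.org/occurrence/"  # hypothetical base URI for subjects


def escape_literal(value):
    """Escape characters that may not appear raw inside an N-Triples string literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def rows_to_ntriples(csv_text, id_column="occurrenceID"):
    """Turn each non-empty cell of a Darwin Core-style table into one N-Triples line."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # the ID becomes the subject; empty cells yield no triple
            triples.append(f'{subject} <{DWC}{column}> "{escape_literal(value)}" .')
    return triples


# Made-up sample rows, for illustration only.
SAMPLE = "occurrenceID,scientificName,country\nocc1,Quercus ilex,Italy\nocc2,Fagus sylvatica,\n"

if __name__ == "__main__":
    print("\n".join(rows_to_ntriples(SAMPLE)))
```

The identifier problem raised at the end of the abstract shows up directly here: the subjects are only as reusable as the values in the ID column, so locally scoped IDs such as `occ1` produce triples that cannot be linked to anyone else's data.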
Affiliation(s)
- Brian J Stucky
- Department of Ecology and Evolutionary Biology, University of Colorado, Boulder, Colorado, USA.
6
Page RDM. BioNames: linking taxonomy, texts, and trees. PeerJ 2013; 1:e190. [PMID: 24244913 PMCID: PMC3817598 DOI: 10.7717/peerj.190] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 08/28/2013] [Accepted: 10/08/2013] [Indexed: 11/20/2022] Open
Abstract
BioNames is a web database of taxonomic names for animals, linked to the primary literature and, wherever possible, to phylogenetic trees. It aims to provide a taxonomic “dashboard” where at a glance we can see a summary of the taxonomic and phylogenetic information we have for a given taxon and hence provide a quick answer to the basic question “what is this taxon?” BioNames combines classifications from the Global Biodiversity Information Facility (GBIF) and GenBank, images from the Encyclopedia of Life (EOL), animal names from the Index of Organism Names (ION), and bibliographic data from multiple sources including the Biodiversity Heritage Library (BHL) and CrossRef. The user interface includes display of full text articles, interactive timelines of taxonomic publications, and zoomable phylogenies. It is available at http://bionames.org.
Affiliation(s)
- Roderic D M Page
- Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, Graham Kerr Building, University of Glasgow, Glasgow, UK
7
Jones AC, White RJ, Orme ER. Identifying and relating biological concepts in the Catalogue of Life. J Biomed Semantics 2011; 2:7. [PMID: 22004596 PMCID: PMC3245425 DOI: 10.1186/2041-1480-2-7] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 01/31/2011] [Accepted: 10/17/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In this paper we describe our experience of adding globally unique identifiers to the Species 2000 and ITIS Catalogue of Life, an on-line index of organisms which is intended, ultimately, to cover all the world's known species. The scientific species names held in the Catalogue are names that already play an extensive role as terms in the organisation of information about living organisms in bioinformatics and other domains, but the effectiveness of their use is hindered by variation in individuals' opinions and understanding of these terms; indeed, in some cases more than one name will have been used to refer to the same organism. This means that it is desirable to be able to give unique labels to each of these differing concepts within the catalogue and to be able to determine which concepts are being used in other systems, in order that they can be associated with the concepts in the catalogue. Not only is this needed, but it is also necessary to know the relationships between alternative concepts that scientists might have employed, as these determine what can be inferred when data associated with related concepts is being processed. A further complication is that the catalogue itself is evolving as scientific opinion changes due to an increasing understanding of life. RESULTS We describe how we are using Life Science Identifiers (LSIDs) as globally unique identifiers in the Catalogue of Life, explaining how the mapping to species concepts is performed, how concepts are associated with specific editions of the catalogue, and how the Taxon Concept Schema has been adopted in order to express information about concepts and their relationships. 
We explore the implications of using globally unique identifiers in order to refer to abstract concepts such as species, which incorporate at least a measure of subjectivity in their definition, in contrast with the more traditional use of such identifiers to refer to more tangible entities, events, documents, observations, etc. CONCLUSIONS A major reason for adopting identifiers such as LSIDs is to facilitate data integration. We have demonstrated the incorporation of LSIDs into the Catalogue of Life, in a manner consistent with the biodiversity informatics community's conventions for LSID use. The Catalogue of Life is therefore available as a taxonomy of organisms for use within various disciplines, including biomedical research, by software written with an awareness of these conventions.
Affiliation(s)
- Andrew C Jones
- Cardiff School of Computer Science & Informatics, Cardiff University, Queen's Buildings, 5 The Parade, Cardiff CF24 3AA, UK.
8
Sreenivasaiah PK, Kim DH. Current trends and new challenges of databases and web applications for systems driven biological research. Front Physiol 2010; 1:147. [PMID: 21423387 PMCID: PMC3059952 DOI: 10.3389/fphys.2010.00147] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 08/31/2010] [Accepted: 10/18/2010] [Indexed: 12/17/2022] Open
Abstract
The dynamic and rapidly evolving nature of systems-driven research imposes special requirements on the technology, approach, design and architecture of computational infrastructure, including databases and Web applications. Several solutions have been proposed to meet these expectations, and novel methods have been developed to address the persisting problems of data integration. It is important for researchers to understand the different technologies and approaches. Having familiarized themselves with the pros and cons of the existing technologies, researchers can exploit their capabilities to the maximum for integrating data. In this review we discuss the architecture, design and key technologies underlying some of the prominent databases and Web applications. We describe their roles in the integration of biological data and investigate some of the emerging design concepts and computational technologies that are likely to play a key role in the future of systems-driven biomedical research.
Affiliation(s)
- Pradeep Kumar Sreenivasaiah
- Systems Biology Research Center and College of Life Science, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea
- Do Han Kim
- Systems Biology Research Center and College of Life Science, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea
9
Page RDM. bioGUID: resolving, discovering, and minting identifiers for biodiversity informatics. BMC Bioinformatics 2009; 10 Suppl 14:S5. [PMID: 19900301 PMCID: PMC2775151 DOI: 10.1186/1471-2105-10-s14-s5] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Indexed: 11/25/2022] Open
Abstract
Background Linking together the data of interest to biodiversity researchers (including specimen records, images, taxonomic names, and DNA sequences) requires services that can mint, resolve, and discover globally unique identifiers (including, but not limited to, DOIs, HTTP URIs, and LSIDs). Results bioGUID implements a range of services, the core ones being an OpenURL resolver for bibliographic resources, and a LSID resolver. The LSID resolver supports Linked Data-friendly resolution using HTTP 303 redirects and content negotiation. Additional services include journal ISSN look-up, author name matching, and a tool to monitor the status of biodiversity data providers. Conclusion bioGUID is available at . Source code is available from .
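The "Linked Data-friendly" resolution mentioned above can be pictured as a small decision: the server answers a request for an identifier with a 303 redirect, and the `Accept` header determines which document the client is sent to. The sketch below models only that decision as a pure function. The URL layout, the function name and the example LSID are hypothetical, since the abstract does not give bioGUID's actual endpoints.

```python
# Hypothetical URL layout; bioGUID's real endpoints are not given in the abstract.
RDF_URL = "http://example.org/data/{lsid}.rdf"
HTML_URL = "http://example.org/page/{lsid}"


def negotiate(lsid, accept_header):
    """Return (status, Location) for a request on an identifier URI.

    A 303 "See Other" tells the client that the identifier names a thing,
    and that a document describing it lives at the Location URL.
    """
    # Media types listed by the client; q-values are ignored for simplicity.
    wanted = [part.split(";")[0].strip().lower() for part in accept_header.split(",")]
    target = RDF_URL if "application/rdf+xml" in wanted else HTML_URL
    return 303, target.format(lsid=lsid)


if __name__ == "__main__":
    example = "urn:lsid:example.org:names:123"  # made-up LSID
    print(negotiate(example, "application/rdf+xml"))
    print(negotiate(example, "text/html,application/xhtml+xml;q=0.9"))
```

A production resolver would also honour q-values and further RDF serialisations; those are left out here to keep the redirect logic visible.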
Affiliation(s)
- Roderic D M Page
- Division of Environmental and Evolutionary Biology, Faculty of Biomedical and Life Sciences, Graham Kerr Building, University of Glasgow, Glasgow G12 8QQ, UK.
10
Marenco L, Ascoli GA, Martone ME, Shepherd GM, Miller PL. The NIF LinkOut broker: a web resource to facilitate federated data integration using NCBI identifiers. Neuroinformatics 2008; 6:219-27. [PMID: 18975149 DOI: 10.1007/s12021-008-9025-y] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 08/15/2008] [Accepted: 08/26/2008] [Indexed: 10/21/2022]
Abstract
This paper describes the NIF LinkOut Broker (NLB) that has been built as part of the Neuroscience Information Framework (NIF) project. The NLB is designed to coordinate the assembly of links to neuroscience information items (e.g., experimental data, knowledge bases, and software tools) that are (1) accessible via the Web, and (2) related to entries in the National Center for Biotechnology Information's (NCBI's) Entrez system. The NLB collects these links from each resource and passes them to the NCBI which incorporates them into its Entrez LinkOut service. In this way, an Entrez user looking at a specific Entrez entry can LinkOut directly to related neuroscience information. The information stored in the NLB can also be utilized in other ways. A second approach, which is operational on a pilot basis, is for the NLB Web server to create dynamically its own Web page of LinkOut links for each NCBI identifier in the NLB database. This approach can allow other resources (in addition to the NCBI Entrez) to LinkOut to related neuroscience information. The paper describes the current NLB system and discusses certain design issues that arose during its implementation.
Affiliation(s)
- Luis Marenco
- Department of Anesthesiology, Center for Medical Informatics, Yale University School of Medicine, New Haven, CT, 06520-8009, USA.
11
Page RDM. Biodiversity informatics: the challenge of linking data and the role of shared identifiers. Brief Bioinform 2008; 9:345-54. [PMID: 18445641 DOI: 10.1093/bib/bbn022] [Citation(s) in RCA: 80] [Impact Index Per Article: 4.7] [Indexed: 11/13/2022] Open
Abstract
A major challenge facing biodiversity informatics is integrating data stored in widely distributed databases. Initial efforts have relied on taxonomic names as the shared identifier linking records in different databases. However, taxonomic names have limitations as identifiers, being neither stable nor globally unique, and the pace of molecular taxonomic and phylogenetic research means that a lot of information in public sequence databases is not linked to formal taxonomic names. This review explores the use of other identifiers, such as specimen codes and GenBank accession numbers, to link otherwise disconnected facts in different databases. The structure of these links can also be exploited using the PageRank algorithm to rank the results of searches on biodiversity databases. The key to rich integration is a commitment to deploy and reuse globally unique, shared identifiers [such as Digital Object Identifiers (DOIs) and Life Science Identifiers (LSIDs)], and the implementation of services that link those identifiers.
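As an illustration of ranking records by their link structure, the following is a plain power-iteration PageRank over a tiny graph of records joined by shared identifiers. The graph, the node labels and the parameter values are invented for this sketch; the review's own data and settings are not reproduced here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping each node to the nodes it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for n in nodes:
                    new[n] += damping * rank[node] / len(nodes)
        rank = new
    return rank


# Hypothetical records joined by shared identifiers (specimen code, accession number).
LINKS = {
    "sequence:A": ["specimen:X"],
    "image:B": ["specimen:X"],
    "publication:C": ["specimen:X", "sequence:A"],
    "specimen:X": [],
}

if __name__ == "__main__":
    for node, score in sorted(pagerank(LINKS).items(), key=lambda kv: -kv[1]):
        print(f"{node}: {score:.3f}")
```

In this toy graph the specimen record ranks highest because a sequence, an image and a publication all point to it, which is the kind of signal a search over linked biodiversity records could use to order its results.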
Affiliation(s)
- Roderic D M Page
- Division of Environmental and Evolutional Biology, Institute of Biomedical and Life Sciences, University of Glasgow, Glasgow G12 8QQ, UK.
12
Xu Q, Shi Y, Lu Q, Zhang G, Luo Q, Li Y. GORouter: an RDF model for providing semantic query and inference services for Gene Ontology and its associations. BMC Bioinformatics 2008; 9 Suppl 1:S6. [PMID: 18315859 PMCID: PMC2259407 DOI: 10.1186/1471-2105-9-s1-s6] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/15/2022] Open
Abstract
Background The most renowned biological ontology, Gene Ontology (GO) is widely used for annotations of genes and gene products of different organisms. However, there are shortcomings in the Resource Description Framework (RDF) data file provided by the GO consortium: 1) Lack of sufficient semantic relationships between pairs of terms coming from the three independent GO sub-ontologies, that limit the power to provide complex semantic queries and inference services based on it. 2) The term-centric view of GO annotation data and the fact that all information is stored in a single file. This makes attempts to retrieve GO annotations based on big volume datasets unmanageable. 3) No support of GOSlim. Results We propose a RDF model, GORouter, which encodes heterogeneous original data in a uniform RDF format, creates additional ontology mappings between GO terms, and introduces a set of inference rulebases. Furthermore, we use the Oracle Network Data Model (NDM) as the native RDF data repository and the table function RDF_MATCH to seamlessly combine the result of RDF queries with traditional relational data. As a result, the scale of GORouter is minimized; information not directly involved in semantic inference is put into relational tables. Conclusion Our work demonstrates how to use multiple semantic web tools and techniques to provide a mixture of semantic query and inference solutions of GO and its associations. GORouter is licensed under Apache License Version 2.0, and is accessible via the website: .
Affiliation(s)
- Qingwei Xu
- The Key Laboratory of Biomedical Photonics of the Ministry of Education, HUST, Wuhan 430074, China.
13
LSID Tester, a tool for testing Life Science Identifier resolution services. Source Code for Biology and Medicine 2008; 3:2. [PMID: 18282290 PMCID: PMC2276318 DOI: 10.1186/1751-0473-3-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 01/18/2008] [Accepted: 02/18/2008] [Indexed: 11/30/2022]
Abstract
Background Life Science Identifiers (LSIDs) are persistent, globally unique identifiers for biological objects. The decentralised nature of LSIDs makes them attractive for identifying distributed resources. Data of interest to biodiversity researchers (including specimen records, images, taxonomic names, and DNA sequences) are distributed over many different providers, and this community has adopted LSIDs as the identifier of choice. Results LSID Tester is a web application written in PHP. Given a LSID the application performs seven tests, reporting the results at each step. If all tests are successful the metadata associated with the LSID is displayed, and can be viewed in a range of formats. Conclusion The software provides a tool for testing a LSID resolution service.
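The abstract does not list the seven tests, and the tool itself is written in PHP. The Python sketch below shows only what a syntax check might look like, on the assumption that some such check precedes resolution. It relies on the general LSID form `urn:lsid:authority:namespace:object[:revision]`; the function name and the sample identifiers are made up for the example.

```python
import re

# urn:lsid:<authority>:<namespace>:<object>[:<revision>]; the prefix is case-insensitive.
LSID_PATTERN = re.compile(
    r"^urn:lsid:(?P<authority>[^:\s]+):(?P<namespace>[^:\s]+)"
    r":(?P<object>[^:\s]+)(?::(?P<revision>[^:\s]+))?$",
    re.IGNORECASE,
)


def parse_lsid(text):
    """Split an LSID into its components, or return None if it is not well formed."""
    match = LSID_PATTERN.match(text.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    for candidate in ["urn:lsid:example.org:names:12345", "doi:10.1000/xyz"]:
        print(candidate, "->", parse_lsid(candidate))
```

The later steps of a tester, such as locating the authority's resolution service and fetching metadata, need network access and are deliberately left out of this sketch.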
14
Xu Q, Huang Y, Liu Q, Zhang G, Li Y, Lu Q. A Semantic Web model of GO and its annotations. Chinese Science Bulletin-Chinese 2008. [DOI: 10.1007/s11434-008-0137-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
15
Laurin M, Cantino PD. Second Meeting of the International Society for Phylogenetic Nomenclature: a Report. Zool Scr 2007. [DOI: 10.1111/j.1463-6409.2006.00268.x] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/28/2022]