1
Ramanan V, Mechery S, Sarkar IN. GenBank as a source to monitor and analyze Host-Microbiome data. Bioinformatics 2022; 38:4172-4177. [PMID: 35801940 PMCID: PMC9438952 DOI: 10.1093/bioinformatics/btac487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2022] [Revised: 06/08/2022] [Accepted: 07/07/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION: Microbiome datasets are often constrained by sequencing limitations. GenBank, maintained by the National Center for Biotechnology Information (NCBI), is the largest collection of publicly available DNA sequences. The metadata of GenBank records are a largely understudied resource and may be uniquely leveraged to access the sum of prior studies focused on microbiome composition. Here, we developed a computational pipeline to analyze GenBank metadata, containing data on hosts, microorganisms and their place of origin. This work provides the first opportunity to leverage the totality of GenBank to shed light on compositional data practices that shape how microbiome datasets are formed as well as examine host-microbiome relationships.
RESULTS: The collected dataset contains multiple kingdoms of microorganisms, consisting of bacteria, viruses, archaea, protozoa, fungi, and invertebrate parasites, and hosts of multiple taxonomical classes, including mammals, birds and fish. A human data subset of this dataset provides insights into gaps in current microbiome data collection, which is biased towards clinically relevant pathogens. Clustering and phylogenetic analysis reveals the potential to use these data to model host taxonomy and evolution, revealing groupings formed by host diet, environment and coevolution.
AVAILABILITY AND IMPLEMENTATION: GenBank Host-Microbiome Pipeline is available at https://github.com/bcbi/genbank_holobiome. The GenBank loader is available at https://github.com/bcbi/genbank_loader.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
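To make the idea of host and place-of-origin metadata concrete, the following is a minimal Python sketch of pulling source qualifiers out of GenBank flat-file text. It is not the authors' pipeline: the sample record, the regular expression, and the `source_metadata` helper are all hypothetical, and the sketch assumes the record carries `/organism`, `/host` and `/country` qualifiers, which many records omit.

```python
import re

# Hypothetical excerpt in GenBank flat-file style; the values are invented
# for illustration and do not come from a real record.
SAMPLE_RECORD = '''\
FEATURES             Location/Qualifiers
     source          1..1450
                     /organism="Escherichia coli"
                     /host="Homo sapiens"
                     /country="USA: Rhode Island"
'''

# Matches indented qualifier lines of the form /key="value".
QUALIFIER = re.compile(r'^\s+/(\w+)="([^"]*)"', re.MULTILINE)


def source_metadata(record_text, wanted=("organism", "host", "country")):
    """Return the first value seen for each wanted qualifier."""
    found = {}
    for key, value in QUALIFIER.findall(record_text):
        if key in wanted and key not in found:
            found[key] = value
    return found


metadata = source_metadata(SAMPLE_RECORD)
print(metadata)
```

A regular expression keeps the sketch dependency-free; for real records, a dedicated GenBank parser would likely handle wrapped values and missing qualifiers more robustly.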
Affiliation(s)
- Vivek Ramanan
  - Center of Computational Molecular Biology, Brown University, Providence, RI, USA
  - Center for Biomedical Informatics, Brown University, Providence, RI, USA
- Shanti Mechery
  - Center for Biomedical Informatics, Brown University, Providence, RI, USA
- Indra Neil Sarkar
  - Center of Computational Molecular Biology, Brown University, Providence, RI, USA
  - Center for Biomedical Informatics, Brown University, Providence, RI, USA
  - Rhode Island Quality Institute, Providence, RI, USA
2
Acheson E, Purves RS. Extracting and modeling geographic information from scientific articles. PLoS One 2021; 16:e0244918. [PMID: 33406109 PMCID: PMC7787447 DOI: 10.1371/journal.pone.0244918] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/02/2019] [Accepted: 12/20/2020] [Indexed: 11/29/2022]
Abstract
Scientific articles often contain relevant geographic information such as where field work was performed or where patients were treated. Most often, this information appears in the full-text article contents as a description in natural language including place names, with no accompanying machine-readable geographic metadata. Automatically extracting this geographic information could help conduct meta-analyses, find geographical research gaps, and retrieve articles using spatial search criteria. Research on this problem is still in its infancy, with many works manually processing corpora for locations and few cross-domain studies. In this paper, we develop a fully automatic pipeline to extract and represent relevant locations from scientific articles, applying it to two varied corpora. We obtain good performance, with full pipeline precision of 0.84 for an environmental corpus, and 0.78 for a biomedical corpus. Our results can be visualized as simple global maps, allowing human annotators to both explore corpus patterns in space and triage results for downstream analysis. Future work should not only focus on improving individual pipeline components, but also be informed by user needs derived from the potential spatial analysis and exploration of such corpora.
Affiliation(s)
- Elise Acheson
  - Department of Geography, University of Zurich, Zurich, Switzerland
- Ross S. Purves
  - Department of Geography, University of Zurich, Zurich, Switzerland
3
Magge A, Weissenbacher D, Sarker A, Scotch M, Gonzalez-Hernandez G. Deep neural networks and distant supervision for geographic location mention extraction. Bioinformatics 2019; 34:i565-i573. [PMID: 29950020 PMCID: PMC6022665 DOI: 10.1093/bioinformatics/bty273] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/03/2022]
Abstract
Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER.
Results: Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER’s capability to embed external features to further boost the system’s performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.
Affiliation(s)
- Arjun Magge
  - Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, USA
  - Biodesign Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, AZ, USA
- Davy Weissenbacher
  - Department of Biostatistics, Epidemiology, and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Abeed Sarker
  - Department of Biostatistics, Epidemiology, and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Matthew Scotch
  - Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, USA
  - Biodesign Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, AZ, USA
- Graciela Gonzalez-Hernandez
  - Department of Biostatistics, Epidemiology, and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
4
Tahsin T, Weissenbacher D, O'Connor K, Magge A, Scotch M, Gonzalez-Hernandez G. GeoBoost: accelerating research involving the geospatial metadata of virus GenBank records. Bioinformatics 2019; 34:1606-1608. [PMID: 29240889 DOI: 10.1093/bioinformatics/btx799] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 08/04/2017] [Accepted: 12/11/2017] [Indexed: 11/13/2022]
Abstract
Summary: GeoBoost is a command-line software package developed to address sparse or incomplete metadata in GenBank sequence records that relate to the location of the infected host (LOIH) of viruses. Given a set of GenBank accession numbers corresponding to virus GenBank records, GeoBoost extracts, integrates and normalizes geographic information reflecting the LOIH of the viruses using integrated information from GenBank metadata and related full-text publications. In addition, to facilitate probabilistic geospatial modeling, GeoBoost assigns probability scores for each possible LOIH.
Availability and implementation: Binaries and resources required for running GeoBoost are packed into a single zipped file and freely available for download at https://tinyurl.com/geoboost. A video tutorial is included to help users quickly and easily install and run the software. The software is implemented in Java 1.8, and supported on MS Windows and Linux platforms.
Contact: gragon@upenn.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tasnia Tahsin
  - Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA
- Davy Weissenbacher
  - Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA
- Karen O'Connor
  - Institute of Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Arjun Magge
  - Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA
- Matthew Scotch
  - Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA
  - Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA
- Graciela Gonzalez-Hernandez
  - Institute of Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
5
Magge A, Weissenbacher D, Sarker A, Scotch M, Gonzalez-Hernandez G. Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature. Pacific Symposium on Biocomputing 2019; 24:100-111. [PMID: 30864314 PMCID: PMC6417823] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Abstract
Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. Insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) in the scientific article associated with the sequence, and disambiguating the locations to their co-ordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, disambiguation accuracy of 91% and an overall resolution F1 score of 0.88 that are significantly higher than previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.
Affiliation(s)
- Arjun Magge
  - College of Health Solutions, Arizona State University, Tempe, AZ 85281, USA
  - Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA
- Davy Weissenbacher
  - Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Abeed Sarker
  - Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Matthew Scotch
  - College of Health Solutions, Arizona State University, Tempe, AZ 85281, USA
  - Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA
- Graciela Gonzalez-Hernandez
  - Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
6
Weissenbacher D, Sarker A, Tahsin T, Scotch M, Gonzalez G. Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods. AMIA Joint Summits on Translational Science Proceedings 2017; 2017:114-122. [PMID: 28815119 PMCID: PMC5543364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]
Abstract
The field of phylogeography allows researchers to model the spread and evolution of viral genetic sequences. Phylogeography plays a major role in infectious disease surveillance, viral epidemiology and vaccine design. When conducting viral phylogeographic studies, researchers require the location of the infected host of the virus, which is often present in public databases such as GenBank. However, the geographic metadata in most GenBank records is not precise enough for many phylogeographic studies; therefore, researchers often need to search the articles linked to the records for more information, which can be a tedious process. Here, we describe two approaches for automatically detecting geographic location mentions in articles pertaining to virus-related GenBank records: a supervised sequence labeling approach with innovative features and a distant-supervision approach with novel noise-reduction methods. Evaluated on a manually annotated gold standard, our supervised sequence labeling and distant supervision approaches attained F-scores of 0.81 and 0.66, respectively.
Affiliation(s)
- Abeed Sarker
  - University of Pennsylvania, Philadelphia, Pennsylvania, USA
7
Pérez-Pérez M, Pérez-Rodríguez G, Fdez-Riverola F, Lourenço A. Collaborative relation annotation and quality analysis in Markyt environment. Database: The Journal of Biological Databases and Curation 2017; 2017:4693828. [PMID: 29220479 PMCID: PMC5737204 DOI: 10.1093/database/bax090] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/18/2017] [Accepted: 11/09/2017] [Indexed: 11/30/2022]
Abstract
Text mining is showing potential to help in biomedical knowledge integration and discovery at various levels. However, results depend largely on the specifics of the knowledge problem and, in particular, on the ability to produce high-quality benchmarking corpora that may support the training and evaluation of automatic prediction systems. Annotation tools enabling the flexible and customizable production of such corpora are thus pivotal. The open-source Markyt annotation environment brings together the latest web technologies to offer a wide range of annotation capabilities in a domain-agnostic way. It enables the management of multi-user and multi-round annotation projects, including inter-annotator agreement and consensus assessments. Also, Markyt supports the description of entity and relation annotation guidelines on a project basis, being flexible to partial word tagging and the occurrence of annotation overlaps. This paper describes the current release of Markyt, namely new annotation perspectives, which enable the annotation of relations among entities, and enhanced analysis capabilities. Several demos, inspired by public biomedical corpora, are presented as a means to better illustrate such functionalities. Markyt aims to bring together annotation capabilities of broad interest to those producing annotated corpora. Markyt demonstration projects describe 20 different annotation tasks of varied document sources (e.g. abstracts, tweets or drug labels) and languages (e.g. English, Spanish or Chinese). Continuous development is based on feedback from practical applications as well as community reports on short- and medium-term mining challenges. Markyt is freely available for non-commercial use at http://markyt.org.
Database URL: http://markyt.org
Affiliation(s)
- Martín Pérez-Pérez
  - ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
- Gael Pérez-Rodríguez
  - ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
- Florentino Fdez-Riverola
  - ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
- Anália Lourenço
  - ESEI-Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas S/N 32004, Ourense, Spain
  - CINBIO-Centro de Investigaciones Biomédicas, University of Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo, Spain
  - CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
8
Tahsin T, Weissenbacher D, Jones-Shargani D, Magee D, Vaiente M, Gonzalez G, Scotch M. Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research. Database (Oxford) 2017; 2017:4781736. [PMID: 30412219 PMCID: PMC6225896 DOI: 10.1093/database/bax093] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 06/03/2017] [Revised: 11/20/2017] [Accepted: 11/21/2017] [Indexed: 02/06/2023]
Abstract
Database URL: https://zodo.asu.edu/zoophydb/
Affiliation(s)
- Tasnia Tahsin
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Davy Weissenbacher
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
  - Biodesign Center for Environmental Health Engineering, Arizona State University, 781 E Terrace Mall, Tempe, AZ 85281, USA
- Demetrius Jones-Shargani
  - Biodesign Center for Environmental Health Engineering, Arizona State University, 781 E Terrace Mall, Tempe, AZ 85281, USA
- Daniel Magee
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
  - Biodesign Center for Environmental Health Engineering, Arizona State University, 781 E Terrace Mall, Tempe, AZ 85281, USA
- Matteo Vaiente
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
  - Biodesign Center for Environmental Health Engineering, Arizona State University, 781 E Terrace Mall, Tempe, AZ 85281, USA
- Graciela Gonzalez
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
  - Institute of Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, PA 19104, USA
- Matthew Scotch
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
  - Biodesign Center for Environmental Health Engineering, Arizona State University, 781 E Terrace Mall, Tempe, AZ 85281, USA
9
Tahsin T, Weissenbacher D, Rivera R, Beard R, Firago M, Wallstrom G, Scotch M, Gonzalez G. A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records. J Am Med Inform Assoc 2016; 23:934-41. [PMID: 26911818 PMCID: PMC4997033 DOI: 10.1093/jamia/ocv172] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 07/21/2015] [Revised: 10/22/2015] [Accepted: 10/22/2015] [Indexed: 01/09/2023]
Abstract
OBJECTIVE: The metadata reflecting the location of the infected host (LOIH) of virus sequences in GenBank often lacks specificity. This work seeks to enhance this metadata by extracting more specific geographic information from related full-text articles and mapping them to their latitude/longitudes using knowledge derived from external geographical databases.
MATERIALS AND METHODS: We developed a rule-based information extraction framework for linking GenBank records to the latitude/longitudes of the LOIH. Our system first extracts existing geospatial metadata from GenBank records and attempts to improve it by seeking additional, relevant geographic information from text and tables in related full-text PubMed Central articles. The final extracted locations of the records, based on data assimilated from these sources, are then disambiguated and mapped to their respective geo-coordinates. We evaluated our approach on a manually annotated dataset comprising 5728 GenBank records for the influenza A virus.
RESULTS: We found the precision, recall, and F-measure of our system for linking GenBank records to the latitude/longitudes of their LOIH to be 0.832, 0.967, and 0.894, respectively.
DISCUSSION: Our system had a high level of accuracy for linking GenBank records to the geo-coordinates of the LOIH. However, it can be further improved by expanding our database of geospatial data, incorporating spell correction, and enhancing the rules used for extraction.
CONCLUSION: Our system performs reasonably well for linking GenBank records for the influenza A virus to the geo-coordinates of their LOIH based on record metadata and information extracted from related full-text articles.
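As a quick arithmetic check on the reported figures, the sketch below recomputes the F-measure from the stated precision and recall. It assumes the abstract's F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the `f_measure` helper is a hypothetical name used only for this illustration.

```python
def f_measure(precision, recall):
    """Balanced F1: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract above.
score = f_measure(0.832, 0.967)
print(round(score, 3))  # 0.894
```

Under that assumption the reported precision and recall reproduce the reported F-measure of 0.894.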
Affiliation(s)
- Tasnia Tahsin
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Davy Weissenbacher
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Robert Rivera
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Rachel Beard
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Mari Firago
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Garrick Wallstrom
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Matthew Scotch
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Graciela Gonzalez
  - Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA