1
Karkera N, Acharya S, Palaniappan SK. Leveraging pre-trained language models for mining microbiome-disease relationships. BMC Bioinformatics 2023; 24:290. [PMID: 37468830 PMCID: PMC10357883 DOI: 10.1186/s12859-023-05411-z] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 05/22/2023] [Accepted: 07/13/2023] [Indexed: 07/21/2023] Open
Abstract
BACKGROUND The growing recognition of the microbiome's impact on human health and well-being has prompted extensive research into discovering the links between microbiome dysbiosis and disease (healthy) states. However, this valuable information is scattered in unstructured form within biomedical literature. The structured extraction and qualification of microbe-disease interactions are important. In parallel, recent advancements in deep-learning-based natural language processing algorithms have revolutionized language-related tasks such as ours. This study aims to leverage state-of-the-art deep-learning language models to extract microbe-disease relationships from biomedical literature. RESULTS In this study, we first evaluate multiple pre-trained large language models within a zero-shot or few-shot learning context. In this setting, the models performed poorly out of the box, emphasizing the need for domain-specific fine-tuning of these language models. Subsequently, we fine-tune multiple language models (specifically, GPT-3, BioGPT, BioMedLM, BERT, BioMegatron, PubMedBERT, BioClinicalBERT, and BioLinkBERT) using labeled training data and evaluate their performance. Our experimental results demonstrate the state-of-the-art performance of these fine-tuned models (specifically GPT-3, BioMedLM, and BioLinkBERT), achieving an average F1 score, precision, and recall of over [Formula: see text] compared to the previous best of 0.74. CONCLUSION Overall, this study establishes that pre-trained language models excel as transfer learners when fine-tuned with domain and problem-specific data, enabling them to achieve state-of-the-art results even with limited training data for extracting microbiome-disease interactions from scientific publications.
Affiliation(s)
- Sathwik Acharya
  - The Systems Biology Institute, Tokyo, Japan
  - PES University, Bengaluru, India
- Sucheendra K Palaniappan
  - The Systems Biology Institute, Tokyo, Japan
  - Iom Bioworks Pvt Ltd., Bengaluru, India
  - SBX Corporation, Tokyo, Japan
2
Liu Q, Lee B, Xie L. Small molecule modulation of microbiota: a systems pharmacology perspective. BMC Bioinformatics 2022; 23:403. [PMID: 36175827 PMCID: PMC9523894 DOI: 10.1186/s12859-022-04941-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Accepted: 09/16/2022] [Indexed: 11/16/2022] Open
Abstract
BACKGROUND Microbes are associated with many human diseases and influence drug efficacy. Small-molecule drugs may revolutionize biomedicine by fine-tuning the microbiota on the basis of individual patient microbiome signatures. However, emerging endeavors in small-molecule microbiome drug discovery continue to follow a conventional "one-drug-one-target-one-disease" process. A systematic pharmacology approach that would suppress multiple interacting pathogenic species in the microbiome could offer an attractive alternative solution. RESULTS We construct a disease-centric signed microbe-microbe interaction network using curated microbe metabolite information and their effects on the host. We develop a Signed Random Walk with Restart algorithm for the accurate prediction of the effect of microbes on human health and diseases. With a survey of the druggable and evolutionary space of microbe proteins, we find that 8-10% of them can be targeted by existing drugs or drug-like chemicals and that 25% of them have homologs to human proteins. We demonstrate that drugs for diabetes can be the lead compounds for development of microbiota-targeted therapeutics. We further show that the potential drug targets that specifically exist in pathogenic microbes are periplasmic and cellular outer membrane proteins. CONCLUSION The systematic studies of the polypharmacological landscape of the microbiome network may open a new avenue for the small-molecule drug discovery of the microbiome. We believe that the application of a systematic method to polypharmacological investigation could lead to the discovery of novel drug therapies.
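The abstract does not describe how the Signed Random Walk with Restart works, so the following is only a generic sketch of the idea on a signed network, not the authors' algorithm. It assumes a walker that carries a sign, flips it whenever it crosses a negative (inhibitory) edge, and returns to a seed node with a fixed restart probability. The function name, the node names and the restart value are all hypothetical.

```python
def signed_rwr(edges, seed, restart=0.15, iters=100):
    """Generic signed random walk with restart (illustrative only).

    edges: dict mapping node -> list of (neighbor, sign), sign in {+1, -1}.
    Returns a dict node -> (positive mass - negative mass) relative to seed.
    """
    nodes = sorted(set(edges) | {v for nbrs in edges.values() for v, _ in nbrs})
    pos = {n: 1.0 if n == seed else 0.0 for n in nodes}
    neg = {n: 0.0 for n in nodes}
    for _ in range(iters):
        # The restart mass always re-enters at the seed with a positive sign.
        new_pos = {n: restart if n == seed else 0.0 for n in nodes}
        new_neg = {n: 0.0 for n in nodes}
        for u in nodes:
            nbrs = edges.get(u, [])
            if not nbrs:
                continue  # nodes without outgoing edges pass nothing on
            share = (1.0 - restart) / len(nbrs)
            for v, sign in nbrs:
                if sign > 0:
                    # Positive edge: the walker keeps its current sign.
                    new_pos[v] += share * pos[u]
                    new_neg[v] += share * neg[u]
                else:
                    # Negative edge: the walker's sign is flipped.
                    new_pos[v] += share * neg[u]
                    new_neg[v] += share * pos[u]
        pos, neg = new_pos, new_neg
    return {n: pos[n] - neg[n] for n in nodes}


# Hypothetical toy network: A promotes B and inhibits C.
toy_edges = {
    "microbe_A": [("microbe_B", +1), ("microbe_C", -1)],
    "microbe_B": [("microbe_A", +1)],
    "microbe_C": [("microbe_A", -1)],
}
scores = signed_rwr(toy_edges, seed="microbe_A")
```

In this toy network the score of `microbe_B` is positive and that of `microbe_C` is negative relative to the seed, which is the kind of signed ranking such a walk is meant to produce.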
Affiliation(s)
- Qiao Liu
  - Department of Computer Science, Hunter College, The City University of New York, New York, NY, USA
- Bohyun Lee
  - Ph.D. Program in Computer Science, The City University of New York, New York, NY, USA
- Lei Xie
  - Department of Computer Science, Hunter College, The City University of New York, New York, NY, USA
  - Ph.D. Program in Computer Science, The City University of New York, New York, NY, USA
  - Ph.D. Program in Biochemistry and Biology, The City University of New York, New York, NY, USA
  - Helen and Robert Appel Alzheimer's Disease Research Institute, Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, Cornell University, New York, NY, USA
3
Ahmed SAJA, Bapatdhar N, Kumar BP, Ghosh S, Yachie A, Palaniappan SK. Large scale text mining for deriving useful insights: A case study focused on microbiome. Front Physiol 2022; 13:933069. [PMID: 36117696 PMCID: PMC9473635 DOI: 10.3389/fphys.2022.933069] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/30/2022] [Accepted: 07/18/2022] [Indexed: 11/23/2022] Open
Abstract
Text mining has been shown to be an auxiliary but key driver for modeling, data harmonization, and interpretation in bio-medicine. Scientific literature holds a wealth of information, embodies cumulative knowledge, and remains the core basis on which mechanistic pathways, molecular databases, and models are built and refined. Text mining provides the necessary tools to automatically harness the potential of text. In this study, we show the potential of large-scale text mining for deriving novel insights, with a focus on the growing field of microbiome. We first collected the complete set of abstracts relevant to the microbiome from PubMed and used our text mining and intelligence platform Taxila for analysis. We demonstrate the usefulness of text mining using two case studies. First, we analyze the geographical distribution of research and study locations for the field of microbiome by extracting geo mentions from text. Using this analysis, we were able to draw useful insights on the state of research in microbiome with respect to geographical distributions and economic drivers. Next, to understand the relationships between diseases, microbiome, and food, we construct semantic relationship networks between these concepts, which are central to the field of microbiome. We show how such networks can be used to derive useful insight with no prior knowledge encoded.
Affiliation(s)
- Samik Ghosh
  - SBX Corporation Inc., Tokyo, Japan
  - The NLP Group, The Systems Biology Institute, Tokyo, Japan
- Ayako Yachie
  - SBX Corporation Inc., Tokyo, Japan
  - The NLP Group, The Systems Biology Institute, Tokyo, Japan
- Sucheendra K. Palaniappan
  - SBX Corporation Inc., Tokyo, Japan
  - The NLP Group, The Systems Biology Institute, Tokyo, Japan
  - *Correspondence: Sucheendra K. Palaniappan,
4
Restrepo S, ter Horst E, Zambrano JD, Gunn LH, Molina G, Salazar CA. Hierarchical Bayesian classification methods to identify topics by journal quartile with an application in biological sciences. Education for Information 2022. [DOI: 10.3233/efi-211546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
This manuscript builds on a novel, automatic, freely-available Bayesian approach to extract information in abstracts and titles to classify research topics by quartile. This approach is demonstrated for all N= 149,129 ISI-indexed publications in biological sciences journals during 2017. A Bayesian multinomial inverse regression approach is used to extract rankings of topics without the need of a pre-defined dictionary. Bigrams are used for extraction of research topics across manuscripts, and rankings of research topics are constructed by quartile. Worldwide and local results (e.g., comparison between two peer/aspirational research institutions in Colombia) are provided, and differences are explored both at the global and local levels. Some topics persist across quartiles, while the relevance of others is quartile-specific. Challenges in sustainable development appear as more prevalent in top quartile journals across institutions, while the two Colombian institutions favour plant and microorganism research. This approach can reduce information inequities, by allowing young/incipient researchers in biological sciences, especially within lower income countries or universities with limited resources, to freely assess the state of the literature and the relative likelihood of publication in higher impact journals by research topic. This can also serve institutions of higher education to identify missing research topics and areas of competitive advantage.
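The bigram step mentioned in this abstract can be illustrated with a minimal sketch. It covers only the counting of adjacent word pairs in titles or abstracts and does not attempt the Bayesian multinomial inverse regression itself. The function name, the simple tokenizer and the sample texts are assumptions made for illustration.

```python
import re
from collections import Counter


def bigram_counts(texts):
    """Count adjacent word pairs (bigrams) across a collection of texts."""
    counts = Counter()
    for text in texts:
        # Deliberately simple tokenizer: lowercase alphabetic runs only.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical sample titles, not taken from the study's corpus.
sample = ["Gut microbiome diversity", "gut microbiome and disease"]
counts = bigram_counts(sample)
top_bigram, top_count = counts.most_common(1)[0]
```

Counts of this kind would then be the input features for whatever ranking model is applied per journal quartile.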
Affiliation(s)
- Laura H. Gunn
  - University of North Carolina at Charlotte & Imperial College London, USA
5
Djemiel C, Maron PA, Terrat S, Dequiedt S, Cottin A, Ranjard L. Inferring microbiota functions from taxonomic genes: a review. Gigascience 2022; 11:giab090. [PMID: 35022702 PMCID: PMC8756179 DOI: 10.1093/gigascience/giab090] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Received: 10/04/2021] [Revised: 12/02/2021] [Accepted: 12/02/2021] [Indexed: 12/13/2022] Open
Abstract
Deciphering microbiota functions is crucial to predict ecosystem sustainability in response to global change. High-throughput sequencing at the individual or community level has revolutionized our understanding of microbial ecology, leading to the big data era and improving our ability to link microbial diversity with microbial functions. Recent advances in bioinformatics have been key for developing functional prediction tools based on DNA metabarcoding data and using taxonomic gene information. This approach, which is cheaper in every respect, serves as an alternative to shotgun sequencing. Although these tools are increasingly used by ecologists, an objective evaluation of their modularity, portability, and robustness is lacking. Here, we reviewed 100 scientific papers on functional inference and ecological trait assignment to rank the advantages, specificities, and drawbacks of these tools, using scientific benchmarking. To date, inference tools have been mainly devoted to bacterial functions, and ecological trait assignment tools, to fungal functions. A major limitation is the lack of reference genomes (compared with the human microbiota), especially for complex ecosystems such as soils. Finally, we explore applied research prospects. These tools are promising and already provide relevant information on ecosystem functioning, but the standardized indicators and corresponding repositories that would enable them to be used for operational diagnosis are still lacking.
Affiliation(s)
- Christophe Djemiel
  - Agroécologie, AgroSup Dijon, INRAE, Université de Bourgogne, Université de Bourgogne Franche-Comté, F-21000 Dijon, France
- Pierre-Alain Maron
  - Agroécologie, AgroSup Dijon, INRAE, Université de Bourgogne, Université de Bourgogne Franche-Comté, F-21000 Dijon, France
- Sébastien Terrat
  - Agroécologie, AgroSup Dijon, INRAE, Université de Bourgogne, Université de Bourgogne Franche-Comté, F-21000 Dijon, France
- Samuel Dequiedt
  - Agroécologie, AgroSup Dijon, INRAE, Université de Bourgogne, Université de Bourgogne Franche-Comté, F-21000 Dijon, France
- Aurélien Cottin
  - Agroécologie, AgroSup Dijon, INRAE, Université de Bourgogne, Université de Bourgogne Franche-Comté, F-21000 Dijon, France
- Lionel Ranjard
  - Agroécologie, AgroSup Dijon, INRAE, Université de Bourgogne, Université de Bourgogne Franche-Comté, F-21000 Dijon, France
6
Cappellato M, Baruzzo G, Patuzzi I, Di Camillo B. Modeling Microbial Community Networks: Methods and Tools. Curr Genomics 2021; 22:267-290. [PMID: 35273458 PMCID: PMC8822226 DOI: 10.2174/1389202921999200905133146] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/29/2020] [Revised: 07/22/2020] [Accepted: 07/29/2020] [Indexed: 11/22/2022] Open
Abstract
In the current research landscape, microbiota composition studies are of extreme interest, since it has been widely shown that resident microorganisms affect and shape the ecological niche they inhabit. This complex micro-world is characterized by different types of interactions. Understanding these relationships provides a useful tool for decoding the causes and effects of communities' organizations. Next-Generation Sequencing technologies make it possible to reconstruct the internal composition of the whole microbial community present in a sample. Sequencing data can then be investigated through statistical and computational methods from network theory to infer the network of interactions among microbial species. Since there are several network inference approaches in the literature, in this paper we tried to shed light on their main characteristics and challenges, providing a useful tool not only to those interested in using the methods, but also to those who want to develop new ones. In addition, we focused on the frameworks used to produce synthetic data, starting from the simulation of network structures up to their integration with abundance models, with the aim of clarifying the key points of the entire generative process.
Affiliation(s)
- Barbara Di Camillo
  - Address correspondence to this author at the Department of Information Engineering, University of Padova, Padova, Italy; E-mail:
7
Gupta G, Ndiaye A, Filteau M. Leveraging Experimental Strategies to Capture Different Dimensions of Microbial Interactions. Front Microbiol 2021; 12:700752. [PMID: 34646243 PMCID: PMC8503676 DOI: 10.3389/fmicb.2021.700752] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/26/2021] [Accepted: 08/31/2021] [Indexed: 12/27/2022] Open
Abstract
Microorganisms are a fundamental part of virtually every ecosystem on earth. Understanding how they collectively interact, assemble, and function as communities has become a prevalent topic in both fundamental and applied research. Owing to multiple advances in technology, answering questions at the microbial system or network level is now within our grasp. To map and characterize microbial interaction networks, numerous computational approaches have been developed; however, experimentally validating microbial interactions is no trivial task. Microbial interactions are context-dependent, and their complex nature can result in an array of outcomes, not only in terms of fitness or growth, but also in other relevant functions and phenotypes. Thus, approaches to experimentally capture microbial interactions involve a combination of culture methods and phenotypic or functional characterization methods. Here, from our perspective as food microbiologists, we highlight the breadth of innovative and promising experimental strategies for their potential to capture the different dimensions of microbial interactions and their high-throughput application to answer the question: are microbial interaction patterns or network architecture similar along different contextual scales? We further discuss the experimental approaches used to build various types of networks and study their architecture in the context of cell biology and how they translate at the level of the microbial ecosystem.
Affiliation(s)
- Gunjan Gupta
  - Département des Sciences des aliments, Université Laval, Québec, QC, Canada
  - Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Québec, QC, Canada
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, QC, Canada
- Amadou Ndiaye
  - Département des Sciences des aliments, Université Laval, Québec, QC, Canada
  - Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Québec, QC, Canada
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, QC, Canada
- Marie Filteau
  - Département des Sciences des aliments, Université Laval, Québec, QC, Canada
  - Institut sur la Nutrition et les Aliments Fonctionnels (INAF), Québec, QC, Canada
  - Institut de Biologie Intégrative et des Systèmes (IBIS), Université Laval, Québec, QC, Canada
8
Park Y, Lee J, Moon H, Choi YS, Rho M. Discovering microbe-disease associations from the literature using a hierarchical long short-term memory network and an ensemble parser model. Sci Rep 2021; 11:4490. [PMID: 33627732 PMCID: PMC7904816 DOI: 10.1038/s41598-021-83966-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/09/2020] [Accepted: 02/08/2021] [Indexed: 02/07/2023] Open
Abstract
With recent advances in biotechnology and sequencing technology, the microbial community has been intensively studied and discovered to be associated with many chronic as well as acute diseases. Even though a tremendous number of studies describing the association between microbes and diseases have been published, text mining methods that focus on such associations have rarely been studied. We propose a framework that combines machine learning and natural language processing methods to analyze the association between microbes and diseases. A hierarchical long short-term memory network was used to detect sentences that describe the association. For the sentences detected, two different parse tree-based search methods were combined to find the relation-describing word. The ensemble model of constituency parsing for structural pattern matching and dependency-based relation extraction improved the prediction accuracy. By combining deep learning and parse tree-based extractions, our proposed framework could extract the microbe-disease association with higher accuracy. The evaluation results showed that our system achieved an F-score of 0.8764 and 0.8524 in binary decisions and extracting relation words, respectively. As a case study, we performed a large-scale analysis of the association between microbes and diseases. Additionally, a set of common microbes shared by multiple diseases was also identified in this study. This study could provide valuable information on the major microbes that were studied for a specific disease. The code and data are available at https://github.com/DMnBI/mdi_predictor.
Affiliation(s)
- Yesol Park
  - Department of Computer Science and Engineering, Hanyang University, Seoul, Korea
- Joohong Lee
  - Department of Computer Science and Engineering, Hanyang University, Seoul, Korea
- Heesang Moon
  - Department of Computer Science and Engineering, Hanyang University, Seoul, Korea
- Yong Suk Choi
  - Department of Computer Science and Engineering, Hanyang University, Seoul, Korea
- Mina Rho
  - Department of Computer Science and Engineering, Hanyang University, Seoul, Korea
  - Department of Biomedical Informatics, Hanyang University, Seoul, Korea
9
Yan S, Wong KC. Context awareness and embedding for biomedical event extraction. Bioinformatics 2020; 36:637-643. [PMID: 31392318 DOI: 10.1093/bioinformatics/btz607] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 09/23/2018] [Revised: 07/26/2019] [Accepted: 08/06/2019] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Biomedical event extraction is fundamental for information extraction in molecular biology and biomedical research. The detected events form the central basis for comprehensive biomedical knowledge fusion, facilitating the digestion of massive information influx from the literature. Limited by the event context, the existing event detection models are mostly applicable to a single task. A general and scalable computational model is desired for biomedical knowledge management. RESULTS We consider and propose a bottom-up detection framework to identify the events from recognized arguments. To capture the relations between the arguments, we trained a bidirectional long short-term memory network to model their context embedding. Leveraging the compositional attributes, we further derived the candidate samples for training event classifiers. We built our models on the datasets from BioNLP Shared Task for evaluations. Our method achieved average F-scores of 0.81 and 0.92 on BioNLPST-BGI and BioNLPST-BB datasets, respectively. Compared with seven state-of-the-art methods, our method nearly doubled the existing F-score performance (0.92 versus 0.56) on the BioNLPST-BB dataset. Case studies were conducted to reveal the underlying reasons. AVAILABILITY AND IMPLEMENTATION https://github.com/cskyan/evntextrc. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Shankai Yan
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR 999077
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR 999077
10
Zhang Y, Liu T, Chen L, Yang J, Yin J, Zhang Y, Yun Z, Xu H, Ning L, Guo F, Jiang Y, Lin H, Wang D, Huang Y, Huang J. RIscoper: a tool for RNA-RNA interaction extraction from the literature. Bioinformatics 2020; 35:3199-3202. [PMID: 30668649 DOI: 10.1093/bioinformatics/btz044] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 10/31/2018] [Revised: 01/09/2019] [Accepted: 01/15/2019] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Numerous experimental and computational studies in the biomedical literature have provided considerable amounts of data on diverse RNA-RNA interactions (RRIs). However, few text mining systems for RRI information extraction are available. RESULTS RNA Interactome Scoper (RIscoper) represents the first tool for full-scale RNA interactome scanning and was developed for extracting RRIs from the literature based on the N-gram model. Notably, a reliable RRI corpus was integrated in RIscoper, and more than 13 300 manually curated sentences with RRI information were collected. RIscoper allows users to upload full texts or abstracts and provides an online search tool that is connected with PubMed (PMID and keyword input); these capabilities are useful for biologists. RIscoper has a strong performance (90.4% precision and 93.9% recall), integrates natural language processing techniques and has a reliable RRI corpus. AVAILABILITY AND IMPLEMENTATION The standalone software and web server of RIscoper are freely available at www.rna-society.org/riscoper/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
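Several entries in this list report F-scores while this one reports only precision and recall. As a rough aid to comparison, the two can be combined with the standard F1 formula (the harmonic mean of precision and recall). The resulting value is derived here and is not a figure reported in the abstract; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as stated in the abstract (90.4% and 93.9%).
riscoper_f1 = f1_score(0.904, 0.939)
print(round(riscoper_f1, 3))  # prints 0.921
```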
Affiliation(s)
- Yang Zhang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Tianyuan Liu
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Liqun Chen
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Jinxurong Yang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Jiayi Yin
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Yuncong Zhang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Zhixi Yun
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Hao Xu
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Lin Ning
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Fengbiao Guo
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Yongshuai Jiang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Hao Lin
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Dong Wang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
  - Department of Bioinformatics, School of Basic Medical Science, Southern Medical University, Guangzhou, China
- Yan Huang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
  - Department of Bioinformatics, School of Basic Medical Science, Southern Medical University, Guangzhou, China
- Jian Huang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
11
Li X, Fu C, Zhong R, Zhong D, He T, Jiang X. A hybrid deep learning framework for bacterial named entity recognition with domain features. BMC Bioinformatics 2019; 20:583. [PMID: 31787075 PMCID: PMC6886245 DOI: 10.1186/s12859-019-3071-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Microbes have been shown to play a crucial role in various ecosystems. Many human diseases have been proven to be associated with bacteria, so it is essential to extract the interactions between bacteria for medical research and application. At the same time, many bacterial interactions with experimental evidence have been reported in biomedical literature. Integrating this knowledge into a database or knowledge graph could accelerate the progress of biomedical research. A crucial and necessary step in interaction extraction (IE) is named entity recognition (NER). However, due to the specificity of bacterial naming, there are still challenges in bacterial named entity recognition. RESULTS In this paper, we propose a novel method for bacterial named entity recognition, which integrates domain features into a deep learning framework combining a bidirectional long short-term memory network and a convolutional neural network. When domain features are not added, the F1-measure of the model reaches 89.14%. After part-of-speech (POS) features and dictionary features are added, the F1-measure of the model reaches 89.7%. Hence, our model achieves an advanced performance in bacterial NER with the domain features. CONCLUSIONS We propose an efficient method for bacterial named entity recognition which combines domain features and deep learning models. Compared with previous methods, the performance of our model is improved. At the same time, the effort of complex manual extraction and feature design is significantly reduced.
Affiliation(s)
- Xusheng Li
  - School of Computer, Central China Normal University, Wuhan, Hubei, China
  - Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, Hubei, China
- Chengcheng Fu
  - School of Computer, Central China Normal University, Wuhan, Hubei, China
  - Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, Hubei, China
- Ran Zhong
  - Collaborative & Innovation Center, Central China Normal University, Wuhan, Hubei, China
- Duo Zhong
  - School of Computer, Central China Normal University, Wuhan, Hubei, China
  - Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, Hubei, China
- Tingting He
  - School of Computer, Central China Normal University, Wuhan, Hubei, China
  - Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, Hubei, China
- Xingpeng Jiang
  - School of Computer, Central China Normal University, Wuhan, Hubei, China
  - Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, Hubei, China
12
Badal VD, Wright D, Katsis Y, Kim HC, Swafford AD, Knight R, Hsu CN. Challenges in the construction of knowledge bases for human microbiome-disease associations. Microbiome 2019; 7:129. [PMID: 31488215 PMCID: PMC6728997 DOI: 10.1186/s40168-019-0742-2] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 01/03/2019] [Accepted: 08/20/2019] [Indexed: 05/05/2023]
Abstract
The last few years have seen tremendous growth in human microbiome research, with a particular focus on the links to both mental and physical health and disease. Medical and experimental settings provide initial sources of information about these links, but individual studies produce disconnected pieces of knowledge bounded in context by the perspective of expert researchers reading full-text publications. Building a knowledge base (KB) consolidating these disconnected pieces is an essential first step to democratize and accelerate the process of accessing the collective discoveries of human disease connections to the human microbiome. In this article, we survey the existing tools and development efforts that have been produced to capture portions of the information needed to construct a KB of all known human microbiome-disease associations and highlight the need for additional innovations in natural language processing (NLP), text mining, taxonomic representations, and field-wide vocabulary standardization in human microbiome research. Addressing these challenges will enable the construction of KBs that help identify new insights amenable to experimental validation and potentially clinical decision support.
Affiliation(s)
- Varsha Dave Badal
  - Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
- Dustin Wright
  - Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
  - Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
- Yannis Katsis
  - Scalable Knowledge Intelligence, IBM Research-Almaden, 650 Harry Road, San Jose, CA 95120 USA
- Ho-Cheol Kim
  - Scalable Knowledge Intelligence, IBM Research-Almaden, 650 Harry Road, San Jose, CA 95120 USA
- Austin D. Swafford
  - Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
- Rob Knight
  - Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
  - Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
  - UCSD Health Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
  - Department of Bioengineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
- Chun-Nan Hsu
  - Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
  - Department of Neurosciences and Center for Research in Biological Systems, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093 USA
13
|
Li C, Chng KR, Kwah JS, Av-Shalom TV, Tucker-Kellogg L, Nagarajan N. An expectation-maximization algorithm enables accurate ecological modeling using longitudinal microbiome sequencing data. MICROBIOME 2019; 7:118. [PMID: 31439018 PMCID: PMC6706891 DOI: 10.1186/s40168-019-0729-z] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 02/26/2019] [Accepted: 08/13/2019] [Indexed: 05/05/2023]
Abstract
BACKGROUND The dynamics of microbial communities is driven by a range of interactions from symbiosis to predator-prey relationships, the majority of which are poorly understood. With the increasing availability of high-throughput microbiome taxonomic profiling data, it is now conceivable to directly learn the ecological models that explicitly define microbial interactions and explain community dynamics. The applicability of these approaches is severely limited by the lack of accurate absolute cell density measurements (biomass). METHODS We present a new computational approach that resolves this key limitation in the inference of generalized Lotka-Volterra models (gLVMs) by coupling biomass estimation and model inference with an expectation-maximization algorithm (BEEM). RESULTS BEEM outperforms the state-of-the-art methods for inferring gLVMs, while simultaneously eliminating the need for additional experimental biomass data as input. BEEM's application to previously inaccessible public datasets (due to the lack of biomass data) allowed us to construct ecological models of microbial communities in the human gut on a per-individual basis, revealing personalized dynamics and keystone species. CONCLUSIONS BEEM addresses a key bottleneck in "systems analysis" of microbiomes by enabling accurate inference of ecological models from high throughput sequencing data without the need for experimental biomass measurements.
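For orientation, a generalized Lotka-Volterra model is commonly written as below. The notation is standard textbook notation and is an assumption here, since the abstract itself gives no formula. It shows why absolute cell density (biomass) matters: the model is defined on absolute abundances, while sequencing yields only relative abundances.

```latex
% x_i(t): absolute abundance of taxon i; \mu_i: intrinsic growth rate;
% \beta_{ij}: effect of taxon j on the growth of taxon i; p: number of taxa.
\[
  \frac{dx_i(t)}{dt} = x_i(t)\Bigl(\mu_i + \sum_{j=1}^{p} \beta_{ij}\, x_j(t)\Bigr),
  \qquad i = 1, \dots, p .
\]
% Sequencing observes relative abundances \tilde{x}_i(t), related to the
% absolute abundances through the total biomass m(t):
\[
  \tilde{x}_i(t) = \frac{x_i(t)}{m(t)}, \qquad m(t) = \sum_{j=1}^{p} x_j(t) .
\]
```

Under this reading, estimating $\mu_i$ and $\beta_{ij}$ from relative abundances alone requires some estimate of $m(t)$, which is the quantity the abstract says is estimated jointly with the model parameters.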
Affiliation(s)
- Chenhao Li
  - Computational and Systems Biology, Genome Institute of Singapore, Singapore, 138672 Singapore
  - School of Computing, National University of Singapore, Singapore, 117543 Singapore
- Kern Rei Chng
  - Computational and Systems Biology, Genome Institute of Singapore, Singapore, 138672 Singapore
- Junmei Samantha Kwah
  - Computational and Systems Biology, Genome Institute of Singapore, Singapore, 138672 Singapore
- Tamar V. Av-Shalom
  - Computational and Systems Biology, Genome Institute of Singapore, Singapore, 138672 Singapore
  - Department of Microbiology and Immunology, University of British Columbia, Vancouver, V6T 1Z3 Canada
  - Department of Computer Science, University of British Columbia, Vancouver, V6T 1Z4 Canada
- Lisa Tucker-Kellogg
  - Centre for Computational Biology, Duke–NUS Graduate Medical School, Singapore, 169857 Singapore
- Niranjan Nagarajan
  - Computational and Systems Biology, Genome Institute of Singapore, Singapore, 138672 Singapore
  - School of Computing, National University of Singapore, Singapore, 117543 Singapore
  - Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 119228 Singapore
|
14
|
Wang X, Li Y, He T, Jiang X, Hu X. Recognition of bacteria named entity using conditional random fields in Spark. BMC SYSTEMS BIOLOGY 2018; 12:106. [PMID: 30463540 PMCID: PMC6249713 DOI: 10.1186/s12918-018-0625-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 12/24/2022]
Abstract
Background Microbes play a crucial role in the functional mechanisms of an ecosystem. Identifying the interactions among microbes is an important step towards understanding the structure and function of microbial communities, as well as the impact of microbes on human health and disease. Despite its importance, there is currently no gold-standard dataset of microbial interactions. Traditional approaches such as growth and co-culture analysis must be performed in the laboratory and are time-consuming and costly. By providing predicted candidate interactions for experimental verification, computational methods can alleviate this problem. Mining microbial interactions from large volumes of medical text is one such computational method. Identifying bacteria named entities and related entities in the text is the basis for microbial relation extraction. In previous work, a bacteria named entity recognition system based on a dictionary and conditional random fields was proposed. However, it is inefficient when dealing with large-scale text. Results We implemented bacteria named entity recognition on the Spark platform and designed comparison experiments to verify the correctness and validity of the proposed system. The experimental results show that it achieves a higher F-measure in the correctness comparison. Moreover, prediction on large-scale biomedical datasets is much faster than in the previous version, with computational efficiency improved by a factor of about 3.1 to 6.7. Conclusions The system for bacteria named entity recognition resolves the inefficiency of the previously proposed system on large-scale datasets. The proposed system performs well in both accuracy and scalability.
Affiliation(s)
- Xiaoyan Wang
  - School of Computer, Central China Normal University, Wuhan, Hubei, China
- Yichuan Li
  - School of Computer, Central China Normal University, Wuhan, Hubei, China
- Tingting He
  - School of Computer, Central China Normal University, Wuhan, Hubei, China
- Xingpeng Jiang
  - School of Computer, Central China Normal University, Wuhan, Hubei, China
- Xiaohua Hu
  - School of Computer, Central China Normal University, Wuhan, Hubei, China
  - College of Computing and Informatics, Drexel University, Philadelphia, PA, USA
|
15
|
Liang D, Leung RKK, Guan W, Au WW. Involvement of gut microbiome in human health and disease: brief overview, knowledge gaps and research opportunities. Gut Pathog 2018; 10:3. [PMID: 29416567 PMCID: PMC5785832 DOI: 10.1186/s13099-018-0230-4] [Citation(s) in RCA: 129] [Impact Index Per Article: 18.4] [Received: 10/25/2017] [Accepted: 01/16/2018] [Indexed: 02/06/2023] Open
Abstract
The commensal, symbiotic, and pathogenic microbial community that resides inside our body and on our skin (the human microbiome) can perturb host energy metabolism and immunity, and thus significantly influence the development of a variety of human diseases. The field has therefore attracted unprecedented attention in the last decade. Although a large amount of data has been generated, many questions remain unanswered and no universal agreement has been reached on how the microbiome affects human health. Consequently, this review was written to provide an updated overview of the rapidly expanding field, with a focus on revealing knowledge gaps and research opportunities. Specifically, the review covers animal physiology, the optimal microbiome standard, health intervention by manipulating the microbiome, knowledge base building by text mining, microbiota community structure and its implications in human diseases, and health monitoring by analyzing the microbiome in the blood. The review should enhance interest in conducting novel microbiota investigations that will further improve health and therapy.
Affiliation(s)
- Dachao Liang
  - Division of Genomics and Bioinformatics, CUHK-BGI Innovation Institute of Trans-omics Hong Kong, Hong Kong SAR, China
- Ross Ka-Kit Leung
  - State Key Laboratory of Respiratory Disease, National Clinical Research Center for Respiratory Disease, First Affiliated Hospital of Guangzhou Medical University, Guangzhou, Guangdong China
- Wenda Guan
  - State Key Laboratory of Respiratory Disease, National Clinical Research Center for Respiratory Disease, First Affiliated Hospital of Guangzhou Medical University, Guangzhou, Guangdong China
- William W Au
  - University of Medicine and Pharmacy, Tirgu Mures, Romania
  - Shantou University Medical College, Shantou, China
|
16
|
Lo C, Marculescu R. MPLasso: Inferring microbial association networks using prior microbial knowledge. PLoS Comput Biol 2017; 13:e1005915. [PMID: 29281638 PMCID: PMC5760079 DOI: 10.1371/journal.pcbi.1005915] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 06/25/2017] [Revised: 01/09/2018] [Accepted: 12/05/2017] [Indexed: 01/21/2023] Open
Abstract
Due to recent advances in high-throughput sequencing technologies, it has become possible to directly analyze microbial communities in the human body and the environment. To understand how microbial communities adapt, develop, and interact with the human body and the surrounding environment, one of the fundamental challenges is to infer the interactions among different microbes. However, due to the compositional and high-dimensional nature of microbial data, statistical inference cannot offer reliable results. Consequently, new approaches that can accurately and robustly estimate the associations (putative interactions) among microbes are needed to analyze such compositional and high-dimensional data. We propose a novel framework called Microbial Prior Lasso (MPLasso), which integrates a graph learning algorithm with microbial co-occurrences and associations obtained from the scientific literature by automated text mining. We show that MPLasso outperforms existing models in terms of accuracy, microbial network recovery rate, and reproducibility. Furthermore, the association networks we obtain from the Human Microbiome Project datasets show credible results when compared against laboratory data.
Affiliation(s)
- Chieh Lo
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Radu Marculescu
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
|
17
|
Jiang X, Hu X. Data Analysis for Gut Microbiota and Health. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2017; 1028:79-87. [PMID: 29058217 DOI: 10.1007/978-981-10-6041-0_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Abstract
In recent years, data mining and analysis of high-throughput microbiome sequencing and metagenomic data have enabled researchers to discover biological knowledge by characterizing the composition and variation of species across environmental samples, and to accumulate a huge amount of data, making it feasible to infer the complex principles of species interactions. The interactions of microbes in a microbial community play an important role in a microbial ecological system. Data mining provides diverse approaches to identify the correlations between disease and microbes and how microbial species coexist and interact in a host-associated or natural environment. This is important not only for advancing basic microbiology and related fields but also for understanding the impact of microbial communities on human health and disease.
Affiliation(s)
- Xingpeng Jiang
  - School of Computer, Central China Normal University, Wuhan, Hubei, 430079, China
- Xiaohua Hu
  - School of Computer, Central China Normal University, Wuhan, Hubei, 430079, China
  - College of Computing & Informatics, Drexel University, Philadelphia, PA, 19104, USA
|
18
|
Golestan Hashemi FS, Razi Ismail M, Rafii Yusop M, Golestan Hashemi MS, Nadimi Shahraki MH, Rastegari H, Miah G, Aslani F. Intelligent mining of large-scale bio-data: Bioinformatics applications. BIOTECHNOL BIOTEC EQ 2017. [DOI: 10.1080/13102818.2017.1364977] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/05/2023] Open
Affiliation(s)
- Farahnaz Sadat Golestan Hashemi
  - Plant Genetics, AgroBioChem Department, Gembloux Agro-Bio Tech, University of Liege, Liege, Belgium
  - Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Mohd Razi Ismail
  - Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
  - Department of Crop Science, Faculty of Agriculture, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Mohd Rafii Yusop
  - Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
  - Department of Crop Science, Faculty of Agriculture, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Mahboobe Sadat Golestan Hashemi
  - Department of Software Engineering, Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Isfahan, Iran
  - Big Data Research Center, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Mohammad Hossein Nadimi Shahraki
  - Department of Software Engineering, Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Isfahan, Iran
  - Big Data Research Center, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Hamid Rastegari
  - Department of Software Engineering, Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Gous Miah
  - Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Farzad Aslani
  - Department of Crop Science, Faculty of Agriculture, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
|
19
|
Renganathan V. Text Mining in Biomedical Domain with Emphasis on Document Clustering. Healthc Inform Res 2017; 23:141-146. [PMID: 28875048 PMCID: PMC5572517 DOI: 10.4258/hir.2017.23.3.141] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 03/21/2017] [Revised: 07/16/2017] [Accepted: 07/17/2017] [Indexed: 12/19/2022] Open
Abstract
Objectives With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. Methods This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. Results Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. Conclusions Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise.
|
20
|
Abstract
As we all know, the microbiota show remarkable variability within individuals. At the same time, the microorganisms living in the human body play a very important role in health and disease, so identifying the relationships between microbes and diseases will contribute to a better understanding of microbial interactions and functional mechanisms. However, while sequencing technologies yield large amounts of microbial data, very few associations between diseases and microbes are known. In bioinformatics, many researchers use network topology analysis to address such problems. Inspired by this idea, we propose a new method for prioritizing candidate microbes to predict potential disease-microbe associations. First, we connected the disease network and the microbe network using known disease-microbe relationships to construct a heterogeneous network; we then extended the random walk to the heterogeneous network and evaluated the method using leave-one-out cross-validation and ROC curves. The algorithm can effectively uncover potential associations between diseases and microbes that cannot be found using the microbe network or the disease network alone. Furthermore, we studied three representative diseases (type 2 diabetes, asthma, and psoriasis) and presented the potential microbes associated with each by ranking candidate disease-causing microbes. We confirmed that discovering new associations can support the understanding of disease mechanisms, diagnosis, and therapy.
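The network construction and ranking step described in this abstract can be illustrated with a minimal random-walk-with-restart sketch. It is not the authors' implementation. The function name `rwr_heterogeneous`, the restart probability of 0.7, and the plain column normalization of the combined adjacency matrix are assumptions made for illustration; the abstract does not specify how transitions within and between the two networks are weighted.

```python
import numpy as np

def rwr_heterogeneous(A_d, A_m, B, seed_disease, restart=0.7,
                      tol=1e-10, max_iter=1000):
    """Rank microbes for one disease by random walk with restart.

    A_d: (n_d, n_d) disease-disease adjacency (hypothetical input)
    A_m: (n_m, n_m) microbe-microbe adjacency (hypothetical input)
    B:   (n_d, n_m) known disease-microbe associations
    """
    n_d, n_m = B.shape
    # Join the two networks through the known associations.
    A = np.block([[A_d, B], [B.T, A_m]]).astype(float)
    # Column-normalize so each column is a probability distribution
    # (isolated nodes keep an all-zero column).
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    W = A / col_sums
    # The walker starts, and restarts, at the query disease.
    p0 = np.zeros(n_d + n_m)
    p0[seed_disease] = 1.0
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (W @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            p = p_next
            break
        p = p_next
    # Steady-state probabilities of the microbe nodes serve as scores.
    return p[n_d:]

# Toy example: 2 diseases, 3 microbes.
A_d = np.array([[0, 1], [1, 0]])
A_m = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
B = np.array([[1, 0, 0], [0, 0, 1]])
scores = rwr_heterogeneous(A_d, A_m, B, seed_disease=0)
ranking = np.argsort(-scores)  # candidate microbes, best first
```

In a leave-one-out evaluation of the kind the abstract mentions, one known entry of `B` would be set to zero before the walk, and the rank of the held-out microbe would be recorded.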
|