1
Zhang Y, Sui X, Pan F, Yu K, Li K, Tian S, Erdengasileng A, Han Q, Wang W, Wang J, Wang J, Sun D, Chung H, Zhou J, Zhou E, Lee B, Zhang P, Qiu X, Zhao T, Zhang J. A comprehensive large scale biomedical knowledge graph for AI powered data driven biomedical research. bioRxiv: the preprint server for biology 2025:2023.10.13.562216. [PMID: 38168218 PMCID: PMC10760044 DOI: 10.1101/2023.10.13.562216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2024]
Abstract
To address the rapid growth of scientific publications and data in biomedical research, knowledge graphs (KGs) have become a critical tool for integrating large volumes of heterogeneous data to enable efficient information retrieval and automated knowledge discovery (AKD). However, transforming unstructured scientific literature into KGs remains a significant challenge, with previous methods unable to achieve human-level accuracy. In this study, we utilized an information extraction pipeline that won first place in the LitCoin NLP Challenge (2022) to construct a large-scale KG named iKraph using all PubMed abstracts. The extracted information matches human expert annotations and significantly exceeds the content of manually curated public databases. To enhance the KG's comprehensiveness, we integrated relation data from 40 public databases and relation information inferred from high-throughput genomics data. This KG facilitates rigorous performance evaluation of AKD, which was infeasible in previous studies. We designed an interpretable, probabilistic-based inference method to identify indirect causal relations and applied it to real-time COVID-19 drug repurposing from March 2020 to May 2023. Our method identified 600-1400 candidate drugs per month, with one-third of those discovered in the first two months later supported by clinical trials or PubMed publications. These outcomes are very challenging to attain through alternative approaches that lack a thorough understanding of the existing literature. A cloud-based platform (https://biokde.insilicom.com) was developed for academic users to access this rich structured data and associated tools.
Affiliation(s)
- Yuan Zhang
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
  - Insilicom LLC, Tallahassee, FL 32303
- Xin Sui
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Feng Pan
  - Insilicom LLC, Tallahassee, FL 32303
- Keqiao Li
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Shubo Tian
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Qing Han
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Wanjing Wang
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
- Jian Wang
  - 977 Wisteria Ter., Sunnyvale, CA 94086
- Jun Zhou
  - Insilicom LLC, Tallahassee, FL 32303
- Eric Zhou
  - Insilicom LLC, Tallahassee, FL 32303
- Ben Lee
  - Insilicom LLC, Tallahassee, FL 32303
- Peili Zhang
  - Forward Informatics, Winchester, Massachusetts, 01890
- Xing Qiu
  - Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14642
- Tingting Zhao
  - Insilicom LLC, Tallahassee, FL 32303
  - Department of Geography, Florida State University, Tallahassee, FL 32306
- Jinfeng Zhang
  - Department of Statistics, Florida State University, Tallahassee, FL 32306
  - Insilicom LLC, Tallahassee, FL 32303
2
Li J, Zhang H, Wang J, Deng M, Li Z, Jiang W, Xu K, Wu L, Dong Z, Liu J, Ding Q, Yu H. Development and Validation of an AI-Driven System for Automatic Literature Analysis and Molecular Regulatory Network Construction. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2024; 11:e2405395. [PMID: 39373342 PMCID: PMC11600262 DOI: 10.1002/advs.202405395] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2024] [Revised: 09/06/2024] [Indexed: 10/08/2024]
Abstract
Decoding gene regulatory networks is essential for understanding the mechanisms underlying many complex diseases. This study presents GENET, an automated system designed to extract and visualize extensive molecular relationships from published biomedical literature. Using natural language processing, entities and relations are identified from a randomly selected set of 1788 scientific articles and visualized in a filterable knowledge graph. The performance of GENET is evaluated and compared with existing methods. The named entity recognition model achieved an overall precision of 94.23% (4835/5131; 93.56-94.84%), recall of 97.72% (4835/4948; 97.27-98.10%), and an F1 score of 95.94%. The relation extraction model achieved an overall precision of 91.63% (2593/2830; 90.55-92.59%), recall of 89.17% (2593/2908; 87.99-90.25%), and an F1 score of 90.38%. GENET significantly outperforms existing methods in extracting molecular relationships (P < 0.001). Additionally, GENET predicted that WNT family member 4 regulates insulin-like growth factor 2 via signal transducer and activator of transcription 3 in colon cancer. This prediction was validated with RNA sequencing data and multiple immunofluorescence, supporting the feasibility of GENET.
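The reported scores can be reproduced from the raw counts quoted in the abstract. The following is a minimal sketch, assuming the standard definitions (precision = correct / predicted, recall = correct / actual, F1 = harmonic mean of the two); the function name `prf1` is illustrative and is not part of GENET.

```python
def prf1(correct: int, predicted: int, actual: int) -> tuple:
    """Return (precision, recall, F1) as percentages rounded to two decimals.

    correct   -- items the model extracted that match the gold standard
    predicted -- all items the model extracted
    actual    -- all items in the gold standard
    """
    precision = correct / predicted
    recall = correct / actual
    f1 = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * value, 2) for value in (precision, recall, f1))


# Counts taken from the abstract above.
ner = prf1(correct=4835, predicted=5131, actual=4948)  # named entity recognition
rel = prf1(correct=2593, predicted=2830, actual=2908)  # relation extraction
print(ner)
print(rel)
```

Both tuples agree with the percentages stated in the abstract, which suggests the reported F1 scores are computed from the pooled counts rather than averaged per class.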
Affiliation(s)
- Jia Li
  - Department of Gastroenterology, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Key Laboratory of Digestive System, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Provincial Clinical Research Center for Digestive Disease Minimally Invasive Incision, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Engineering Research Center for Artificial Intelligence Endoscopy Interventional Treatment of Hubei Province, Wuhan, Hubei 430060, P. R. China
- Hailin Zhang
  - Department of Gastroenterology, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Key Laboratory of Digestive System, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Provincial Clinical Research Center for Digestive Disease Minimally Invasive Incision, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Engineering Research Center for Artificial Intelligence Endoscopy Interventional Treatment of Hubei Province, Wuhan, Hubei 430060, P. R. China
- Jiamin Wang
  - Department of Gastroenterology, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Key Laboratory of Digestive System, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Provincial Clinical Research Center for Digestive Disease Minimally Invasive Incision, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Engineering Research Center for Artificial Intelligence Endoscopy Interventional Treatment of Hubei Province, Wuhan, Hubei 430060, P. R. China
- Mei Deng
  - Department of Gastroenterology, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Key Laboratory of Digestive System, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Provincial Clinical Research Center for Digestive Disease Minimally Invasive Incision, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Engineering Research Center for Artificial Intelligence Endoscopy Interventional Treatment of Hubei Province, Wuhan, Hubei 430060, P. R. China
- Zhiyong Li
  - Engineering Research Center for Artificial Intelligence Endoscopy Interventional Treatment of Hubei Province, Wuhan, Hubei 430060, P. R. China
- Wei Jiang
  - Engineering Research Center for Artificial Intelligence Endoscopy Interventional Treatment of Hubei Province, Wuhan, Hubei 430060, P. R. China
- Kejin Xu
  - Engineering Research Center for Artificial Intelligence Endoscopy Interventional Treatment of Hubei Province, Wuhan, Hubei 430060, P. R. China
- Lianlian Wu
  - Department of Gastroenterology, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Key Laboratory of Digestive System, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Provincial Clinical Research Center for Digestive Disease Minimally Invasive Incision, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Engineering Research Center for Artificial Intelligence Endoscopy Interventional Treatment of Hubei Province, Wuhan, Hubei 430060, P. R. China
- Zehua Dong
  - Department of Gastroenterology, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Key Laboratory of Digestive System, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Provincial Clinical Research Center for Digestive Disease Minimally Invasive Incision, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Engineering Research Center for Artificial Intelligence Endoscopy Interventional Treatment of Hubei Province, Wuhan, Hubei 430060, P. R. China
- Jun Liu
  - Department of Gastroenterology, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Provincial Clinical Research Center for Digestive Disease Minimally Invasive Incision, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Nursing Department of Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
- Qianshan Ding
  - Department of Gastroenterology, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Key Laboratory of Digestive System, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Provincial Clinical Research Center for Digestive Disease Minimally Invasive Incision, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Engineering Research Center for Artificial Intelligence Endoscopy Interventional Treatment of Hubei Province, Wuhan, Hubei 430060, P. R. China
- Honggang Yu
  - Department of Gastroenterology, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Key Laboratory of Digestive System, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Hubei Provincial Clinical Research Center for Digestive Disease Minimally Invasive Incision, Renmin Hospital of Wuhan University, Wuhan, Hubei 430060, P. R. China
  - Engineering Research Center for Artificial Intelligence Endoscopy Interventional Treatment of Hubei Province, Wuhan, Hubei 430060, P. R. China
3
Rehana H, Çam NB, Basmaci M, Zheng J, Jemiyo C, He Y, Özgür A, Hur J. Evaluating GPT and BERT models for protein-protein interaction identification in biomedical text. Bioinformatics Advances 2024; 4:vbae133. [PMID: 39319026 PMCID: PMC11419952 DOI: 10.1093/bioadv/vbae133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2024] [Revised: 08/16/2024] [Accepted: 09/09/2024] [Indexed: 09/26/2024]
Abstract
Motivation: Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. As biomedical literature continues to grow rapidly, there is an increasing need for automated and accurate extraction of these interactions to facilitate scientific discovery. Pretrained language models, such as generative pretrained transformers and bidirectional encoder representations from transformers, have shown promising results in natural language processing tasks. Results: We evaluated the performance of PPI identification using multiple transformer-based models across three manually curated gold-standard corpora: Learning Language in Logic with 164 interactions in 77 sentences, Human Protein Reference Database with 163 interactions in 145 sentences, and Interaction Extraction Performance Assessment with 335 interactions in 486 sentences. Models based on bidirectional encoder representations achieved the best overall performance, with BioBERT achieving the highest recall of 91.95% and F1 score of 86.84% on the Learning Language in Logic dataset. Despite not being explicitly trained for biomedical texts, GPT-4 showed commendable performance, comparable to the bidirectional encoder models. Specifically, GPT-4 achieved the highest precision of 88.37%, a recall of 85.14%, and an F1 score of 86.49% on the same dataset. These results suggest that GPT-4 can effectively detect protein interactions from text, offering valuable applications in mining biomedical literature. Availability and implementation: The source code and datasets used in this study are available at https://github.com/hurlab/PPI-GPT-BERT.
Affiliation(s)
- Hasin Rehana
  - Department of Computer Science, School of Electrical Engineering & Computer Science, University of North Dakota, Grand Forks, ND 58202, United States
  - Department of Biomedical Sciences, School of Medicine and Health Sciences, University of North Dakota, Grand Forks, ND 58202, United States
- Nur Bengisu Çam
  - Department of Computer Engineering, Bogazici University, Istanbul 34342, Turkey
- Mert Basmaci
  - Department of Computer Engineering, Bogazici University, Istanbul 34342, Turkey
- Jie Zheng
  - Unit for Laboratory Animal Medicine, University of Michigan, Ann Arbor, MI 48109, United States
- Christianah Jemiyo
  - Department of Biomedical Sciences, School of Medicine and Health Sciences, University of North Dakota, Grand Forks, ND 58202, United States
- Yongqun He
  - Unit for Laboratory Animal Medicine, University of Michigan, Ann Arbor, MI 48109, United States
  - Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, United States
- Arzucan Özgür
  - Department of Computer Engineering, Bogazici University, Istanbul 34342, Turkey
- Junguk Hur
  - Department of Biomedical Sciences, School of Medicine and Health Sciences, University of North Dakota, Grand Forks, ND 58202, United States
4
Cai J, Han R, Li J, Hao J, Zhao Z, Jing D. Exploring mechanobiology network of bone and dental tissue based on Natural Language Processing. J Biomech 2024; 174:112271. [PMID: 39159585 DOI: 10.1016/j.jbiomech.2024.112271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2024] [Revised: 07/04/2024] [Accepted: 08/08/2024] [Indexed: 08/21/2024]
Abstract
Bone and cartilage tissues are physiologically dynamic and are systematically regulated by mechanical inputs. At the cellular level, mechanical stimulation engages an intricate network in which mechano-sensors and transmitters cooperate to regulate downstream signaling. Despite accumulating evidence, the available information remains notably underutilized because of limited integration and analysis. In this context, we conceived an interactive web tool named MechanoBone to open a new avenue for literature-based discovery. First, we compiled a literature database by sourcing content from PubMed and processing it with the Natural Language Toolkit, PubTator, and a custom library. We identified direct co-occurrences among entities based on existing evidence and archived them in a relational SQLite database. Latent connections were then quantified using a link prediction algorithm. Second, mechanobiological pathway maps were generated, and an entity-pathway correlation scoring system was established through a weighted algorithm based on our database, STRING, and KEGG, predicting potential functions of specific entities. Additionally, we established a mechanical circumstance-based exploration that ranks genes by their relevance based on big data, revealing potential mechanically sensitive factors for bone research and future clinical applications. In conclusion, MechanoBone enables: 1) interpreting mechanobiological processes; 2) identifying correlations and crosstalk among molecules and pathways under specific mechanical conditions; 3) connecting clinical applications with mechanobiological processes in bone research. It offers a literature mining tool with visualization and interactivity, facilitating targeted molecule navigation and prediction within the mechanobiological framework of bone-related cells, thereby enhancing knowledge sharing and big data analysis in the biomedical realm.
Affiliation(s)
- Jingyi Cai
  - State Key Laboratory of Oral Diseases & National Clinical Research Center for Oral Diseases, West China Hospital of Stomatology, Sichuan University, Chengdu 610041, China.
- RuiYing Han
  - State Key Laboratory of Oral Diseases & National Clinical Research Center for Oral Diseases, West China Hospital of Stomatology, Sichuan University, Chengdu 610041, China.
- Junfu Li
  - Glasgow College, University of Electronic Science and Technology of China, Chengdu 611731, China.
- Jin Hao
  - State Key Laboratory of Oral Diseases & National Clinical Research Center for Oral Diseases, West China Hospital of Stomatology, Sichuan University, Chengdu 610041, China; ChohoTech Inc., Hangzhou 311100, China.
- Zhihe Zhao
  - State Key Laboratory of Oral Diseases & National Clinical Research Center for Oral Diseases, West China Hospital of Stomatology, Sichuan University, Chengdu 610041, China.
- Dian Jing
  - Department of Orthodontics, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, College of Stomatology, Shanghai Jiao Tong University, National Center for Stomatology, National Clinical Research Center for Oral Diseases, Shanghai Key Laboratory of Stomatology, Shanghai 200011, China.
5
Jadhav A, Kumar T, Raghavendra M, Loganathan T, Narayanan M. Predicting cross-tissue hormone-gene relations using balanced word embeddings. Bioinformatics 2022; 38:4771-4781. [PMID: 36000859 PMCID: PMC9563690 DOI: 10.1093/bioinformatics/btac578] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/23/2022] [Revised: 07/29/2022] [Accepted: 08/23/2022] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Inter-organ/inter-tissue communication is central to multi-cellular organisms including humans, and mapping inter-tissue interactions can advance system-level whole-body modeling efforts. Large volumes of biomedical literature have fostered studies that map within-tissue or tissue-agnostic interactions, but literature-mining studies that infer inter-tissue relations, such as between hormones and genes, are sorely missing. RESULTS: We present a first study to predict from biomedical literature the hormone-gene associations mediating inter-tissue signaling in the human body. Our BioEmbedS* models use neural network-based Biomedical word Embeddings with a Support Vector Machine classifier to predict whether a hormone-gene pair is associated, and whether an associated gene is involved in the hormone's production or response. Model training relies on our unified dataset Hormone-Gene version 1 of ground-truth associations between genes and endocrine hormones, which we compiled and carefully balanced in the embedded space to handle data disparities, such as between poorly- versus well-studied hormones. Our BioEmbedS model recapitulates known gene mediators of tissue-tissue signaling with 70.4% accuracy; predicts novel inter-tissue communication genes in humans, which are enriched for hormone-related disorders; and generalizes well to mouse, thereby holding promise for its extension to other multi-cellular organisms as well. AVAILABILITY AND IMPLEMENTATION: Our model predictions and datasets are freely available at https://cross-tissue-signaling.herokuapp.com; https://github.com/BIRDSgroup/BioEmbedS has all relevant code. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Aditya Jadhav
  - Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, Chennai, India
- Tarun Kumar
  - Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, Chennai, India
  - Initiative for Biological Systems Engineering, IIT Madras, Chennai, India
  - Robert Bosch Centre for Data Science and Artificial Intelligence, IIT Madras, Chennai, India
- Mohit Raghavendra
  - Department of Information Technology, National Institute of Technology Karnataka, Surathkal, India
- Tamizhini Loganathan
  - Initiative for Biological Systems Engineering, IIT Madras, Chennai, India
  - Robert Bosch Centre for Data Science and Artificial Intelligence, IIT Madras, Chennai, India
- Manikandan Narayanan
  - Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, Chennai, India
  - Initiative for Biological Systems Engineering, IIT Madras, Chennai, India
  - Robert Bosch Centre for Data Science and Artificial Intelligence, IIT Madras, Chennai, India
6
Le TD, Nguyen PD, Korkin D, Thieu T. PHILM2Web: A high-throughput database of macromolecular host–pathogen interactions on the Web. Database (Oxford) 2022; 2022:6625823. [PMID: 35776535 PMCID: PMC9248916 DOI: 10.1093/database/baac042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2021] [Revised: 04/27/2022] [Accepted: 05/31/2022] [Indexed: 12/02/2022]
Abstract
During infection, the pathogen’s entry into the host organism, breaching of the host immune defense, spread and multiplication are frequently mediated by multiple interactions between host and pathogen proteins. The systematic study of host–pathogen interactions (HPIs) is a challenging task for both experimental and computational approaches and is critically dependent on previously obtained knowledge about these interactions found in the biomedical literature. While several HPI databases exist that manually filter HPI protein–protein interactions from generic databases and curated experimental interactomic studies, no comprehensive database of HPIs obtained from the biomedical literature is currently available. Here, we introduce a high-throughput literature-mining platform for extracting HPI data that includes the most comprehensive collection to date of HPIs obtained from PubMed abstracts. Our HPI data portal, PHILM2Web (Pathogen–Host Interactions by Literature Mining on the Web), integrates an automatically generated database of interactions extracted by PHILM, our high-precision HPI literature-mining algorithm. Currently, the database contains 23 581 generic HPIs between 157 host and 403 pathogen organisms from 11 609 abstracts. The interactions were obtained by processing 608 972 PubMed abstracts, each containing mentions of at least one host and one pathogen organism. In response to the coronavirus disease 2019 (COVID-19) pandemic, we also used PHILM to process 25 796 PubMed abstracts obtained by the same query as the COVID-19 Open Research Dataset. This COVID-19 processing batch resulted in 257 HPIs between 19 host and 31 pathogen organisms from 167 abstracts. The entire HPI dataset is accessible via a searchable PHILM2Web interface; scientists can also download the entire database in bulk for offline processing. Database URL: http://philm2web.live
Affiliation(s)
- Tuan-Dung Le
  - Department of Computer Science, Oklahoma State University, Stillwater, OK, USA
- Phuong D Nguyen
  - Department of Biochemistry and Molecular Biology, Oklahoma State University, Stillwater, OK, USA
- Dmitry Korkin
  - Department of Computer Science and Bioinformatics and Computational Biology Program, Worcester Polytechnic Institute, Worcester, MA, USA
- Thanh Thieu
  - Machine Learning Department, Moffitt Cancer Center and Research Institute, Tampa, FL, USA
7
Sharma VS, Fossati A, Ciuffa R, Buljan M, Williams EG, Chen Z, Shao W, Pedrioli PGA, Purcell AW, Martínez MR, Song J, Manica M, Aebersold R, Li C. PCfun: a hybrid computational framework for systematic characterization of protein complex function. Brief Bioinform 2022; 23:6611913. [PMID: 35724564 PMCID: PMC9310514 DOI: 10.1093/bib/bbac239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2022] [Revised: 05/05/2022] [Accepted: 05/21/2022] [Indexed: 11/14/2022]
Abstract
In molecular biology, it is a general assumption that the ensemble of expressed molecules, their activities and interactions determine biological function, cellular states and phenotypes. Stable protein complexes—or macromolecular machines—are, in turn, the key functional entities mediating and modulating most biological processes. Although identifying protein complexes and their subunit composition can now be done inexpensively and at scale, determining their function remains challenging and labor intensive. This study describes Protein Complex Function predictor (PCfun), the first computational framework for the systematic annotation of protein complex functions using Gene Ontology (GO) terms. PCfun is built upon a word embedding using natural language processing techniques based on 1 million open access PubMed Central articles. Specifically, PCfun leverages two approaches for accurately identifying protein complex function, including: (i) an unsupervised approach that obtains the nearest neighbor (NN) GO term word vectors for a protein complex query vector and (ii) a supervised approach using Random Forest (RF) models trained specifically for recovering the GO terms of protein complex queries described in the CORUM protein complex database. PCfun consolidates both approaches by performing a hypergeometric statistical test to enrich the top NN GO terms within the child terms of the GO terms predicted by the RF models. The documentation and implementation of the PCfun package are available at https://github.com/sharmavaruns/PCfun. We anticipate that PCfun will serve as a useful tool and novel paradigm for the large-scale characterization of protein complex function.
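The consolidation step described above rests on a hypergeometric enrichment test. The sketch below shows the generic upper-tail form of that test using only the standard library. The abstract does not specify how PCfun sets up the population or the counts, so the function name, parameter names, and the example numbers are all hypothetical illustrations rather than the package's actual implementation.

```python
from math import comb


def hypergeom_enrichment_pvalue(population: int, marked: int, draws: int, hits: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= hits).

    population -- total number of GO terms considered (assumed setup)
    marked     -- terms that are children of an RF-predicted GO term
    draws      -- number of top nearest-neighbor terms examined
    hits       -- how many of those top terms are marked
    """
    total = comb(population, draws)
    # math.comb returns 0 when k > n, so impossible combinations drop out.
    tail = sum(
        comb(marked, k) * comb(population - marked, draws - k)
        for k in range(hits, min(draws, marked) + 1)
    )
    return tail / total


# Hypothetical example: 50 candidate terms, 10 of them child terms,
# and 3 of the top 5 nearest neighbors fall among the child terms.
p_value = hypergeom_enrichment_pvalue(population=50, marked=10, draws=5, hits=3)
print(f"{p_value:.4f}")
```

A small p-value here would indicate that the nearest-neighbor terms overlap the child terms of a supervised prediction more than chance would suggest, which is the kind of agreement between the two approaches that the abstract describes.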
Affiliation(s)
- Varun S Sharma
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland; CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna, Austria
- Andrea Fossati
  - Quantitative Biosciences Institute (QBI) and Department of Cellular and Molecular Pharmacology, University of California, San Francisco, CA 94158, USA; J. David Gladstone Institutes, San Francisco, CA 94158, USA
- Rodolfo Ciuffa
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland
- Marija Buljan
  - Empa - Swiss Federal Laboratories for Materials Science and Technology, St. Gallen, Switzerland; Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
- Evan G Williams
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Zhen Chen
  - Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
- Wenguang Shao
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland
- Patrick G A Pedrioli
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland
- Anthony W Purcell
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Jiangning Song
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
- Ruedi Aebersold
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland; Faculty of Science, University of Zurich, Switzerland
- Chen Li
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Switzerland; Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
8
Lou P, Dong Y, Jimeno Yepes A, Li C. A representation model for biological entities by fusing structured axioms with unstructured texts. Bioinformatics 2021; 37:1156-1163. [PMID: 33107905 DOI: 10.1093/bioinformatics/btaa913] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2020] [Revised: 09/04/2020] [Accepted: 10/13/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Structured semantic resources, for example, biological knowledge bases and ontologies, formally define biological concepts, entities and their semantic relationships, manifested as structured axioms and unstructured texts (e.g. textual definitions). The resources contain accurate expressions of biological reality and have been used by machine-learning models to assist intelligent applications like knowledge discovery. The current methods use both the axioms and definitions as plain texts in representation learning (RL). However, since the axioms are machine-readable while the natural language is human-understandable, differences in token meaning and structure prevent the representations from encoding the desired biological knowledge. RESULTS: We propose ERBK, an RL model of bio-entities. Instead of using the axioms and definitions as a textual corpus, our method uses a knowledge graph embedding method and deep convolutional neural models to encode the axioms and definitions, respectively. The representations can not only encode more underlying biological knowledge but also be applied to zero-shot circumstances where existing approaches fall short. Experimental evaluations show that ERBK outperforms the existing methods for predicting protein-protein interactions and gene-disease associations. Moreover, they show that ERBK still maintains promising performance under zero-shot circumstances. We believe the representations and the method have certain generality and could extend to other types of bio-relation. AVAILABILITY AND IMPLEMENTATION: The source code is available at the gitlab repository https://gitlab.com/BioAI/erbk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Peiliang Lou
  - School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; Key Laboratory of Intelligent Networks and Network Security (Xi'an Jiaotong University), Ministry of Education, Xi'an, Shaanxi 710049, China
- YuXin Dong
  - School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
- Chen Li
  - School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; National Engineering Lab for Big Data Analytics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
9
Badal VD, Kundrotas PJ, Vakser IA. Text mining for modeling of protein complexes enhanced by machine learning. Bioinformatics 2021; 37:497-505. [PMID: 32960948 PMCID: PMC8088328 DOI: 10.1093/bioinformatics/btaa823] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/26/2019] [Revised: 09/04/2020] [Accepted: 09/08/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Procedures for structural modeling of protein-protein complexes (protein docking) produce a number of models which need to be further analyzed and scored. Scoring can be based on independently determined constraints on the structure of the complex, such as knowledge of amino acids essential for the protein interaction. Previously, we showed that text mining of residues in freely available PubMed abstracts of papers on studies of protein-protein interactions may generate such constraints. However, the absence of post-processing of the spotted residues reduced the usability of the constraints, as a significant number of the residues were not relevant for the binding of the specific proteins. RESULTS: We explored filtering of the irrelevant residues by two machine learning approaches, Deep Recursive Neural Network (DRNN) and Support Vector Machine (SVM) models with different training/testing schemes. The results showed that the DRNN model is superior to the SVM model when training is performed on the PMC-OA full-text articles and applied to classification (interface or non-interface) of the residues spotted in the PubMed abstracts. When both training and testing are performed on full-text articles or on abstracts, the performance of these models is similar. Thus, in such cases, there is no need to use the DRNN approach, which is computationally expensive, especially at the training stage. The reason is that SVM success is often determined by the similarity in data/text patterns in the training and the testing sets, whereas the sentence structures in the abstracts are, in general, different from those in the full-text articles. AVAILABILITY AND IMPLEMENTATION: The code and the datasets generated in this study are available at https://gitlab.ku.edu/vakser-lab-public/text-mining/-/tree/2020-09-04. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ilya A Vakser
- Computational Biology Program; Department of Molecular Biosciences, The University of Kansas, Lawrence, KS 66045, USA
|
10
|
Zhao S, Su C, Lu Z, Wang F. Recent advances in biomedical literature mining. Brief Bioinform 2021; 22:bbaa057. [PMID: 32422651 PMCID: PMC8138828 DOI: 10.1093/bib/bbaa057] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 12/11/2019] [Revised: 03/22/2020] [Accepted: 03/25/2020] [Indexed: 01/26/2023] Open
Abstract
Recent years have witnessed a rapid increase in the number of scientific articles in the biomedical domain. This literature is mostly available and readily accessible in electronic format. The domain knowledge hidden in it is critical for biomedical research and applications, which puts biomedical literature mining (BLM) techniques in high demand. Numerous efforts have been made on this topic by both the biomedical informatics (BMI) and computer science (CS) communities. The BMI community focuses more on concrete application problems and thus prefers more interpretable and descriptive methods, while the CS community pursues superior performance and generalization ability and thus develops more sophisticated and universal models. The goal of this paper is to provide a review of the recent advances in BLM from both communities and inspire new research directions.
Affiliation(s)
- Sendong Zhao
- Department of Healthcare Policy and Research, Weill Medical College of Cornell University, New York, NY 10065, USA
- Chang Su
- Division of Health Informatics, Department of Healthcare Policy and Research at Weill Cornell Medicine at Cornell University, New York, NY, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI) at National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Fei Wang
- Department of Healthcare Policy and Research, Weill Medical College of Cornell University, New York, NY 10065, USA
|
11
|
Azer K, Kaddi CD, Barrett JS, Bai JPF, McQuade ST, Merrill NJ, Piccoli B, Neves-Zaph S, Marchetti L, Lombardo R, Parolo S, Immanuel SRC, Baliga NS. History and Future Perspectives on the Discipline of Quantitative Systems Pharmacology Modeling and Its Applications. Front Physiol 2021; 12:637999. [PMID: 33841175 PMCID: PMC8027332 DOI: 10.3389/fphys.2021.637999] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 12/07/2020] [Accepted: 01/25/2021] [Indexed: 12/24/2022] Open
Abstract
Mathematical biology and pharmacology models have a long and rich history in the fields of medicine and physiology, impacting our understanding of disease mechanisms and the development of novel therapeutics. With an increased focus on the pharmacology application of system models and the advances in data science spanning mechanistic and empirical approaches, there is a significant opportunity and promise to leverage these advancements to enhance the development and application of the systems pharmacology field. In this paper, we will review milestones in the evolution of mathematical biology and pharmacology models, highlight some of the gaps and challenges in developing and applying systems pharmacology models, and provide a vision for an integrated strategy that leverages advances in adjacent fields to overcome these challenges.
Affiliation(s)
- Karim Azer
- Quantitative Sciences, Bill and Melinda Gates Medical Research Institute, Cambridge, MA, United States
- Chanchala D. Kaddi
- Quantitative Sciences, Bill and Melinda Gates Medical Research Institute, Cambridge, MA, United States
- Jane P. F. Bai
- Office of Clinical Pharmacology, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD, United States
- Sean T. McQuade
- Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, United States
- Nathaniel J. Merrill
- Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, United States
- Benedetto Piccoli
- Department of Mathematical Sciences and Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, United States
- Susana Neves-Zaph
- Translational Disease Modeling, Data and Data Science, Sanofi, Bridgewater, NJ, United States
- Luca Marchetti
- Fondazione the Microsoft Research – University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy
- Rosario Lombardo
- Fondazione the Microsoft Research – University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy
- Silvia Parolo
- Fondazione the Microsoft Research – University of Trento Centre for Computational and Systems Biology (COSBI), Rovereto, Italy
|
12
|
Qu J, Steppi A, Zhong D, Hao J, Wang J, Lung PY, Zhao T, He Z, Zhang J. Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach. BMC Genomics 2020; 21:773. [PMID: 33167858 PMCID: PMC7654050 DOI: 10.1186/s12864-020-07185-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/05/2020] [Accepted: 10/26/2020] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Information on protein-protein interactions affected by mutations is very useful for understanding the biological effect of mutations and for developing treatments targeting the interactions. In this study, we developed a natural language processing (NLP) based machine learning approach for extracting such information from the literature. Our aim is to identify journal abstracts or paragraphs in full-text articles that contain at least one occurrence of a protein-protein interaction (PPI) affected by a mutation. RESULTS Our system makes use of the latest NLP methods with a large number of engineered features, including some based on pre-trained word embeddings. Our final model achieved satisfactory performance in the Document Triage Task of the BioCreative VI Precision Medicine Track, with the highest recall and a comparable F1-score. CONCLUSIONS The performance of our method indicates that it is well suited to being combined with manual annotations. Our machine learning framework and engineered features will also be very helpful for other researchers to further improve this and other related biological text mining tasks using either traditional machine learning or deep learning based methods.
Affiliation(s)
- Jinchan Qu
- Department of Statistics, Florida State University, Tallahassee, FL, 32306, USA
- Albert Steppi
- Laboratory of Systems Pharmacology at Harvard Medical School, Boston, MA, 02115, USA
- Dongrui Zhong
- Department of Statistics, Florida State University, Tallahassee, FL, 32306, USA
- Jie Hao
- Department of Statistics, Florida State University, Tallahassee, FL, 32306, USA
- Jian Wang
- CloudMedx, Palo Alto, CA, 94301, USA
- Pei-Yau Lung
- Verisk - Insurance Solutions, Middletown, CT, 06457, USA
- Tingting Zhao
- Department of Geography, Florida State University, Tallahassee, FL, 32306, USA
- Zhe He
- College of Communication and Information, Florida State University, Tallahassee, FL, 32306, USA
- Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL, 32306, USA
|
13
|
Deng Z, Yin K, Bao Y, Armengol VD, Wang C, Tiwari A, Barzilay R, Parmigiani G, Braun D, Hughes KS. Validation of a Semiautomated Natural Language Processing-Based Procedure for Meta-Analysis of Cancer Susceptibility Gene Penetrance. JCO Clin Cancer Inform 2020; 3:1-9. [PMID: 31419182 DOI: 10.1200/cci.19.00043] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Indexed: 11/20/2022] Open
Abstract
PURPOSE Quantifying the risk of cancer associated with pathogenic mutations in germline cancer susceptibility genes (that is, penetrance) enables the personalization of preventive management strategies. Conducting a meta-analysis is the best way to obtain robust risk estimates. We have previously developed a natural language processing (NLP)-based abstract classifier that classifies abstracts as relevant to penetrance, prevalence of mutations, both, or neither. In this work, we evaluate the performance of this NLP-based procedure. MATERIALS AND METHODS We compared the semiautomated NLP-based procedure, which involves automated abstract classification and text mining, followed by human review of identified studies, with the traditional procedure that requires human review of all studies. Ten high-quality gene-cancer penetrance meta-analyses spanning 16 gene-cancer associations were used as the gold standard by which to evaluate the performance of our procedure. For each meta-analysis, we evaluated the number of abstracts that required human review (workload) and the ability to identify the studies that were included by the authors in their quantitative analysis (coverage). RESULTS Compared with the traditional procedure, the semiautomated NLP-based procedure led to a lower workload across all 10 meta-analyses, with an overall 84% reduction (2,774 abstracts vs 16,941 abstracts) in the amount of human review required. Overall coverage was 93% (132 of 142 studies identified) before reviewing references of identified studies. Reasons for the 10 missed studies included blank and poorly written abstracts. After reviewing references, nine of the previously missed studies were identified and coverage improved to 99% (141 of 142 studies). CONCLUSION We demonstrated that an NLP-based procedure can significantly reduce the review workload without compromising the ability to identify relevant studies. NLP algorithms have promising potential for reducing human efforts in the literature review process.
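The workload and coverage percentages quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical and do not come from the paper, and the counts are the ones stated in the abstract.

```python
def workload_reduction(reviewed: int, total: int) -> float:
    """Fraction of abstracts that no longer need human review."""
    return 1 - reviewed / total


def coverage(found: int, included: int) -> float:
    """Fraction of gold-standard studies that the procedure identified."""
    return found / included


# Counts quoted in the abstract (hypothetical variable names).
reduction = workload_reduction(2774, 16941)
coverage_before = coverage(132, 142)   # before reviewing references
coverage_after = coverage(141, 142)    # after reviewing references

print(f"workload reduction: {reduction:.0%}")
print(f"coverage before reference review: {coverage_before:.0%}")
print(f"coverage after reference review: {coverage_after:.0%}")
```

Rounded to whole percentages, these values agree with the 84%, 93%, and 99% figures reported above.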
Affiliation(s)
- Kanhua Yin
- Massachusetts General Hospital, Boston, MA
- Yujia Bao
- Massachusetts Institute of Technology, Boston, MA
- Cathy Wang
- Harvard TH Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
- Giovanni Parmigiani
- Harvard TH Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
- Danielle Braun
- Harvard TH Chan School of Public Health, Boston, MA; Dana-Farber Cancer Institute, Boston, MA
- Kevin S Hughes
- Massachusetts General Hospital, Boston, MA; Harvard Medical School, Boston, MA
|
14
|
A novel machine learning framework for automated biomedical relation extraction from large-scale literature repositories. Nat Mach Intell 2020. [DOI: 10.1038/s42256-020-0189-y] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Indexed: 12/11/2022]
|
15
|
Poverennaya EV, Kiseleva OI, Ivanov AS, Ponomarenko EA. Methods of Computational Interactomics for Investigating Interactions of Human Proteoforms. Biochemistry (Moscow) 2020; 85:68-79. [PMID: 32079518 DOI: 10.1134/s000629792001006x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/20/2023]
Abstract
The human genome contains ca. 20,000 protein-coding genes that could be translated into millions of unique protein species (proteoforms). Proteoforms coded by a single gene often have different functions, which implies different protein partners. By interacting with each other, proteoforms create a network reflecting the dynamics of cellular processes in an organism. Perturbations of protein-protein interactions change the network topology, which often triggers pathological processes. Studying proteoforms is a relatively new research area in proteomics, which is why there are comparatively few experimental studies on the interaction of proteoforms. Bioinformatics tools can facilitate such studies by providing valuable complementary information to the experimental data and, in particular, by expanding the possibilities for studying proteoform interactions.
Affiliation(s)
- O I Kiseleva
- Institute of Biomedical Chemistry, Moscow, 119121, Russia
- A S Ivanov
- Institute of Biomedical Chemistry, Moscow, 119121, Russia
|
16
|
Lou P, Jimeno Yepes A, Zhang Z, Zheng Q, Zhang X, Li C. BioNorm: deep learning-based event normalization for the curation of reaction databases. Bioinformatics 2020; 36:611-620. [PMID: 31350561 DOI: 10.1093/bioinformatics/btz571] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/22/2019] [Revised: 06/27/2019] [Accepted: 07/19/2019] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION A biochemical reaction, or bio-event, depicts the relationships between participating entities. Current text mining research has focused on identifying bio-events from scientific literature. However, few efforts have been dedicated to normalizing bio-events extracted from scientific literature against the entries in curated reaction databases, which could disambiguate the events and further support interconnecting events into biologically meaningful and complete networks. RESULTS In this paper, we propose BioNorm, a novel method of normalizing bio-events extracted from scientific literature to entries in a bio-molecular reaction database, e.g. IntAct. BioNorm treats event normalization as a paraphrase identification problem. It represents an entry as a natural language statement by combining multiple types of information contained in it. Then, it predicts the semantic similarity between the natural language statement and the statements mentioning events in scientific literature using a long short-term memory recurrent neural network (LSTM). An event is normalized to the entry if the two statements are paraphrases. To the best of our knowledge, this is the first attempt at event normalization in biomedical text mining. The experiments were conducted using the molecular interaction data from IntAct. The results demonstrate that the method can achieve an F-score of 0.87 in normalizing event-containing statements. AVAILABILITY AND IMPLEMENTATION The source code is available at the gitlab repository https://gitlab.com/BioAI/leen and BioASQvec Plus is available on figshare https://figshare.com/s/45896c31d10c3f6d857a.
Affiliation(s)
- Peiliang Lou
- Department of Computer Science and Technology, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; Key Laboratory of Intelligent Networks and Network Security (Xi'an Jiaotong University), Ministry of Education, Xi'an, Shaanxi 710049, China
- Zai Zhang
- Department of Computer Science and Technology, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
- Qinghua Zheng
- Department of Computer Science and Technology, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; National Engineering Lab for Big Data Analytics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
- Xiangrong Zhang
- Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Artificial Intelligence, Xidian University, Xi'an, 710071, China
- Chen Li
- Department of Computer Science and Technology, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China; National Engineering Lab for Big Data Analytics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
|
17
|
|
18
|
Caufield JH, Ping P. New advances in extracting and learning from protein-protein interactions within unstructured biomedical text data. Emerg Top Life Sci 2019; 3:357-369. [PMID: 33523203 DOI: 10.1042/etls20190003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/05/2019] [Revised: 07/11/2019] [Accepted: 07/16/2019] [Indexed: 12/14/2022]
Abstract
Protein-protein interactions, or PPIs, constitute a basic unit of our understanding of protein function. Though substantial effort has been made to organize PPI knowledge into structured databases, maintenance of these resources requires careful manual curation. Even then, many PPIs remain uncurated within unstructured text data. Extracting PPIs from experimental research supports assembly of PPI networks and highlights relationships crucial to elucidating protein functions. Isolating specific protein-protein relationships from numerous documents is technically demanding by both manual and automated means. Recent advances in the design of these methods have leveraged emerging computational developments and have demonstrated impressive results on test datasets. In this review, we discuss recent developments in PPI extraction from unstructured biomedical text. We explore the historical context of these developments, recent strategies for integrating and comparing PPI data, and their application to advancing the understanding of protein function. Finally, we describe the challenges facing the application of PPI mining to the text concerning protein families, using the multifunctional 14-3-3 protein family as an example.
Affiliation(s)
- J Harry Caufield
- The NIH BD2K Center of Excellence in Biomedical Computing, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Physiology, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Peipei Ping
- The NIH BD2K Center of Excellence in Biomedical Computing, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Physiology, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Medicine/Cardiology, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Bioinformatics, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Scalable Analytics Institute (ScAi), University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
|
19
|
Lung PY, He Z, Zhao T, Yu D, Zhang J. Extracting chemical-protein interactions from literature using sentence structure analysis and feature engineering. Database (Oxford) 2019; 2019:5280305. [PMID: 30624652 PMCID: PMC6323317 DOI: 10.1093/database/bay138] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 02/23/2018] [Revised: 12/04/2018] [Accepted: 12/06/2018] [Indexed: 12/14/2022]
Abstract
Information about the interactions between chemical compounds and proteins is indispensable for understanding the regulation of biological processes and the development of therapeutic drugs. Manually extracting such information from biomedical literature is very time- and resource-consuming. In this study, we propose a computational method to automatically extract chemical-protein interactions (CPIs) from a given text. Our method extracts CPI pairs and CPI triplets from sentences, where a CPI pair consists of a chemical compound and a protein name, and a CPI triplet consists of a CPI pair along with an interaction word describing their relationship. We extracted a diverse set of features from sentences that were used to build multiple machine learning models. Our models contain both simple features, which can be directly computed from sentences, and more sophisticated features derived using sentence structure analysis techniques. For example, one set of features was extracted based on the shortest paths between the CPI pairs or among the CPI triplets in the dependency graphs obtained from sentence parsing. We designed a three-stage approach to predict the multiple categories of CPIs. Our method performed the best among systems that use non-deep learning methods and outperformed several deep-learning-based systems in track 5 of the BioCreative VI challenge. The features we designed in this study are informative and can be applied to other machine learning methods, including deep learning.
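The shortest-path features mentioned in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the dependency edges are written by hand for a toy sentence rather than produced by a parser, and the function name and feature names are hypothetical.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected graph of (head, dependent) pairs."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # start and goal are not connected


# Hand-written toy edges for "Aspirin strongly inhibits the COX-1 enzyme"
# (an assumption for illustration; a real system would take these from a parser).
edges = [
    ("inhibits", "Aspirin"),
    ("inhibits", "strongly"),
    ("inhibits", "enzyme"),
    ("enzyme", "the"),
    ("enzyme", "COX-1"),
]

path = shortest_path(edges, "Aspirin", "COX-1")
features = {
    "path_length": len(path) - 1,   # number of dependency edges on the path
    "path_words": path[1:-1],       # words between the chemical and the protein
}
print(path)
print(features)
```

Features of this kind focus on the words that connect the chemical and the protein and ignore modifiers such as "strongly" that lie off the path.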
Affiliation(s)
- Pei-Yau Lung
- Department of Statistics, Florida State University, Tallahassee, FL, USA
- Zhe He
- School of Information, Florida State University, Tallahassee, FL, USA
- Tingting Zhao
- Department of Geography, Florida State University, Tallahassee, FL, USA
- Disa Yu
- Department of Statistics, Florida State University, Tallahassee, FL, USA
- Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL, USA
|
20
|
He Z, Tao C, Bian J, Zhang R, Huang J. Introduction: selected extended articles from the 2nd International Workshop on Semantics-Powered Data Analytics (SEPDA 2017). BMC Med Inform Decis Mak 2018; 18:56. [PMID: 30066636 PMCID: PMC6069756 DOI: 10.1186/s12911-018-0624-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022] Open
Abstract
In this editorial, we first summarize the 2nd International Workshop on Semantics-Powered Data Analytics (SEPDA 2017) held on November 13, 2017 in Kansas City, Missouri, U.S.A., and then briefly introduce 13 research articles included in this supplement issue, covering topics such as Semantic Integration, Deep Learning, Knowledge Base Construction, and Natural Language Processing.
Affiliation(s)
- Zhe He
- School of Information, Florida State University, 142 Collegiate Loop, Tallahassee, 32306, FL, USA
- Cui Tao
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL, USA
- Rui Zhang
- Institute for Health Informatics and College of Pharmacy, University of Minnesota, Minneapolis, MN, USA
- Jingshan Huang
- School of Computing, University of South Alabama, Mobile, AL, USA
|