1
|
Erdengasileng A, Han Q, Zhao T, Tian S, Sui X, Li K, Wang W, Wang J, Hu T, Pan F, Zhang Y, Zhang J. Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification. Database (Oxford) 2022; 2022:baac066. [PMID: 35962559 PMCID: PMC9375052 DOI: 10.1093/database/baac066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2022] [Revised: 07/29/2022] [Accepted: 08/09/2022] [Indexed: 11/19/2022]
Abstract
Large volumes of publications are being produced in biomedical sciences nowadays with ever-increasing speed. To deal with the large amount of unstructured text data, effective natural language processing (NLP) methods need to be developed for various tasks such as document classification and information extraction. BioCreative Challenge was established to evaluate the effectiveness of information extraction methods in biomedical domain and facilitate their development as a community-wide effort. In this paper, we summarize our work and what we have learned from the latest round, BioCreative Challenge VII, where we participated in all five tracks. Overall, we found three key components for achieving high performance across a variety of NLP tasks: (1) pre-trained NLP models; (2) data augmentation strategies and (3) ensemble modelling. These three strategies need to be tailored towards the specific tasks at hands to achieve high-performing baseline models, which are usually good enough for practical applications. When further combined with task-specific methods, additional improvements (usually rather small) can be achieved, which might be critical for winning competitions. Database URL: https://doi.org/10.1093/database/baac066.
Collapse
Affiliation(s)
| | - Qing Han
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| | - Tingting Zhao
- Department of Geography, Florida State University, Tallahassee, FL 32306, USA
| | - Shubo Tian
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| | - Xin Sui
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| | - Keqiao Li
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| | - Wanjing Wang
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| | - Jian Wang
- Cloudmedx Inc, Palo Alto, CA 94301, USA
| | - Ting Hu
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| | - Feng Pan
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| | - Yuan Zhang
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL 32306, USA
| |
Collapse
|
2
|
AlSaieedi A, Salhi A, Tifratene F, Raies AB, Hungler A, Uludag M, Van Neste C, Bajic VB, Gojobori T, Essack M. DES-Tcell is a knowledgebase for exploring immunology-related literature. Sci Rep 2021; 11:14344. [PMID: 34253812 PMCID: PMC8275784 DOI: 10.1038/s41598-021-93809-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2021] [Accepted: 06/24/2021] [Indexed: 12/02/2022] Open
Abstract
T-cells are a subtype of white blood cells circulating throughout the body, searching for infected and abnormal cells. They have multifaceted functions that include scanning for and directly killing cells infected with intracellular pathogens, eradicating abnormal cells, orchestrating immune response by activating and helping other immune cells, memorizing encountered pathogens, and providing long-lasting protection upon recurrent infections. However, T-cells are also involved in immune responses that result in organ transplant rejection, autoimmune diseases, and some allergic diseases. To support T-cell research, we developed the DES-Tcell knowledgebase (KB). This KB incorporates text- and data-mined information that can expedite retrieval and exploration of T-cell relevant information from the large volume of published T-cell-related research. This KB enables exploration of data through concepts from 15 topic-specific dictionaries, including immunology-related genes, mutations, pathogens, and pathways. We developed three case studies using DES-Tcell, one of which validates effective retrieval of known associations by DES-Tcell. The second and third case studies focuses on concepts that are common to Grave’s disease (GD) and Hashimoto’s thyroiditis (HT). Several reports have shown that up to 20% of GD patients treated with antithyroid medication develop HT, thus suggesting a possible conversion or shift from GD to HT disease. DES-Tcell found miR-4442 links to both GD and HT, and that miR-4442 possibly targets the autoimmune disease risk factor CD6, which provides potential new knowledge derived through the use of DES-Tcell. According to our understanding, DES-Tcell is the first KB dedicated to exploring T-cell-relevant information via literature-mining, data-mining, and topic-specific dictionaries.
Collapse
Affiliation(s)
- Ahdab AlSaieedi
- Department of Medical Laboratory Technology (MLT), Faculty of Applied Medical Sciences (FAMS), King Abdulaziz University (KAU), Jeddah, 21589-80324, Saudi Arabia
| | - Adil Salhi
- Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
| | - Faroug Tifratene
- Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
| | - Arwa Bin Raies
- Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
| | - Arnaud Hungler
- Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
| | - Mahmut Uludag
- Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
| | - Christophe Van Neste
- Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
| | - Vladimir B Bajic
- Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
| | - Takashi Gojobori
- Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
| | - Magbubah Essack
- Computer, Electrical, and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia.
| |
Collapse
|
3
|
Qu J, Steppi A, Zhong D, Hao J, Wang J, Lung PY, Zhao T, He Z, Zhang J. Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach. BMC Genomics 2020; 21:773. [PMID: 33167858 PMCID: PMC7654050 DOI: 10.1186/s12864-020-07185-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2020] [Accepted: 10/26/2020] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Information on protein-protein interactions affected by mutations is very useful for understanding the biological effect of mutations and for developing treatments targeting the interactions. In this study, we developed a natural language processing (NLP) based machine learning approach for extracting such information from literature. Our aim is to identify journal abstracts or paragraphs in full-text articles that contain at least one occurrence of a protein-protein interaction (PPI) affected by a mutation. RESULTS Our system makes use of latest NLP methods with a large number of engineered features including some based on pre-trained word embedding. Our final model achieved satisfactory performance in the Document Triage Task of the BioCreative VI Precision Medicine Track with highest recall and comparable F1-score. CONCLUSIONS The performance of our method indicates that it is ideally suited for being combined with manual annotations. Our machine learning framework and engineered features will also be very helpful for other researchers to further improve this and other related biological text mining tasks using either traditional machine learning or deep learning based methods.
Collapse
Affiliation(s)
- Jinchan Qu
- Department of Statistics, Florida State University, Tallahassee, FL, 32306, USA
| | - Albert Steppi
- Laboratory of Systems Pharmacology at Harvard Medical School, Boston, MA, 02115, USA
| | - Dongrui Zhong
- Department of Statistics, Florida State University, Tallahassee, FL, 32306, USA
| | - Jie Hao
- Department of Statistics, Florida State University, Tallahassee, FL, 32306, USA
| | - Jian Wang
- CloudMedx, Palo Alto, CA, 94301, USA
| | - Pei-Yau Lung
- Verisk - Insurance Solutions, Middletown, CT, 06457, USA
| | - Tingting Zhao
- Department of Geography, Florida State University, Tallahassee, FL, 32306, USA
| | - Zhe He
- College of Communication and Information, Florida State University, Tallahassee, FL, 32306, USA
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL, 32306, USA.
| |
Collapse
|
4
|
DES-ROD: Exploring Literature to Develop New Links between RNA Oxidation and Human Diseases. OXIDATIVE MEDICINE AND CELLULAR LONGEVITY 2020; 2020:5904315. [PMID: 32308806 PMCID: PMC7142358 DOI: 10.1155/2020/5904315] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/09/2020] [Accepted: 02/21/2020] [Indexed: 12/27/2022]
Abstract
Normal cellular physiology and biochemical processes require undamaged RNA molecules. However, RNAs are frequently subjected to oxidative damage. Overproduction of reactive oxygen species (ROS) leads to RNA oxidation and disturbs redox (oxidation-reduction reaction) homeostasis. When oxidation damage affects RNA carrying protein-coding information, this may result in the synthesis of aberrant proteins as well as a lower efficiency of translation. Both of these, as well as imbalanced redox homeostasis, may lead to numerous human diseases. The number of studies on the effects of RNA oxidative damage in mammals is increasing by year due to the understanding that this oxidation fundamentally leads to numerous human diseases. To enable researchers in this field to explore information relevant to RNA oxidation and effects on human diseases, we developed DES-ROD, an online knowledgebase that contains processed information from 298,603 relevant documents that consist of PubMed abstracts and PubMed Central full-text articles. The system utilizes concepts/terms from 38 curated thematic dictionaries mapped to the analyzed documents. Researchers can explore enriched concepts, as well as enriched pairs of putatively associated concepts. In this way, one can explore mutual relationships between any combinations of two concepts from used dictionaries. Dictionaries cover a wide range of biomedical topics, such as human genes and proteins, pathways, Gene Ontology categories, mutations, noncoding RNAs, enzymes, toxins, metabolites, and diseases. This makes insights into different facets of the effects of RNA oxidation and the control of this process possible. The usefulness of the DES-ROD system is demonstrated by case studies on some known information, as well as potentially novel information involving RNA oxidation and diseases. DES-ROD is the first knowledgebase based on text and data mining that focused on the exploration of RNA oxidation and human diseases.
Collapse
|
5
|
Obradovic M, Essack M, Zafirovic S, Sudar‐Milovanovic E, Bajic VP, Van Neste C, Trpkovic A, Stanimirovic J, Bajic VB, Isenovic ER. Redox control of vascular biology. Biofactors 2020; 46:246-262. [PMID: 31483915 PMCID: PMC7187163 DOI: 10.1002/biof.1559] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/25/2019] [Accepted: 08/14/2019] [Indexed: 12/12/2022]
Abstract
Redox control is lost when the antioxidant defense system cannot remove abnormally high concentrations of signaling molecules, such as reactive oxygen species (ROS). Chronically elevated levels of ROS cause oxidative stress that may eventually lead to cancer and cardiovascular and neurodegenerative diseases. In this review, we focus on redox effects in the vascular system. We pay close attention to the subcompartments of the vascular system (endothelium, smooth muscle cell layer) and give an overview of how redox changes influence those different compartments. We also review the core aspects of redox biology, cardiovascular physiology, and pathophysiology. Moreover, the topic-specific knowledgebase DES-RedoxVasc was used to develop two case studies, one focused on endothelial cells and the other on the vascular smooth muscle cells, as a starting point to possibly extend our knowledge of redox control in vascular biology.
Collapse
Affiliation(s)
- Milan Obradovic
- Laboratory of Radiobiology and Molecular GeneticsVinca Institute of Nuclear Sciences, University of BelgradeBelgradeSerbia
| | - Magbubah Essack
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE)ThuwalKingdom of Saudi Arabia
| | - Sonja Zafirovic
- Laboratory of Radiobiology and Molecular GeneticsVinca Institute of Nuclear Sciences, University of BelgradeBelgradeSerbia
| | - Emina Sudar‐Milovanovic
- Laboratory of Radiobiology and Molecular GeneticsVinca Institute of Nuclear Sciences, University of BelgradeBelgradeSerbia
| | - Vladan P. Bajic
- Laboratory of Radiobiology and Molecular GeneticsVinca Institute of Nuclear Sciences, University of BelgradeBelgradeSerbia
| | - Christophe Van Neste
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE)ThuwalKingdom of Saudi Arabia
| | - Andreja Trpkovic
- Laboratory of Radiobiology and Molecular GeneticsVinca Institute of Nuclear Sciences, University of BelgradeBelgradeSerbia
| | - Julijana Stanimirovic
- Laboratory of Radiobiology and Molecular GeneticsVinca Institute of Nuclear Sciences, University of BelgradeBelgradeSerbia
| | - Vladimir B. Bajic
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE)ThuwalKingdom of Saudi Arabia
| | - Esma R. Isenovic
- Laboratory of Radiobiology and Molecular GeneticsVinca Institute of Nuclear Sciences, University of BelgradeBelgradeSerbia
| |
Collapse
|
6
|
Essack M, Salhi A, Stanimirovic J, Tifratene F, Bin Raies A, Hungler A, Uludag M, Van Neste C, Trpkovic A, Bajic VP, Bajic VB, Isenovic ER. Literature-Based Enrichment Insights into Redox Control of Vascular Biology. OXIDATIVE MEDICINE AND CELLULAR LONGEVITY 2019; 2019:1769437. [PMID: 31223421 PMCID: PMC6542245 DOI: 10.1155/2019/1769437] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/07/2019] [Revised: 04/11/2019] [Accepted: 05/02/2019] [Indexed: 02/07/2023]
Abstract
In cellular physiology and signaling, reactive oxygen species (ROS) play one of the most critical roles. ROS overproduction leads to cellular oxidative stress. This may lead to an irrecoverable imbalance of redox (oxidation-reduction reaction) function that deregulates redox homeostasis, which itself could lead to several diseases including neurodegenerative disease, cardiovascular disease, and cancers. In this study, we focus on the redox effects related to vascular systems in mammals. To support research in this domain, we developed an online knowledge base, DES-RedoxVasc, which enables exploration of information contained in the biomedical scientific literature. The DES-RedoxVasc system analyzed 233399 documents consisting of PubMed abstracts and PubMed Central full-text articles related to different aspects of redox biology in vascular systems. It allows researchers to explore enriched concepts from 28 curated thematic dictionaries, as well as literature-derived potential associations of pairs of such enriched concepts, where associations themselves are statistically enriched. For example, the system allows exploration of associations of pathways, diseases, mutations, genes/proteins, miRNAs, long ncRNAs, toxins, drugs, biological processes, molecular functions, etc. that allow for insights about different aspects of redox effects and control of processes related to the vascular system. Moreover, we deliver case studies about some existing or possibly novel knowledge regarding redox of vascular biology demonstrating the usefulness of DES-RedoxVasc. DES-RedoxVasc is the first compiled knowledge base using text mining for the exploration of this topic.
Collapse
Affiliation(s)
- Magbubah Essack
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
| | - Adil Salhi
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
| | - Julijana Stanimirovic
- Vinca Institute, University of Belgrade, Laboratory for Molecular Endocrinology and Radiobiology, Belgrade, Serbia
| | - Faroug Tifratene
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
| | - Arwa Bin Raies
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
| | - Arnaud Hungler
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
| | - Mahmut Uludag
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
| | - Christophe Van Neste
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
| | - Andreja Trpkovic
- Vinca Institute, University of Belgrade, Laboratory for Molecular Endocrinology and Radiobiology, Belgrade, Serbia
| | - Vladan P. Bajic
- Vinca Institute, University of Belgrade, Laboratory for Molecular Endocrinology and Radiobiology, Belgrade, Serbia
| | - Vladimir B. Bajic
- King Abdullah University of Science and Technology, Computational Bioscience Research Center, Thuwal, Saudi Arabia
| | - Esma R. Isenovic
- Vinca Institute, University of Belgrade, Laboratory for Molecular Endocrinology and Radiobiology, Belgrade, Serbia
| |
Collapse
|
7
|
Lung PY, He Z, Zhao T, Yu D, Zhang J. Extracting chemical-protein interactions from literature using sentence structure analysis and feature engineering. Database (Oxford) 2019; 2019:5280305. [PMID: 30624652 PMCID: PMC6323317 DOI: 10.1093/database/bay138] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2018] [Revised: 12/04/2018] [Accepted: 12/06/2018] [Indexed: 12/14/2022]
Abstract
Information about the interactions between chemical compounds and proteins is indispensable for understanding the regulation of biological processes and the development of therapeutic drugs. Manually extracting such information from biomedical literature is very time and resource consuming. In this study, we propose a computational method to automatically extract chemical-protein interactions (CPIs) from a given text. Our method extracts CPI pairs and CPI triplets from sentences, where a CPI pair consists of a chemical compound and a protein name, and a CPI triplet consists of a CPI pair along with an interaction word describing their relationship. We extracted a diverse set of features from sentences that were used to build multiple machine learning models. Our models contain both simple features, which can be directly computed from sentences, and more sophisticated features derived using sentence structure analysis techniques. For example, one set of features was extracted based on the shortest paths between the CPI pairs or among the CPI triplets in the dependency graphs obtained from sentence parsing. We designed a three-stage approach to predict the multiple categories of CPIs. Our method performed the best among systems that use non-deep learning methods and outperformed several deep-learning-based systems in the track 5 of the BioCreative VI challenge. The features we designed in this study are informative and can be applied to other machine learning methods including deep learning.
Collapse
Affiliation(s)
- Pei-Yau Lung
- Department of Statistics, Florida State University, Tallahassee, FL, USA
| | - Zhe He
- School of Information, Florida State University, Tallahassee, FL, USA
| | - Tingting Zhao
- Department of Geography, Florida State University, Tallahassee, FL, USA
| | - Disa Yu
- Department of Statistics, Florida State University, Tallahassee, FL, USA
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL, USA
| |
Collapse
|
8
|
Kordopati V, Salhi A, Razali R, Radovanovic A, Tifratene F, Uludag M, Li Y, Bokhari A, AlSaieedi A, Bin Raies A, Van Neste C, Essack M, Bajic VB. DES-Mutation: System for Exploring Links of Mutations and Diseases. Sci Rep 2018; 8:13359. [PMID: 30190574 PMCID: PMC6127254 DOI: 10.1038/s41598-018-31439-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2017] [Accepted: 08/17/2018] [Indexed: 12/17/2022] Open
Abstract
During cellular division DNA replicates and this process is the basis for passing genetic information to the next generation. However, the DNA copy process sometimes produces a copy that is not perfect, that is, one with mutations. The collection of all such mutations in the DNA copy of an organism makes it unique and determines the organism’s phenotype. However, mutations are often the cause of diseases. Thus, it is useful to have the capability to explore links between mutations and disease. We approached this problem by analyzing a vast amount of published information linking mutations to disease states. Based on such information, we developed the DES-Mutation knowledgebase which allows for exploration of not only mutation-disease links, but also links between mutations and concepts from 27 topic-specific dictionaries such as human genes/proteins, toxins, pathogens, etc. This allows for a more detailed insight into mutation-disease links and context. On a sample of 600 mutation-disease associations predicted and curated, our system achieves precision of 72.83%. To demonstrate the utility of DES-Mutation, we provide case studies related to known or potentially novel information involving disease mutations. To our knowledge, this is the first mutation-disease knowledgebase dedicated to the exploration of this topic through text-mining and data-mining of different mutation types and their associations with terms from multiple thematic dictionaries.
Collapse
Affiliation(s)
- Vasiliki Kordopati
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Adil Salhi
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Rozaimi Razali
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Aleksandar Radovanovic
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Faroug Tifratene
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Mahmut Uludag
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Yu Li
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Ameerah Bokhari
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Ahdab AlSaieedi
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia.,King Abdulaziz University (KAU), Faculty of Applied Medical Sciences (FAMS), Department of Medical Laboratory Technology (MLT), Jeddah, 21589-80324, Saudi Arabia
| | - Arwa Bin Raies
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Christophe Van Neste
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia.,Ghent University, Center for Medical Genetics Ghent (CMGG), B-9000, Ghent, Belgium
| | - Magbubah Essack
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Vladimir B Bajic
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia.
| |
Collapse
|
9
|
Salhi A, Negrão S, Essack M, Morton MJL, Bougouffa S, Razali R, Radovanovic A, Marchand B, Kulmanov M, Hoehndorf R, Tester M, Bajic VB. DES-TOMATO: A Knowledge Exploration System Focused On Tomato Species. Sci Rep 2017; 7:5968. [PMID: 28729549 PMCID: PMC5519719 DOI: 10.1038/s41598-017-05448-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2017] [Accepted: 05/25/2017] [Indexed: 12/29/2022] Open
Abstract
Tomato is the most economically important horticultural crop used as a model to study plant biology and particularly fruit development. Knowledge obtained from tomato research initiated improvements in tomato and, being transferrable to other such economically important crops, has led to a surge of tomato-related research and published literature. We developed DES-TOMATO knowledgebase (KB) for exploration of information related to tomato. Information exploration is enabled through terms from 26 dictionaries and combination of these terms. To illustrate the utility of DES-TOMATO, we provide several examples how one can efficiently use this KB to retrieve known or potentially novel information. DES-TOMATO is free for academic and nonprofit users and can be accessed at http://cbrc.kaust.edu.sa/des_tomato/, using any of the mainstream web browsers, including Firefox, Safari and Chrome.
Collapse
Affiliation(s)
- Adil Salhi
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Sónia Negrão
- King Abdullah University of Science and Technology (KAUST), Division of Biological and Environmental Sciences and Engineering, Thuwal, 23955-6900, Saudi Arabia
| | - Magbubah Essack
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Mitchell J L Morton
- King Abdullah University of Science and Technology (KAUST), Division of Biological and Environmental Sciences and Engineering, Thuwal, 23955-6900, Saudi Arabia
| | - Salim Bougouffa
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Rozaimi Razali
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Aleksandar Radovanovic
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | | | - Maxat Kulmanov
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
| | - Robert Hoehndorf
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia
- King Abdullah University of Science and Technology (KAUST), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Thuwal, 23955-6900, Saudi Arabia
| | - Mark Tester
- King Abdullah University of Science and Technology (KAUST), Division of Biological and Environmental Sciences and Engineering, Thuwal, 23955-6900, Saudi Arabia
| | - Vladimir B Bajic
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955-6900, Saudi Arabia.
- King Abdullah University of Science and Technology (KAUST), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Thuwal, 23955-6900, Saudi Arabia.
| |
Collapse
|
10
|
Snider J, Kotlyar M, Saraon P, Yao Z, Jurisica I, Stagljar I. Fundamentals of protein interaction network mapping. Mol Syst Biol 2015; 11:848. [PMID: 26681426 PMCID: PMC4704491 DOI: 10.15252/msb.20156351] [Citation(s) in RCA: 201] [Impact Index Per Article: 20.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
Studying protein interaction networks of all proteins in an organism (“interactomes”) remains one of the major challenges in modern biomedicine. Such information is crucial to understanding cellular pathways and developing effective therapies for the treatment of human diseases. Over the past two decades, diverse biochemical, genetic, and cell biological methods have been developed to map interactomes. In this review, we highlight basic principles of interactome mapping. Specifically, we discuss the strengths and weaknesses of individual assays, how to select a method appropriate for the problem being studied, and provide general guidelines for carrying out the necessary follow‐up analyses. In addition, we discuss computational methods to predict, map, and visualize interactomes, and provide a summary of some of the most important interactome resources. We hope that this review serves as both a useful overview of the field and a guide to help more scientists actively employ these powerful approaches in their research.
Collapse
Affiliation(s)
- Jamie Snider
- Donnelly Centre, Department of Biochemistry, Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
| | - Max Kotlyar
- Princess Margaret Cancer Center, IBM Life Sciences Discovery Centre, University Health Network, Ontario, Canada
| | - Punit Saraon
- Donnelly Centre, Department of Biochemistry, Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
| | - Zhong Yao
- Donnelly Centre, Department of Biochemistry, Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
| | - Igor Jurisica
- Princess Margaret Cancer Center, IBM Life Sciences Discovery Centre, University Health Network, Ontario, Canada
| | - Igor Stagljar
- Donnelly Centre, Department of Biochemistry, Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
| |
Collapse
|
11
|
Gonzalez GH, Tahsin T, Goodale BC, Greene AC, Greene CS. Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery. Brief Bioinform 2015; 17:33-42. [PMID: 26420781 PMCID: PMC4719073 DOI: 10.1093/bib/bbv087] [Citation(s) in RCA: 103] [Impact Index Per Article: 10.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2015] [Indexed: 02/06/2023] Open
Abstract
Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine.
Collapse
|
12
|
Palleja A, Jensen LJ. HOODS: finding context-specific neighborhoods of proteins, chemicals and diseases. PeerJ 2015; 3:e1057. [PMID: 26157625 PMCID: PMC4493695 DOI: 10.7717/peerj.1057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2013] [Accepted: 06/05/2015] [Indexed: 11/20/2022] Open
Abstract
Clustering algorithms are often used to find groups relevant in a specific context; however, they are not informed about this context. We present a simple algorithm, HOODS, which identifies context-specific neighborhoods of entities from a similarity matrix and a list of entities specifying the context. We illustrate its applicability by finding disease-specific neighborhoods of functionally associated proteins, kinase-specific neighborhoods of structurally similar inhibitors, and physiological-system-specific neighborhoods of interconnected diseases. HOODS can be used via a simple interface at http://hoods.jensenlab.org, from where the source code can also be downloaded.
Collapse
Affiliation(s)
- Albert Palleja
- The Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen , Copenhagen N , Denmark ; The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen , Copenhagen Ø , Denmark
| | - Lars J Jensen
- The Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen , Copenhagen N , Denmark
| |
Collapse
|
13
|
Subramani S, Kalpana R, Monickaraj PM, Natarajan J. HPIminer: A text mining system for building and visualizing human protein interaction networks and pathways. J Biomed Inform 2015; 54:121-31. [PMID: 25659452 DOI: 10.1016/j.jbi.2015.01.006] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2013] [Revised: 01/13/2015] [Accepted: 01/15/2015] [Indexed: 12/26/2022]
Abstract
The knowledge on protein-protein interactions (PPI) and their related pathways are equally important to understand the biological functions of the living cell. Such information on human proteins is highly desirable to understand the mechanism of several diseases such as cancer, diabetes, and Alzheimer's disease. Because much of that information is buried in biomedical literature, an automated text mining system for visualizing human PPI and pathways is highly desirable. In this paper, we present HPIminer, a text mining system for visualizing human protein interactions and pathways from biomedical literature. HPIminer extracts human PPI information and PPI pairs from biomedical literature, and visualize their associated interactions, networks and pathways using two curated databases HPRD and KEGG. To our knowledge, HPIminer is the first system to build interaction networks from literature as well as curated databases. Further, the new interactions mined only from literature and not reported earlier in databases are highlighted as new. A comparative study with other similar tools shows that the resultant network is more informative and provides additional information on interacting proteins and their associated networks.
Collapse
Affiliation(s)
- Suresh Subramani
- Data Mining and Text Mining Laboratory, Department of Bioinformatics, School of Life Sciences, Bharathiar University, Tamil Nadu, India.
| | - Raja Kalpana
- Data Mining and Text Mining Laboratory, Department of Bioinformatics, School of Life Sciences, Bharathiar University, Tamil Nadu, India.
| | | | - Jeyakumar Natarajan
- Data Mining and Text Mining Laboratory, Department of Bioinformatics, School of Life Sciences, Bharathiar University, Tamil Nadu, India.
| |
Collapse
|
14
|
CoCiter: an efficient tool to infer gene function by assessing the significance of literature co-citation. PLoS One 2013; 8:e74074. [PMID: 24086311 PMCID: PMC3781068 DOI: 10.1371/journal.pone.0074074] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2012] [Accepted: 07/30/2013] [Indexed: 01/17/2023] Open
Abstract
A routine approach to inferring functions for a gene set is by using function enrichment analysis based on GO, KEGG or other curated terms and pathways. However, such analysis requires the existence of overlapping genes between the query gene set and those annotated by GO/KEGG. Furthermore, GO/KEGG databases only maintain a very restricted vocabulary. Here, we have developed a tool called "CoCiter" based on literature co-citations to address the limitations in conventional function enrichment analysis. Co-citation analysis is widely used in ranking articles and predicting protein-protein interactions (PPIs). Our algorithm can further assess the co-citation significance of a gene set with any other user-defined gene sets, or with free terms. We show that compared with the traditional approaches, CoCiter is a more accurate and flexible function enrichment analysis method. CoCiter is freely available at www.picb.ac.cn/hanlab/cociter/.
Collapse
|
15
|
Adding protein context to the human protein-protein interaction network to reveal meaningful interactions. PLoS Comput Biol 2013; 9:e1002860. [PMID: 23300433 PMCID: PMC3536619 DOI: 10.1371/journal.pcbi.1002860] [Citation(s) in RCA: 63] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2012] [Accepted: 11/09/2012] [Indexed: 01/31/2023] Open
Abstract
Interactions of proteins regulate signaling, catalysis, gene expression and many other cellular functions. Therefore, characterizing the entire human interactome is a key effort in current proteomics research. This challenge is complicated by the dynamic nature of protein-protein interactions (PPIs), which are conditional on the cellular context: both interacting proteins must be expressed in the same cell and localized in the same organelle to meet. Additionally, interactions underlie a delicate control of signaling pathways, e.g. by post-translational modifications of the protein partners - hence, many diseases are caused by the perturbation of these mechanisms. Despite the high degree of cell-state specificity of PPIs, many interactions are measured under artificial conditions (e.g. yeast cells are transfected with human genes in yeast two-hybrid assays) or even if detected in a physiological context, this information is missing from the common PPI databases. To overcome these problems, we developed a method that assigns context information to PPIs inferred from various attributes of the interacting proteins: gene expression, functional and disease annotations, and inferred pathways. We demonstrate that context consistency correlates with the experimental reliability of PPIs, which allows us to generate high-confidence tissue- and function-specific subnetworks. We illustrate how these context-filtered networks are enriched in bona fide pathways and disease proteins to prove the ability of context-filters to highlight meaningful interactions with respect to various biological questions. We use this approach to study the lung-specific pathways used by the influenza virus, pointing to IRAK1, BHLHE40 and TOLLIP as potential regulators of influenza virus pathogenicity, and to study the signalling pathways that play a role in Alzheimer's disease, identifying a pathway involving the altered phosphorylation of the Tau protein. Finally, we provide the annotated human PPI network via a web frontend that allows the construction of context-specific networks in several ways. Protein-protein-interactions (PPIs) participate in virtually all biological processes. However, the PPI map is not static but the pairs of proteins that interact depends on the type of cell, the subcellular localization and modifications of the participating proteins, among many other factors. Therefore, it is important to understand the specific conditions under which a PPI happens. Unfortunately, experimental methods often do not provide this information or, even worse, measure PPIs under artificial conditions not found in biological systems. We developed a method to infer this missing information from properties of the interacting proteins, such as in which cell types the proteins are found, which functions they fulfill and whether they are known to play a role in disease. We show that PPIs for which we can infer conditions under which they happen have a higher experimental reliability. Also, our inference agrees well with known pathways and disease proteins. Since diseases usually affect specific cell types, we study PPI networks of influenza proteins in lung tissues and of Alzheimer's disease proteins in neural tissues. In both cases, we can highlight interesting interactions potentially playing a role in disease progression.
Collapse
|
16
|
Affiliation(s)
- Nagasuma Chandra
- Indian Institute of Science, Department of Biochemistry,
Bangalore – 560012, India ,
| | - Jyothi Padiadpu
- Indian Institute of Science, Department of Biochemistry,
Bangalore – 560012, India
| |
Collapse
|