1
Ru X, Ye X, Sakurai T, Zou Q. Application of learning to rank in bioinformatics tasks. Brief Bioinform 2021; 22:6102666. [PMID: 33454758 DOI: 10.1093/bib/bbaa394] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/03/2020] [Revised: 11/09/2020] [Accepted: 11/24/2020] [Indexed: 12/17/2022]
Abstract
Over the past decades, learning to rank (LTR) algorithms have been gradually applied to bioinformatics. Such methods have shown significant advantages in multiple research tasks in this field. Therefore, it is necessary to summarize and discuss the application of these algorithms so that these algorithms are convenient and contribute to bioinformatics. In this paper, the characteristics of LTR algorithms and their strengths over other types of algorithms are analyzed based on the application of multiple perspectives in bioinformatics. Finally, the paper further discusses the shortcomings of the LTR algorithms, the methods and means to better use the algorithms and some open problems that currently exist.
Affiliation(s)
- Xiucai Ye
- Department of Computer Science and Center for Artificial Intelligence Research (C-AIR), University of Tsukuba
- Quan Zou
- University of Electronic Science and Technology of China
2
Dolgikh E, Watson IA, Desai PV, Sawada GA, Morton S, Jones TM, Raub TJ. QSAR Model of Unbound Brain-to-Plasma Partition Coefficient, Kp,uu,brain: Incorporating P-glycoprotein Efflux as a Variable. J Chem Inf Model 2016; 56:2225-2233. [PMID: 27684523 DOI: 10.1021/acs.jcim.6b00229] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Indexed: 12/31/2022]
Abstract
We report development and prospective validation of a QSAR model of the unbound brain-to-plasma partition coefficient, Kp,uu,brain, based on the in-house data set of ∼1000 compounds. We discuss effects of experimental variability, explore the applicability of both regression and classification approaches, and evaluate a novel, model-within-a-model approach of including P-glycoprotein efflux prediction as an additional variable. When tested on an independent test set of 91 internal compounds, incorporation of P-glycoprotein efflux information significantly improves the model performance resulting in an R² of 0.53, RMSE of 0.57, Spearman's Rho correlation coefficient of 0.73, and qualitative prediction accuracy of 0.8 (kappa = 0.6). In addition to improving the performance, one of the key advantages of this approach is the larger chemical space coverage provided indirectly through incorporation of the in vitro, higher throughput data set that is 4 times larger than the in vivo data set.
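Illustrative note (not part of the cited abstract): the four evaluation statistics named above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: R² is taken to be the coefficient of determination (the paper may define it differently, for example as a squared correlation), kappa is taken to be Cohen's kappa, and all function names and the toy values in the usage note are hypothetical.

```python
import math
from collections import Counter


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)
```

For example, with hypothetical class labels `["hi", "hi", "lo", "lo"]` (observed) and `["hi", "lo", "lo", "lo"]` (predicted), the observed agreement is 0.75, the chance agreement is 0.5, and `cohen_kappa` returns 0.5.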
Affiliation(s)
- Elena Dolgikh
- Global Scientific Informatics, ‡Advanced Analytics, §Computational ADME, ∥IT Informatics and ⊥Drug Disposition, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Ian A Watson
- Global Scientific Informatics, ‡Advanced Analytics, §Computational ADME, ∥IT Informatics and ⊥Drug Disposition, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Prashant V Desai
- Global Scientific Informatics, ‡Advanced Analytics, §Computational ADME, ∥IT Informatics and ⊥Drug Disposition, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Geri A Sawada
- Global Scientific Informatics, ‡Advanced Analytics, §Computational ADME, ∥IT Informatics and ⊥Drug Disposition, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Stuart Morton
- Global Scientific Informatics, ‡Advanced Analytics, §Computational ADME, ∥IT Informatics and ⊥Drug Disposition, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Timothy M Jones
- Global Scientific Informatics, ‡Advanced Analytics, §Computational ADME, ∥IT Informatics and ⊥Drug Disposition, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
- Thomas J Raub
- Global Scientific Informatics, ‡Advanced Analytics, §Computational ADME, ∥IT Informatics and ⊥Drug Disposition, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
3
Fluck J, Madan S, Ansari S, Kodamullil AT, Karki R, Rastegar-Mojarad M, Catlett NL, Hayes W, Szostak J, Hoeng J, Peitsch M. Training and evaluation corpora for the extraction of causal relationships encoded in biological expression language (BEL). DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw113. [PMID: 27554092 PMCID: PMC4995071 DOI: 10.1093/database/baw113] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 12/23/2015] [Accepted: 07/07/2016] [Indexed: 01/21/2023]
Abstract
Success in extracting biological relationships is mainly dependent on the complexity of the task as well as the availability of high-quality training data. Here, we describe the new corpora in the systems biology modeling language BEL for training and testing biological relationship extraction systems that we prepared for the BioCreative V BEL track. BEL was designed to capture relationships not only between proteins or chemicals, but also complex events such as biological processes or disease states. A BEL nanopub is the smallest unit of information and represents a biological relationship with its provenance. In BEL relationships (called BEL statements), the entities are normalized to defined namespaces mainly derived from public repositories, such as sequence databases, MeSH or publicly available ontologies. In the BEL nanopubs, the BEL statements are associated with citation information and supportive evidence such as a text excerpt. To enable the training of extraction tools, we prepared BEL resources and made them available to the community. We selected a subset of these resources focusing on a reduced set of namespaces, namely, human and mouse genes, ChEBI chemicals, MeSH diseases and GO biological processes, as well as relationship types ‘increases’ and ‘decreases’. The published training corpus contains 11 000 BEL statements from over 6000 supportive text excerpts. For method evaluation, we selected and re-annotated two smaller subcorpora containing 100 text excerpts. For this re-annotation, the inter-annotator agreement was measured by the BEL track evaluation environment and resulted in a maximal F-score of 91.18% for full statement agreement. In addition, for a set of 100 BEL statements, we do not only provide the gold standard expert annotations, but also text excerpts pre-selected by two automated systems. Those text excerpts were evaluated and manually annotated as true or false supportive in the course of the BioCreative V BEL track task. 
Database URL: http://wiki.openbel.org/display/BIOC/Datasets
Affiliation(s)
- Juliane Fluck
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- Sumit Madan
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- Sam Ansari
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
- Alpha T Kodamullil
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- Reagon Karki
- Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- William Hayes
- Selventa, One Alewife Center, Cambridge, MA 02140, USA
- Justyna Szostak
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
- Julia Hoeng
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
- Manuel Peitsch
- Philip Morris International R&D, Philip Morris Products S.A, Quai Jeanrenaud 5, Neuchâtel, 2000, Switzerland
4
Pérez-Pérez M, Pérez-Rodríguez G, Rabal O, Vazquez M, Oyarzabal J, Fdez-Riverola F, Valencia A, Krallinger M, Lourenço A. The Markyt visualisation, prediction and benchmark platform for chemical and gene entity recognition at BioCreative/CHEMDNER challenge. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw120. [PMID: 27542845 PMCID: PMC5001550 DOI: 10.1093/database/baw120] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 04/25/2016] [Accepted: 08/02/2016] [Indexed: 01/08/2023]
Abstract
Biomedical text mining methods and technologies have improved significantly in the last decade. Considerable efforts have been invested in understanding the main challenges of biomedical literature retrieval and extraction and proposing solutions to problems of practical interest. Most notably, community-oriented initiatives such as the BioCreative challenge have enabled controlled environments for the comparison of automatic systems while pursuing practical biomedical tasks. Under this scenario, the present work describes the Markyt Web-based document curation platform, which has been implemented to support the visualisation, prediction and benchmark of chemical and gene mention annotations at BioCreative/CHEMDNER challenge. Creating this platform is an important step for the systematic and public evaluation of automatic prediction systems and the reusability of the knowledge compiled for the challenge. Markyt was not only critical to support the manual annotation and annotation revision process but also facilitated the comparative visualisation of automated results against the manually generated Gold Standard annotations and comparative assessment of generated results. We expect that future biomedical text mining challenges and the text mining community may benefit from the Markyt platform to better explore and interpret annotations and improve automatic system predictions. Database URL: http://www.markyt.org, https://github.com/sing-group/Markyt.
Affiliation(s)
- Martin Pérez-Pérez
- ESEI - Department of Computer Science, University of Vigo, Ourense, Spain
- Obdulia Rabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Pamplona, Spain
- Miguel Vazquez
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Madrid, Spain
- Julen Oyarzabal
- Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Pamplona, Spain
- Alfonso Valencia
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Madrid, Spain
- Martin Krallinger
- Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, Madrid, Spain
- Anália Lourenço
- ESEI - Department of Computer Science, University of Vigo, Ourense, Spain; Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Pamplona, Spain
5
Leaman R, Wei CH, Zou C, Lu Z. Mining chemical patents with an ensemble of open systems. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw065. [PMID: 27173521 PMCID: PMC4865327 DOI: 10.1093/database/baw065] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 12/10/2015] [Accepted: 04/11/2016] [Indexed: 11/30/2022]
Abstract
The significant amount of medicinal chemistry information contained in patents makes them an attractive target for text mining. In this manuscript, we describe systems for named entity recognition (NER) of chemicals and genes/proteins in patents, using the CEMP (for chemicals) and GPRO (for genes/proteins) corpora provided by the CHEMDNER task at BioCreative V. Our chemical NER system is an ensemble of five open systems, including both versions of tmChem, our previous work on chemical NER. Their output is combined using a machine learning classification approach. Our chemical NER system obtained 0.8752 precision and 0.9129 recall, for 0.8937 f-score on the CEMP task. Our gene/protein NER system is an extension of our previous work for gene and protein NER, GNormPlus. This system obtained a performance of 0.8143 precision and 0.8141 recall, for 0.8137 f-score on the GPRO task. Both systems achieved the highest performance in their respective tasks at BioCreative V. We conclude that an ensemble of independently-created open systems is sufficiently diverse to significantly improve performance over any individual system, even when they use a similar approach. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/.
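Illustrative note (not part of the cited abstract): the F-score reported for the chemical NER system is consistent with the harmonic mean of the stated precision and recall. The sketch below assumes the standard F-beta definition with beta = 1; the function name `f_score` is hypothetical.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of the chemical NER system on the CEMP task,
# as quoted in the abstract above.
cemp_f1 = f_score(0.8752, 0.9129)
print(round(cemp_f1, 4))  # 0.8937
```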
Affiliation(s)
- Robert Leaman
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD, USA
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD, USA
- Cherry Zou
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD, USA; Poolesville High School, 17501 W Willard Rd, Poolesville, MD, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD, USA
6
Akhondi SA, Pons E, Afzal Z, van Haagen H, Becker BFH, Hettne KM, van Mulligen EM, Kors JA. Chemical entity recognition in patents by combining dictionary-based and statistical approaches. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw061. [PMID: 27141091 PMCID: PMC4852402 DOI: 10.1093/database/baw061] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 12/03/2015] [Accepted: 04/03/2016] [Indexed: 11/13/2022]
Abstract
We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small. Database URL: http://biosemantics.org/chemdner-patents
Affiliation(s)
- Saber A Akhondi
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
- Ewoud Pons
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
- Zubair Afzal
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
- Herman van Haagen
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
- Benedikt F H Becker
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
- Kristina M Hettne
- Department of Human Genetics, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands
- Erik M van Mulligen
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
- Jan A Kors
- Department of Medical Informatics, Erasmus University Medical Center, PO Box 2040, 3000 CA Rotterdam
7
Wei CH, Leaman R, Lu Z. Beyond accuracy: creating interoperable and scalable text-mining web services. Bioinformatics 2016; 32:1907-10. [PMID: 26883486 DOI: 10.1093/bioinformatics/btv760] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Received: 08/25/2015] [Accepted: 12/21/2015] [Indexed: 11/13/2022]
Abstract
The biomedical literature is a knowledge-rich resource and an important foundation for future research. With over 24 million articles in PubMed and an increasing growth rate, research in automated text processing is becoming increasingly important. We report here our recently developed web-based text mining services for biomedical concept recognition and normalization. Unlike most text-mining software tools, our web services integrate several state-of-the-art entity tagging systems (DNorm, GNormPlus, SR4GN, tmChem and tmVar) and offer a batch-processing mode able to process arbitrary text input (e.g. scholarly publications, patents and medical records) in multiple formats (e.g. BioC). We support multiple standards to make our service interoperable and allow simpler integration with other text-processing pipelines. To maximize scalability, we have preprocessed all PubMed articles, and use a computer cluster for processing large requests of arbitrary text.
Availability and implementation: Our text-mining web service is freely available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#curl
Contact: Zhiyong.Lu@nih.gov
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA
8
Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform 2015; 17:132-44. [PMID: 25935162 DOI: 10.1093/bib/bbv024] [Citation(s) in RCA: 97] [Impact Index Per Article: 10.8] [Received: 01/29/2015] [Indexed: 11/13/2022]
Abstract
One effective way to improve the state of the art is through competitions. Following the success of the Critical Assessment of protein Structure Prediction (CASP) in bioinformatics research, a number of challenge evaluations have been organized by the text-mining research community to assess and advance natural language processing (NLP) research for biomedicine. In this article, we review the different community challenge evaluations held from 2002 to 2014 and their respective tasks. Furthermore, we examine these challenge tasks through their targeted problems in NLP research and biomedical applications, respectively. Next, we describe the general workflow of organizing a Biomedical NLP (BioNLP) challenge and involved stakeholders (task organizers, task data producers, task participants and end users). Finally, we summarize the impact and contributions by taking into account different BioNLP challenges as a whole, followed by a discussion of their limitations and difficulties. We conclude with future trends in BioNLP challenge evaluations.