1
Rey CA, Danguilan JL, Mendoza KP, Remolona MF. Transformer-based approach to variable typing. Heliyon 2023; 9:e20505. [PMID: 37842594 PMCID: PMC10568320 DOI: 10.1016/j.heliyon.2023.e20505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 09/19/2023] [Accepted: 09/27/2023] [Indexed: 10/17/2023] Open
Abstract
The surge of diverse research efforts across scientific fields has propelled Big Data in the scientific domain. Despite advances in management systems, researchers find that mathematical knowledge remains among the most challenging kinds of knowledge to manage because of its inherent heterogeneity. One novel recourse being explored is variable typing, where current work remains preliminary and thus leaves wide room for contribution. In this study, a first attempt to implement an end-to-end Entity Recognition (ER) and Relation Extraction (RE) approach to variable typing was made using the BERT (Bidirectional Encoder Representations from Transformers) model. A micro-dataset was developed for this process. According to our findings, the ER and RE models have, respectively, Precision of 0.8142 and 0.4919, Recall of 0.7816 and 0.6030, and F1-scores of 0.7975 and 0.5418. Despite the limited dataset, the models performed on par with values reported in the literature. This work also discusses the factors affecting this BERT-based approach, leading to suggestions for future implementations.
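As a quick consistency check on the figures in this abstract, the F1-score is the harmonic mean of precision and recall. The sketch below is not the authors' evaluation code; the function name `f1_score` is an illustrative assumption, and only the precision and recall values come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
er_f1 = f1_score(0.8142, 0.7816)  # Entity Recognition model
re_f1 = f1_score(0.4919, 0.6030)  # Relation Extraction model

print(f"ER F1 = {er_f1:.4f}, RE F1 = {re_f1:.4f}")
```

Both recomputed values agree with the reported F1-scores (0.7975 and 0.5418) to within rounding of the published precision and recall.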
Affiliation(s)
- Charles Arthel Rey: Chemical Engineering Intelligence Learning Laboratory, Department of Chemical Engineering, University of the Philippines Diliman, Quezon City, 1101 Philippines
- Jose Lorenzo Danguilan: Chemical Engineering Intelligence Learning Laboratory, Department of Chemical Engineering, University of the Philippines Diliman, Quezon City, 1101 Philippines
- Karl Patrick Mendoza: Chemical Engineering Intelligence Learning Laboratory, Department of Chemical Engineering, University of the Philippines Diliman, Quezon City, 1101 Philippines
- Miguel Francisco Remolona: Chemical Engineering Intelligence Learning Laboratory, Department of Chemical Engineering, University of the Philippines Diliman, Quezon City, 1101 Philippines
2
De Chiara F, Ferret-Miñana A, Ramón-Azcón J. The Synergy between Organ-on-a-Chip and Artificial Intelligence for the Study of NAFLD: From Basic Science to Clinical Research. Biomedicines 2021; 9:248. [PMID: 33801289 PMCID: PMC7999375 DOI: 10.3390/biomedicines9030248] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/29/2021] [Revised: 02/20/2021] [Accepted: 02/25/2021] [Indexed: 12/15/2022] Open
Abstract
Non-alcoholic fatty liver disease (NAFLD) affects about 25% of the global adult population. In the long term, it is associated with extra-hepatic complications, multiorgan failure, and death. Various invasive and non-invasive methods are employed for its diagnosis, such as liver biopsies, CT scans, MRI, and numerous scoring systems. However, their lack of accuracy and reproducibility represents one of the biggest limitations in evaluating the effectiveness of drug candidates in clinical trials. Organ-on-chips (OOC) are emerging as a cost-effective tool to reproduce in vitro the main pathogenic features of NAFLD for drug screening purposes. These platforms have reached a high degree of complexity and generate an unprecedented amount of both structured and unstructured data, which has outpaced our capacity to analyze the results. Adding an artificial intelligence (AI) layer for data analysis and interpretation enables these platforms to reach their full potential. Furthermore, their use does not require any ethical or legal regulation. In this review, we discuss the synergy between OOC and AI as one of the most promising ways to unveil potential therapeutic targets as well as the complex mechanism(s) underlying NAFLD.
Affiliation(s)
- Francesco De Chiara: Biosensors for Bioengineering Group, Institute for Bioengineering of Catalonia (IBEC), The Barcelona Institute of Science and Technology (BIST), Baldiri I Reixac 10–12, 08028 Barcelona, Spain
- Ainhoa Ferret-Miñana: Biosensors for Bioengineering Group, Institute for Bioengineering of Catalonia (IBEC), The Barcelona Institute of Science and Technology (BIST), Baldiri I Reixac 10–12, 08028 Barcelona, Spain
- Javier Ramón-Azcón: Biosensors for Bioengineering Group, Institute for Bioengineering of Catalonia (IBEC), The Barcelona Institute of Science and Technology (BIST), Baldiri I Reixac 10–12, 08028 Barcelona, Spain; ICREA-Institució Catalana de Recerca i Estudis Avançats, 08010 Barcelona, Spain
3
Piereck B, Oliveira-Lima M, Benko-Iseppon AM, Diehl S, Schneider R, Brasileiro-Vidal AC, Barbosa-Silva A. LAITOR4HPC: A text mining pipeline based on HPC for building interaction networks. BMC Bioinformatics 2020; 21:365. [PMID: 32838742 PMCID: PMC7447576 DOI: 10.1186/s12859-020-03620-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/14/2019] [Accepted: 06/19/2020] [Indexed: 11/11/2022] Open
Abstract
Background: The number of published full-text articles has increased dramatically. Text mining tools constitute an essential approach to building biological networks, updating databases and providing annotation for new pathways. PESCADOR is an online web server based on the LAITOR and NLProt text mining tools, which retrieves protein-protein co-occurrences in a tabular format and adds a network schema. Here we present an HPC-oriented version of PESCADOR’s native text mining tool, renamed LAITOR4HPC, which aims to process an unlimited number of abstracts in a short time to enrich available networks, build new ones and possibly highlight whether fields of research have been exhaustively studied.
Results: By taking advantage of parallel computing on HPC infrastructure, the full collection of MEDLINE abstracts available until June 2017 was analyzed in a much shorter period (6 days) than the original online implementation would need (an estimated 2 years for the same data). Additionally, three case studies were presented to illustrate possible uses of LAITOR4HPC. The first case study targeted soybean and was used to retrieve an overview of published co-occurrences in a single organism, retrieving 15,788 proteins in 7894 co-occurrences. In the second case study, a target gene family was searched in many organisms by analyzing 15 species under biotic stress. Most co-occurrences concerned Arabidopsis thaliana and Zea mays. The third case study concerned the construction and enrichment of an available pathway. Choosing A. thaliana for further analysis, the defensin pathway was enriched, showing additional signaling and regulation molecules and how they respond to each other in the modulation of this complex plant defense response.
Conclusions: LAITOR4HPC can be used for efficient text-mining-based construction of biological networks derived from big data sources, such as MEDLINE abstracts. Time consumption and data input limitations will depend on the available resources at the HPC facility. LAITOR4HPC offers enough flexibility for different approaches and data volumes targeted to an organism, a subject, or a specific pathway. Additionally, it can deliver comprehensive results in which interactions are classified into four types according to their reliability.
Affiliation(s)
- Bruna Piereck: Genetics Department, Laboratório de Genética e Biologia Vegetal, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil
- Marx Oliveira-Lima: Genetics Department, Laboratório de Genética e Biologia Vegetal, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil
- Ana Maria Benko-Iseppon: Genetics Department, Laboratório de Genética e Biologia Vegetal, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil
- Sarah Diehl: University of Luxembourg, Luxembourg Centre for Systems Biomedicine, Bioinformatics Core, Esch-sur-Alzette, Luxembourg
- Reinhard Schneider: University of Luxembourg, Luxembourg Centre for Systems Biomedicine, Bioinformatics Core, Esch-sur-Alzette, Luxembourg
- Ana Christina Brasileiro-Vidal: Genetics Department, Laboratório de Genética e Biologia Vegetal, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil
- Adriano Barbosa-Silva: University of Luxembourg, Luxembourg Centre for Systems Biomedicine, Bioinformatics Core, Esch-sur-Alzette, Luxembourg; Queen Mary University of London, Centre for Translational Bioinformatics, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Charterhouse Square, London, UK
5
Buchkremer R, Demund A, Ebener S, Gampfer F, Jagering D, Jurgens A, Klenke S, Krimpmann D, Schmank J, Spiekermann M, Wahlers M, Wiepke M. The Application of Artificial Intelligence Technologies as a Substitute for Reading and to Support and Enhance the Authoring of Scientific Review Articles. IEEE Access 2019; 7:65263-65276. [DOI: 10.1109/access.2019.2917719] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 08/29/2023]
7
Rezaeian M, Montazeri H, Loonen R. Science foresight using life-cycle analysis, text mining and clustering: A case study on natural ventilation. Technological Forecasting and Social Change 2017; 118:270-280. [PMID: 32287406 PMCID: PMC7126682 DOI: 10.1016/j.techfore.2017.02.027] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 04/18/2015] [Revised: 02/09/2017] [Accepted: 02/21/2017] [Indexed: 06/11/2023]
Abstract
Science foresight comprises a range of methods to analyze past, present and expected research trends, and uses this information to predict the future status of different fields of science and technology. With the ability to identify high-potential development directions, science foresight can be a useful tool to support the management and planning of future research activities. Science foresight analysts can choose from a rather large variety of approaches. There is, however, relatively little information about how the various approaches can be applied effectively. This paper describes a three-step methodological framework for science foresight on the basis of published research papers, consisting of (i) life-cycle analysis, (ii) text mining and (iii) knowledge gap identification by means of automated clustering. The three steps are connected using the research methodology of the research papers, as identified by text mining. The potential of combining these three steps in one framework is illustrated by analyzing scientific literature on wind catchers, a natural ventilation concept that has received considerable attention from academia but has seen little application in practice. The knowledge gaps that are identified show that the automated foresight analysis is indeed able to find uncharted research areas. Results from a sensitivity analysis further show the importance of using full texts for text mining instead of only title, keywords and abstract. The paper concludes with a reflection on the methodological framework, and gives directions for its intended use in future studies.
Affiliation(s)
- M. Rezaeian: Faculty of Economics, Management & Accounting, Yazd University, Iran
- H. Montazeri: Building Physics and Services, Department of the Built Environment, Eindhoven University of Technology, The Netherlands; Building Physics Section, Department of Civil Engineering, KU Leuven, Leuven, Belgium
- R.C.G.M. Loonen: Building Physics and Services, Department of the Built Environment, Eindhoven University of Technology, The Netherlands
8
Radom M, Rybarczyk A, Kottmann R, Formanowicz P, Szachniuk M, Glöckner FO, Rebholz-Schuhmann D, Błażewicz J. Poseidon: An information retrieval and extraction system for metagenomic marine science. Ecol Inform 2012. [DOI: 10.1016/j.ecoinf.2012.07.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/28/2022]
9
PONOMARENKO JULIA, ORLOVA GALINA, MERKULOVA TATYANA, VASILIEV GENNADY, PONOMARENKO MIKHAIL. MINING GENOME VARIATION TO ASSOCIATE GENETIC DISEASE WITH MUTATION ALTERATIONS AND ORTHO/PARALOGOUS POLIMORPHYSMS IN TRANSCRIPTION FACTOR BINDING SITE. Int J Artif Intell T 2011. [DOI: 10.1142/s0218213005002284] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
We have developed a system, rSNP_Guide, that predicts transcription factor (TF) binding sites on DNA whose mutation-caused alterations may explain disease penetration. rSNP_Guide takes the alterations detected in the binding of mutant DNA to an unknown TF in disease and, from the DNA sequences, calculates the alterations in known TF sites, so as to select only those known sites whose calculated alterations are most consistent with the detected ones. Our system has been control-tested on SNPs with known site-disease relationships. For practical purposes, two TF sites associated with diseases were predicted and confirmed by the immune assay with anti-TF antibodies. In the case of tumor susceptibility, the GATA site in the second intron of the mouse K-ras gene was correctly predicted; mutation damage of this site causes tumor resistance. In the case of alcohol dependencies and other behavioral diseases, the mutation-caused spurious YY1 site in the sixth intron of the human tryptophan 2,3-dioxygenase (TDO2) gene was successfully predicted. Finally, sixteen undocumented TF sites located in both orthologous and paralogous genes were characterized for the first time by three rates, "present", "weakened" or "absent", with significance estimated by rSNP_Guide relative to six TF sites with known mutation-caused alterations in DNA/TF binding.
Affiliation(s)
- Julia Ponomarenko: Laboratory of Genome Structure, Institute of Cytology and Genetics, 10 Lavrentyev Ave, Novosibirsk, 630090, Russia
- Galina Orlova: Laboratory of Theoretical Genetics, Institute of Cytology and Genetics, 10 Lavrentyev Ave, Novosibirsk, 630090, Russia
- Tatyana Merkulova: Laboratory of Gene Expression Regulation, Institute of Cytology and Genetics, 10 Lavrentyev Ave, Novosibirsk, 630090, Russia
- Gennady Vasiliev: Laboratory of Gene Expression Regulation, Institute of Cytology and Genetics, 10 Lavrentyev Ave, Novosibirsk, 630090, Russia
- Mikhail Ponomarenko: Laboratory of Theoretical Genetics, Institute of Cytology and Genetics, 10 Lavrentyev Ave, Novosibirsk, 630090, Russia
10
Fontaine JF, Priller F, Barbosa-Silva A, Andrade-Navarro MA. Génie: literature-based gene prioritization at multi genomic scale. Nucleic Acids Res 2011; 39:W455-61. [PMID: 21609954 PMCID: PMC3125729 DOI: 10.1093/nar/gkr246] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.4] [Indexed: 11/20/2022] Open
Abstract
Biomedical literature is traditionally used as a way to inform scientists of the relevance of genes in relation to a research topic. However, many genes, especially from poorly studied organisms, are not discussed in the literature. Moreover, a manual and comprehensive summarization of the literature attached to the genes of an organism is in general impossible due to the high number of genes and abstracts involved. We introduce the novel Génie algorithm, which overcomes these problems by evaluating the literature attached to all genes in a genome and to their orthologs according to a selected topic. Génie showed high precision (up to 100%) and the best performance in comparison to other algorithms in most of the benchmarks, especially when high sensitivity was required. Moreover, the prioritization of zebrafish genes involved in heart development, using human and mouse orthologs, showed high enrichment in differentially expressed genes from microarray experiments. The Génie web server supports hundreds of species and millions of genes, and offers novel functionalities. Common run times below a minute, even when analyzing the human genome with hundreds of thousands of literature records, allow the use of Génie in routine lab work. Availability: http://cbdm.mdc-berlin.de/tools/genie/.
Affiliation(s)
- Jean-Fred Fontaine: Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, 13125 Berlin, Germany
11
Frijters R, van Vugt M, Smeets R, van Schaik R, de Vlieg J, Alkema W. Literature mining for the discovery of hidden connections between drugs, genes and diseases. PLoS Comput Biol 2010; 6. [PMID: 20885778 PMCID: PMC2944780 DOI: 10.1371/journal.pcbi.1000943] [Citation(s) in RCA: 120] [Impact Index Per Article: 8.6] [Received: 03/15/2010] [Accepted: 08/26/2010] [Indexed: 01/19/2023] Open
Abstract
The scientific literature represents a rich source for retrieval of knowledge on associations between biomedical concepts such as genes, diseases and cellular processes. A commonly used method to establish relationships between biomedical concepts from literature is co-occurrence. Apart from its use in knowledge retrieval, the co-occurrence method is also well-suited to discover new, hidden relationships between biomedical concepts following a simple ABC-principle, in which A and C have no direct relationship, but are connected via shared B-intermediates. In this paper we describe CoPub Discovery, a tool that mines the literature for new relationships between biomedical concepts. Statistical analysis using ROC curves showed that CoPub Discovery performed well over a wide range of settings and keyword thesauri. We subsequently used CoPub Discovery to search for new relationships between genes, drugs, pathways and diseases. Several of the newly found relationships were validated using independent literature sources. In addition, new predicted relationships between compounds and cell proliferation were validated and confirmed experimentally in an in vitro cell proliferation assay. The results show that CoPub Discovery is able to identify novel associations between genes, drugs, pathways and diseases that have a high probability of being biologically valid. This makes CoPub Discovery a useful tool to unravel the mechanisms behind disease, to find novel drug targets, or to find novel applications for existing drugs. The biomedical literature is an important source of knowledge on the function of genes and on the mechanisms by which these genes regulate cellular processes. Several text mining approaches have been developed to leverage this rich source of information by automatically extracting associations between concepts such as genes, diseases and drugs from a large body of text. 
Here, we describe a new method that extracts novel, not yet recognized associations between genes, diseases, drugs and cellular processes from the biomedical literature. Our method is built on the assumption that even if two concepts do not have a direct connection in literature, they may be functionally related if they are both connected to an overlapping set of concepts. Using this approach we predicted several novel connections between genes, diseases, drugs and pathways. Our results imply that our method is able to predict novel relationships from literature and, most importantly, that these newly identified relationships are biologically relevant. Our method can aid the drug discovery process where it can be used to find novel drug targets, increase insight in mode of action of a drug or find novel applications for known drugs.
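The ABC-principle described in this abstract can be illustrated with a small set-based sketch. This is not CoPub Discovery itself, which the abstract describes as statistically evaluated over keyword thesauri; the function names, the concept labels and the toy documents below are all invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(documents):
    """Map each concept to the set of concepts it co-occurs with in any document."""
    neighbours = defaultdict(set)
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def hidden_links(neighbours, a):
    """Return {C: shared B-intermediates} for every C that never co-occurs with A."""
    candidates = defaultdict(set)
    for b in neighbours[a]:
        for c in neighbours[b]:
            # C must differ from A and must have no direct A-C co-occurrence.
            if c != a and c not in neighbours[a]:
                candidates[c].add(b)
    return dict(candidates)


# Toy corpus (hypothetical): each set holds the concepts found in one abstract.
docs = [
    {"drug_X", "gene_1"},
    {"drug_X", "gene_2"},
    {"gene_1", "disease_Y"},
    {"gene_2", "disease_Y"},
    {"gene_2", "disease_Z"},
]
links = hidden_links(build_cooccurrence(docs), "drug_X")
```

Here `drug_X` never appears together with `disease_Y`, yet both co-occur with `gene_1` and `gene_2`, so `disease_Y` is returned as a candidate hidden connection with two shared intermediates. A real system would rank such candidates with a statistical score rather than list them unweighted.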
Affiliation(s)
- Raoul Frijters: Computational Drug Discovery (CDD), Nijmegen Centre for Molecular Life Sciences (NCMLS), Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands
- Marianne van Vugt: Department of Immune Therapeutics, Schering-Plough, Oss, The Netherlands
- Ruben Smeets: Department of Immune Therapeutics, Schering-Plough, Oss, The Netherlands
- René van Schaik: Department of Molecular Design & Informatics, Schering-Plough, Oss, The Netherlands
- Jacob de Vlieg: Computational Drug Discovery (CDD), Nijmegen Centre for Molecular Life Sciences (NCMLS), Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands; Department of Molecular Design & Informatics, Schering-Plough, Oss, The Netherlands
- Wynand Alkema: Department of Molecular Design & Informatics, Schering-Plough, Oss, The Netherlands
12
Blaschke C, Hoffmann R, Oliveros JC, Valencia A. Extracting information automatically from biological literature. Comp Funct Genomics 2010; 2:310-3. [PMID: 18629239 PMCID: PMC2448400 DOI: 10.1002/cfg.102] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 07/04/2001] [Accepted: 07/27/2001] [Indexed: 11/13/2022] Open
Affiliation(s)
- C Blaschke: Protein Design Group, National Center for Biotechnology, CNB-CSIC, Cantoblanco, Madrid E-28049, Spain
13
Barbosa-Silva A, Soldatos TG, Magalhães ILF, Pavlopoulos GA, Fontaine JF, Andrade-Navarro MA, Schneider R, Ortega JM. LAITOR--Literature Assistant for Identification of Terms co-Occurrences and Relationships. BMC Bioinformatics 2010; 11:70. [PMID: 20122157 PMCID: PMC3098111 DOI: 10.1186/1471-2105-11-70] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Received: 08/07/2009] [Accepted: 02/01/2010] [Indexed: 11/10/2022] Open
Abstract
Background: Biological knowledge is represented in scientific literature that often describes the function of genes/proteins (bioentities) in terms of their interactions (biointeractions). Such bioentities are often related to biological concepts of interest that are specific to a given research field. Therefore, the study of the current literature about a selected topic deposited in public databases facilitates the generation of novel hypotheses associating a set of bioentities with a common context.
Results: We created a text mining system (LAITOR: Literature Assistant for Identification of Terms co-Occurrences and Relationships) that analyses co-occurrences of bioentities, biointeractions, and other biological terms in MEDLINE abstracts. The method accounts for the position of the co-occurring terms within sentences or abstracts. The system detected abstracts mentioning protein-protein interactions in a standard test (BioCreative II IAS test data) with a precision of 0.82-0.89 and a recall of 0.48-0.70. We illustrate the application of LAITOR to the detection of plant response genes in a dataset of 1000 abstracts relevant to the topic.
Conclusions: Text mining tools combining the extraction of interacting bioentities and biological concepts with network displays can be helpful in developing reasonable hypotheses in different scientific backgrounds.
14
Frenz CM, Frenz DA. The Application of Regular Expression-Based Pattern Matching to Profiling the Developmental Factors that Contribute to the Development of the Inner Ear. Advances in Experimental Medicine and Biology 2010; 680:165-71. [DOI: 10.1007/978-1-4419-5913-3_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
15
David J, Irizarry KJL. Using the PubMatrix literature-mining resource to accelerate student-centered learning in a veterinary problem-based learning curriculum. Journal of Veterinary Medical Education 2009; 36:202-208. [PMID: 19625669 DOI: 10.3138/jvme.36.2.202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/28/2023]
Abstract
Problem-based learning (PBL) creates an atmosphere in which veterinary students must take responsibility for their own education. Unlike a traditional curriculum where students receive discipline-specific information by attending formal lectures, PBL is designed to elicit self-directed, student-centered learning such that each student determines (1) what he/she does not know (learning issues), (2) what he/she needs to learn, (3) how he/she will learn it, and (4) what resources he/she will use. One of the biggest challenges facing students in a PBL curriculum is efficient time management while pursuing learning issues. Bioinformatics resources, such as the PubMatrix literature-mining tool, allow access to tremendous amounts of information almost instantaneously. To accelerate student-centered learning it is necessary to include resources that enhance the rate at which students can process biomedical information. Unlike using the PubMed interface directly, the PubMatrix tool enables users to automate queries, allowing up to 1,000 distinct PubMed queries to be executed per single PubMatrix submission. Users may submit multiple PubMatrix queries per session, resulting in the ability to execute tens of thousands of PubMed queries in a single day. The intuitively organized results, which remain accessible from PubMatrix user accounts, enable students to rapidly assimilate and process hundreds of thousands of individual publication records as they relate to the student's specific learning issues and query terms. Subsequently, students can explore substantially more of the biomedical publication landscape per learning issue and spend a greater fraction of their time actively engaged in resolving their learning issues.
Affiliation(s)
- John David: College of Veterinary Medicine, Western University of Health Sciences, Pomona, CA 91766-1854, USA
16
Yoo IH, Song M. Biomedical Ontologies and Text Mining for Biomedicine and Healthcare: A Survey. ACTA ACUST UNITED AC 2008. [DOI: 10.5626/jcse.2008.2.2.109] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/03/2022]
17
Fang YC, Huang HC, Juan HF. MeInfoText: associated gene methylation and cancer information from text mining. BMC Bioinformatics 2008; 9:22. [PMID: 18194557 PMCID: PMC2258285 DOI: 10.1186/1471-2105-9-22] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Received: 05/22/2007] [Accepted: 01/14/2008] [Indexed: 12/02/2022] Open
Abstract
Background: DNA methylation is an important epigenetic modification of the genome. Abnormal DNA methylation may result in silencing of tumor suppressor genes and is common in a variety of human cancer cells. As more epigenetics research is published electronically, it is desirable to extract relevant information from biological literature. To facilitate epigenetics research, we have developed a database called MeInfoText to provide gene methylation information from text mining.
Description: MeInfoText presents comprehensive association information about gene methylation and cancer, the profile of gene methylation among human cancer types and the gene methylation profile of a specific cancer type, based on association mining from large amounts of literature. In addition, MeInfoText offers integrated protein-protein interaction and biological pathway information collected from the Internet. MeInfoText also provides pathway cluster information for a set of genes that may contribute to the development of cancer due to aberrant methylation. The extracted evidence, with highlighted keywords and the gene names identified from each methylation-related abstract, is also retrieved. The database is now available at .
Conclusion: MeInfoText is a unique database that provides comprehensive gene methylation and cancer association information. It will complement existing DNA methylation information and will be useful in epigenetics research and the prevention of cancer.
Affiliation(s)
- Yu-Ching Fang: Institute of Molecular and Cellular Biology, National Taiwan University, Taipei 106, Taiwan
18
The ‘Open Discovery’ Challenge. ACTA ACUST UNITED AC 2008. [DOI: 10.1007/978-3-540-68690-3_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
19
Zhou D, He Y. Extracting interactions between proteins from the literature. J Biomed Inform 2007; 41:393-407. [PMID: 18207462 DOI: 10.1016/j.jbi.2007.11.008] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.5] [Received: 03/13/2007] [Revised: 11/21/2007] [Accepted: 11/28/2007] [Indexed: 11/29/2022]
Abstract
During the last decade, biomedicine has witnessed tremendous development. Large amounts of experimental and computational biomedical data have been generated along with new discoveries, accompanied by an exponential increase in the number of biomedical publications describing these discoveries. In the meantime, there has been great interest within scientific communities in text mining tools that find knowledge, such as protein-protein interactions, that is most relevant and useful for specific analysis tasks. This paper provides an outline of the various information extraction methods in the biomedical domain, especially for the discovery of protein-protein interactions. It surveys methodologies involved in analyzing and processing plain text, categorizes current work in biomedical information extraction, and provides examples of these methods. Challenges in the field are also presented and possible solutions are discussed.
Affiliation(s)
- Deyu Zhou: Informatics Research Centre, The University of Reading, Reading, RG6 6BX, UK
20
Frenz CM. Deafness mutation mining using regular expression based pattern matching. BMC Med Inform Decis Mak 2007; 7:32. [PMID: 17961241 PMCID: PMC2180167 DOI: 10.1186/1472-6947-7-32] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 06/14/2007] [Accepted: 10/25/2007] [Indexed: 11/16/2022] Open
Abstract
Background: While keyword-based queries of databases such as PubMed are frequently of great utility, the ability to use regular expressions in place of a keyword can often improve the results returned by such databases. Regular expressions can identify element types that cannot be readily specified by a single keyword and can distinguish different words with similar character sequences.
Results: A Perl-based utility was developed to allow the use of regular expressions in PubMed searches, thereby improving the accuracy of the searches.
Conclusion: This utility was then used to create a comprehensive listing of all DFN deafness mutations discussed in PubMed records containing the keywords "human ear".
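To make concrete why a regular expression can find items that a single keyword cannot, the sketch below scans text for DFN-style locus identifiers. It is written in Python rather than the Perl used by the utility in the abstract, and the pattern, the helper name `find_dfn_loci` and the sample sentence are all assumptions made for illustration, not taken from the paper.

```python
import re

# Hypothetical pattern: "DFN", an optional class letter (A, B, X or Y), then digits.
DFN_PATTERN = re.compile(r"\bDFN[ABXY]?\d+\b")


def find_dfn_loci(text):
    """Return the distinct DFN-style identifiers in text, in order of first appearance."""
    seen = []
    for match in DFN_PATTERN.finditer(text):
        if match.group() not in seen:
            seen.append(match.group())
    return seen


# Invented sample text standing in for a retrieved abstract.
sample = (
    "Mutations at DFNB1 and DFNA3 were reported; DFNB1 is the most common. "
    "A plain keyword search for 'DFN' would not list these identifiers."
)
loci = find_dfn_loci(sample)
```

The bare token `DFN` in the sample is skipped because the pattern requires trailing digits, while both numbered identifiers are captured once each. In practice the pattern would need tuning against the identifier formats that actually occur in the retrieved records.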
Affiliation(s)
- Christopher M Frenz
- Department of Computer Engineering Technology, New York City College of Technology (CUNY), 300 Jay St, Brooklyn, NY 11201, USA.
|
21
|
Hammamieh R, Chakraborty N, Wang Y, Laing M, Liu Z, Mulligan J, Jett M. GeneCite: a stand-alone open source tool for high-throughput literature and pathway mining. OMICS: A Journal of Integrative Biology 2007; 11:143-51. [PMID: 17594234 DOI: 10.1089/omi.2007.4322] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
Abstract
Systematic extraction of relevant biological facts from available massive scientific knowledge source is emerging as a significant task for the science community. Its success depends on several key factors, including the precision of a given search, the time of its accomplishment, and the communicative prowess of the mined information to the users. GeneCite - a stand-alone Java-based high-throughput data mining tool - is designed to carry out these tasks for several important knowledge sources simultaneously, allowing the users to integrate the results and interpret biological significance in a time-efficient manner. GeneCite provides an integrated high-throughput search platform serving as an information retrieval (IR) tool for probing online literature database (PubMed) and the sequence-tagged sites' database (UniSTS), respectively. It also operates as a data retrieval (DR) tool to mine an archive of biological pathways integrated into the software itself. Furthermore, GeneCite supports a retrieved data management system (DMS) showcasing the final output in a spread-sheet format. Each cell of the output file holds a real-time connection (hyperlink) to the given online archive reachable at the users' convenience. The software is free and currently available online www.bioinformatics.org; www.wrair.army.mil/Resources.
Affiliation(s)
- Rasha Hammamieh
- Walter Reed Army Institute of Research, Molecular Pathology, Silver Spring, Maryland 20910, USA
|
22
|
Faccioli P, Ciceri GP, Provero P, Stanca AM, Morcia C, Terzi V. A combined strategy of "in silico" transcriptome analysis and web search engine optimization allows an agile identification of reference genes suitable for normalization in gene expression studies. Plant Molecular Biology 2007; 63:679-88. [PMID: 17143578 DOI: 10.1007/s11103-006-9116-9] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.2] [Received: 09/19/2006] [Accepted: 11/12/2006] [Indexed: 05/12/2023]
Abstract
Traditionally housekeeping genes have been employed as endogenous reference (internal control) genes for normalization in gene expression studies. Since the utilization of single housekeepers cannot assure an unbiased result, new normalization methods involving multiple housekeeping genes and normalizing using their mean expression have been recently proposed. Moreover, since a gold standard gene suitable for every experimental condition does not exist, it is also necessary to validate the expression stability of every putative control gene on the specific requirements of the planned experiment. As a consequence, finding a good set of reference genes is for sure a non-trivial problem requiring quite a lot of lab-based experimental testing. In this work we identified novel candidate barley reference genes suitable for normalization in gene expression studies. An advanced web search approach aimed to collect, from publicly available web resources, the most interesting information regarding the expression profiling of candidate housekeepers on a specific experimental basis has been set up and applied, as an example, on stress conditions. A complementary lab-based analysis has been carried out to verify the expression profile of the selected genes in different tissues and during heat shock response. This combined dry/wet approach can be applied to any species and physiological condition of interest and can be considered very helpful to identify putative reference genes to be shortlisted every time a new experimental design has to be set up.
Affiliation(s)
- Primetta Faccioli
- CRA, Experimental Institute for Cereal Research, Via S. Protaso 302, 29017, Fiorenzuola d'Arda, PC, Italy.
|
23
|
Chen D, Müller HM, Sternberg PW. Automatic document classification of biological literature. BMC Bioinformatics 2006; 7:370. [PMID: 16893465 PMCID: PMC1559726 DOI: 10.1186/1471-2105-7-370] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Received: 03/28/2006] [Accepted: 08/07/2006] [Indexed: 12/02/2022] Open
Abstract
Background Document classification is a wide-spread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature. Results We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test-set (Reuters 21578) compared to previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept. Conclusion We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
Affiliation(s)
- David Chen
- Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California, USA
- Hans-Michael Müller
- Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California, USA
- Paul W Sternberg
- Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California, USA
|
24
|
Jensen LJ, Saric J, Bork P. Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 2006; 7:119-29. [PMID: 16418747 DOI: 10.1038/nrg1768] [Citation(s) in RCA: 356] [Impact Index Per Article: 19.8] [Indexed: 11/08/2022]
Abstract
For the average biologist, hands-on literature mining currently means a keyword search in PubMed. However, methods for extracting biomedical facts from the scientific literature have improved considerably, and the associated tools will probably soon be used in many laboratories to automatically annotate and analyse the growing number of system-wide experimental data sets. Owing to the increasing body of text and the open-access policies of many journals, literature mining is also becoming useful for both hypothesis generation and biological discovery. However, the latter will require the integration of literature and high-throughput data, which should encourage close collaborations between biologists and computational linguists.
Affiliation(s)
- Lars Juhl Jensen
- European Molecular Biology Laboratory, D-69117 Heidelberg, Germany.
|
25
|
Goetz T, von der Lieth CW. PubFinder: a tool for improving retrieval rate of relevant PubMed abstracts. Nucleic Acids Res 2005; 33:W774-8. [PMID: 15980583 PMCID: PMC1160190 DOI: 10.1093/nar/gki429] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.6] [Indexed: 11/13/2022] Open
Abstract
Since it is becoming increasingly laborious to manually extract useful information embedded in the ever-growing volumes of literature, automated intelligent text analysis tools are becoming more and more essential to assist in this task. PubFinder (www.glycosciences.de/tools/PubFinder) is a publicly available web tool designed to improve the retrieval rate of scientific abstracts relevant for a specific scientific topic. Only the selection of a representative set of abstracts is required, which are central for a scientific topic. No special knowledge concerning the query-syntax is necessary. Based on the selected abstracts, a list of discriminating words is automatically calculated, which is subsequently used for scoring all defined PubMed abstracts for their probability of belonging to the defined scientific topic. This results in a hit-list of references in the descending order of their likelihood score. The algorithms and procedures implemented in PubFinder facilitate the perpetual task for every scientist of staying up-to-date with current publications dealing with a specific subject in biomedicine.
|
26
|
Bajic VB, Veronika M, Veladandi PS, Meka A, Heng MW, Rajaraman K, Pan H, Swarup S. Dragon Plant Biology Explorer. A text-mining tool for integrating associations between genetic and biochemical entities with genome annotation and biochemical terms lists. Plant Physiology 2005; 138:1914-25. [PMID: 16172098 PMCID: PMC1183383 DOI: 10.1104/pp.105.060863] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 05/04/2023]
Abstract
We introduce a tool for text mining, Dragon Plant Biology Explorer (DPBE) that integrates information on Arabidopsis (Arabidopsis thaliana) genes with their functions, based on gene ontologies and biochemical entity vocabularies, and presents the associations as interactive networks. The associations are based on (1) user-provided PubMed abstracts; (2) a list of Arabidopsis genes compiled by The Arabidopsis Information Resource; (3) user-defined combinations of four vocabulary lists based on the ones developed by the general, plant, and Arabidopsis GO consortia; and (4) three lists developed here based on metabolic pathways, enzymes, and metabolites derived from AraCyc, BRENDA, and other metabolism databases. We demonstrate how various combinations can be applied to fields of (1) gene function and gene interaction analyses, (2) plant development, (3) biochemistry and metabolism, and (4) pharmacology of bioactive compounds. Furthermore, we show the suitability of DPBE for systems approaches by integration with "omics" platform outputs. Using a list of abiotic stress-related genes identified by microarray experiments, we show how this tool can be used to rapidly build an information base on the previously reported relationships. This tool complements the existing biological resources for systems biology by identifying potentially novel associations using text analysis between cellular entities based on genome annotation terms. Thus, it allows researchers to efficiently summarize existing information for a group of genes or pathways, so as to make better informed choices for designing validation experiments. Last, DPBE can be helpful for beginning researchers and graduate students to summarize vast information in an unfamiliar area. DPBE is freely available for academic and nonprofit users at http://research.i2r.a-star.edu.sg/DRAGON/ME2/.
Affiliation(s)
- Vladimir B Bajic
- Knowledge Extraction Lab, Institute for Infocomm Research, Singapore 119613
|
27
|
Shah PK, Jensen LJ, Boué S, Bork P. Extraction of transcript diversity from scientific literature. PLoS Comput Biol 2005; 1:e10. [PMID: 16103899 PMCID: PMC1183516 DOI: 10.1371/journal.pcbi.0010010] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Received: 02/01/2005] [Accepted: 05/21/2005] [Indexed: 11/26/2022] Open
Abstract
Transcript diversity generated by alternative splicing and associated mechanisms contributes heavily to the functional complexity of biological systems. The numerous examples of the mechanisms and functional implications of these events are scattered throughout the scientific literature. Thus, it is crucial to have a tool that can automatically extract the relevant facts and collect them in a knowledge base that can aid the interpretation of data from high-throughput methods. We have developed and applied a composite text-mining method for extracting information on transcript diversity from the entire MEDLINE database in order to create a database of genes with alternative transcripts. It contains information on tissue specificity, number of isoforms, causative mechanisms, functional implications, and experimental methods used for detection. We have mined this resource to identify 959 instances of tissue-specific splicing. Our results in combination with those from EST-based methods suggest that alternative splicing is the preferred mechanism for generating transcript diversity in the nervous system. We provide new annotations for 1,860 genes with the potential for generating transcript diversity. We assign the MeSH term “alternative splicing” to 1,536 additional abstracts in the MEDLINE database and suggest new MeSH terms for other events. We have successfully extracted information about transcript diversity and semiautomatically generated a database, LSAT, that can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression. LSAT (Literature Support for Alternative Transcripts) is publicly available at http://www.bork.embl.de/LSAT/.
Given the functional complexity of higher eukaryotes, the relatively small number of genes in the human and other mammalian genomes came as a surprise to the scientific community. Later it was discovered that the majority of genes are subject to alternative splicing (“cutting and pasting”) or associated mechanisms that ultimately increase the diversity of transcripts that code for proteins. Studies exploring transcript diversity are currently dominated by high-throughput experiments and computational methods; however, the quality of such data should be assessed against a reliable reference set based on single-gene studies. Unfortunately, the latter type of information is scattered throughout the scientific literature. The authors have thus developed a computational approach for extracting information on alternative transcripts from MEDLINE abstracts and used it to create a database, LSAT. LSAT (Literature Support for Alternative Transcripts) provides information for more than 4,000 genes from about 14,000 abstracts. This database can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression based on single-gene studies, which we show agrees well with EST-based studies (these studies involve tissue-specific splicing detected by the analysis of libraries of expressed sequence tags [ESTs]). These results indicate that mechanisms like alternative splicing, alternative promoters, and alternative polyadenylation work in concert to generate and regulate transcript diversity. More generally, information extraction of complex biological processes seems feasible and can also complement large-scale data generation in other areas to assign functions to genes.
Affiliation(s)
- Parantu K Shah
- Structural and Computational Biology Program, European Molecular Biology Laboratory, Heidelberg, Germany
- Max Delbrück Centre for Molecular Medicine, Berlin-Buch, Germany
- Lars J Jensen
- Structural and Computational Biology Program, European Molecular Biology Laboratory, Heidelberg, Germany
- Stéphanie Boué
- Structural and Computational Biology Program, European Molecular Biology Laboratory, Heidelberg, Germany
- Peer Bork
- Structural and Computational Biology Program, European Molecular Biology Laboratory, Heidelberg, Germany
- Max Delbrück Centre for Molecular Medicine, Berlin-Buch, Germany
- * To whom correspondence should be addressed. E-mail:
|
28
|
Schijvenaars BJA, Mons B, Weeber M, Schuemie MJ, van Mulligen EM, Wain HM, Kors JA. Thesaurus-based disambiguation of gene symbols. BMC Bioinformatics 2005; 6:149. [PMID: 15958172 PMCID: PMC1183190 DOI: 10.1186/1471-2105-6-149] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.5] [Received: 11/22/2004] [Accepted: 06/16/2005] [Indexed: 11/28/2022] Open
Abstract
Background Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck. Results We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set. Conclusion The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications.
Affiliation(s)
- Bob JA Schijvenaars
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
- Barend Mons
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
- Marc Weeber
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
- Martijn J Schuemie
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
- Erik M van Mulligen
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
- Hester M Wain
- HUGO Gene Nomenclature Committee, Department of Biology, University College London, Wolfson House, 4 Stephenson Way, London NW1 2HE, UK
- Jan A Kors
- Department of Medical Informatics, Erasmus University Medical Center Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
|
29
|
Abstract
BACKGROUND The development of text mining systems that annotate biological entities with their properties using scientific literature is an important recent research topic. These systems need first to recognize the biological entities and properties in the text, and then decide which pairs represent valid annotations. METHODS This document introduces a novel unsupervised method for recognizing biological properties in unstructured text, involving the evidence content of their names. RESULTS This document shows the results obtained by the application of our method to BioCreative tasks 2.1 and 2.2, where it identified Gene Ontology annotations and their evidence in a set of articles. CONCLUSION From the performance obtained in BioCreative, we concluded that an automatic annotation system can effectively use our method to identify biological properties in unstructured text.
Affiliation(s)
- Francisco M Couto
- Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Portugal
- Mário J Silva
- Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Portugal
- Pedro M Coutinho
- Architecture et Fonction des Macromolécules Biologiques, CNRS, Marseille, France
|
30
|
Suomela BP, Andrade MA. Ranking the whole MEDLINE database according to a large training set using text indexing. BMC Bioinformatics 2005; 6:75. [PMID: 15790421 PMCID: PMC1274266 DOI: 10.1186/1471-2105-6-75] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.5] [Received: 09/08/2004] [Accepted: 03/24/2005] [Indexed: 12/03/2022] Open
Abstract
Background The MEDLINE database contains over 12 million references to scientific literature, with about 3/4 of recent articles including an abstract of the publication. Retrieval of entries using queries with keywords is useful for human users that need to obtain small selections. However, particular analyses of the literature or database developments may need the complete ranking of all the references in the MEDLINE database as to their relevance to a topic of interest. This report describes a method that does this ranking using the differences in word content between MEDLINE entries related to a topic and the whole of MEDLINE, in a computational time appropriate for an article search query engine. Results We tested the capabilities of our system to retrieve MEDLINE references which are relevant to the subject of stem cells. We took advantage of the existing annotation of references with terms from the MeSH hierarchical vocabulary (Medical Subject Headings, developed at the National Library of Medicine). A training set of 81,416 references was constructed by selecting entries annotated with the MeSH term stem cells or some child in its sub tree. Frequencies of all nouns, verbs, and adjectives in the training set were computed and the ratios of word frequencies in the training set to those in the entire MEDLINE were used to score references. Self-consistency of the algorithm, benchmarked with a test set containing the training set and an equal number of references randomly selected from MEDLINE was better using nouns (79%) than adjectives (73%) or verbs (70%). The evaluation of the system with 6,923 references not used for training, containing 204 articles relevant to stem cells according to a human expert, indicated a recall of 65% for a precision of 65%. Conclusion This strategy appears to be useful for predicting the relevance of MEDLINE references to a given concept. The method is simple and can be used with any user-defined training set. 
Choice of the part of speech of the words used for classification has important effects on performance. Lists of words, scripts, and additional information are available from the web address .
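The frequency-ratio idea in this abstract can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the abstract does not say how per-word ratios are combined, so the mean log ratio used here is an assumption, part-of-speech filtering is omitted, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def relative_frequencies(documents):
    """Relative frequency of each lower-cased, whitespace-separated word."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def score(text, topic_freq, background_freq):
    """Mean log ratio of topic to background frequency (assumed aggregation).

    Words absent from the topic set are ignored; a text with no topic
    words scores 0.0.
    """
    logs = [
        math.log(topic_freq[w] / background_freq[w])
        for w in text.lower().split()
        if w in topic_freq and w in background_freq
    ]
    return sum(logs) / len(logs) if logs else 0.0


# Toy corpora: the background plays the role of the whole database and
# therefore contains the topic documents as well.
topic_docs = ["stem cells differentiate", "stem cells renew"]
background_docs = topic_docs + ["protein binds dna", "cells express protein"]

topic_freq = relative_frequencies(topic_docs)
background_freq = relative_frequencies(background_docs)

ranked = sorted(
    ["protein binds dna", "stem cells renew", "cells express protein"],
    key=lambda t: score(t, topic_freq, background_freq),
    reverse=True,
)
```

Words that are over-represented in the topic set relative to the background push a reference up the ranking, which is the intuition the abstract describes.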
Affiliation(s)
- Brian P Suomela
- Ontario Genomics Innovation Centre, Ottawa Health Research Institute, 501 Smyth Rd, Ottawa, Ontario K1H 8L6, Canada
- Miguel A Andrade
- Ontario Genomics Innovation Centre, Ottawa Health Research Institute, 501 Smyth Rd, Ottawa, Ontario K1H 8L6, Canada
|
31
|
Tiffin N, Kelso JF, Powell AR, Pan H, Bajic VB, Hide WA. Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Res 2005; 33:1544-52. [PMID: 15767279 PMCID: PMC1065256 DOI: 10.1093/nar/gki296] [Citation(s) in RCA: 143] [Impact Index Per Article: 7.5] [Indexed: 11/12/2022] Open
Abstract
Genome-wide techniques such as microarray analysis, Serial Analysis of Gene Expression (SAGE), Massively Parallel Signature Sequencing (MPSS), linkage analysis and association studies are used extensively in the search for genes that cause diseases, and often identify many hundreds of candidate disease genes. Selection of the most probable of these candidate disease genes for further empirical analysis is a significant challenge. Additionally, identifying the genes that cause complex diseases is problematic due to low penetrance of multiple contributing genes. Here, we describe a novel bioinformatic approach that selects candidate disease genes according to their expression profiles. We use the eVOC anatomical ontology to integrate text-mining of biomedical literature and data-mining of available human gene expression data. To demonstrate that our method is successful and widely applicable, we apply it to a database of 417 candidate genes containing 17 known disease genes. We successfully select the known disease gene for 15 out of 17 diseases and reduce the candidate gene set to 63.3% (±18.8%) of its original size. This approach facilitates direct association between genomic data describing gene expression and information from biomedical texts describing disease phenotype, and successfully prioritizes candidate genes according to their expression in disease-affected tissues.
Affiliation(s)
- Nicki Tiffin
- South African National Bioinformatics Institute, University of the Western Cape Belville 7535, South Africa.
|
32
|
Abstract
A fundamental task of pharmacogenetics is to collect and classify relationships between genes and drugs. Currently, this useful information has not been comprehensively aggregated in any database and remains scattered throughout the published literature. Although there are efforts to collect this information manually, they are limited by the size of the published literature on gene-drug relationships. Therefore, we investigated computational methods to extract and characterize pharmacogenetic relationships between genes and drugs from the literature. We first evaluated the effectiveness of the co-occurrence method in identifying related genes and drugs. We then used supervised machine learning algorithms to classify the relationships between genes and drugs from the Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) into five categories that have been defined by active pharmacogenetic researchers as relevant to their work. The final co-occurrence algorithm was able to extract 78% of the related genes and drugs that were published in a review article from the literature. Our algorithm subsequently classified the relationships between genes and drugs from the PharmGKB into five categories with 74% accuracy. We have made the data available on a supplementary website at http://bionlp.stanford.edu/genedrug/. Gene-drug relationships can be accurately extracted from text and classified into categories. Although the relationships that we have identified do not capture the details and fine distinctions often made in the literature, these methods will help scientists to track the ever-growing literature and create information resources to support future discoveries.
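As a rough illustration of the co-occurrence step mentioned in this abstract, the Python sketch below counts how often a gene name and a drug name appear in the same abstract. It is a toy under stated assumptions: the abstract does not specify the text unit or any thresholds, so abstract-level counting is assumed, and the `cooccurrence_counts` helper, the lexicons and the example sentences are hypothetical.

```python
import re
from collections import Counter
from itertools import product


def cooccurrence_counts(abstracts, genes, drugs):
    """Count (gene, drug) pairs whose names both occur in the same abstract.

    Matching is case-insensitive on whole alphanumeric tokens; each pair
    is counted at most once per abstract.
    """
    counts = Counter()
    for text in abstracts:
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        found_genes = [g for g in genes if g.lower() in tokens]
        found_drugs = [d for d in drugs if d.lower() in tokens]
        counts.update(product(found_genes, found_drugs))
    return counts


# Toy lexicons and abstracts, for illustration only.
genes = ["CYP2D6", "TPMT"]
drugs = ["codeine", "azathioprine"]
abstracts = [
    "CYP2D6 variants alter codeine metabolism.",
    "Codeine response depends on CYP2D6 genotype.",
    "TPMT testing guides azathioprine dosing.",
    "CYP2D6 is highly polymorphic.",
]

pair_counts = cooccurrence_counts(abstracts, genes, drugs)
```

Pairs with higher counts are candidate relationships; assigning them to categories, as the abstract describes, would be a separate supervised step not shown here.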
Affiliation(s)
- Jeffrey T Chang
- Department of Genetics, Stanford Biomedical Informatics, Stanford, CA 94305-5120, USA
|
33
|
Collier N, Takeuchi K. Comparison of character-level and part of speech features for name recognition in biomedical texts. J Biomed Inform 2004; 37:423-35. [PMID: 15542016 DOI: 10.1016/j.jbi.2004.08.008] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Received: 07/23/2004] [Indexed: 10/26/2022]
Abstract
The immense volume of data which is now available from experiments in molecular biology has led to an explosion in reported results most of which are available only in unstructured text format. For this reason there has been great interest in the task of text mining to aid in fact extraction, document screening, citation analysis, and linkage with large gene and gene-product databases. In particular there has been an intensive investigation into the named entity (NE) task as a core technology in all of these tasks which has been driven by the availability of high volume training sets such as the GENIA v3.02 corpus. Despite such large training sets accuracy for biology NE has proven to be consistently far below the high levels of performance in the news domain where F scores above 90 are commonly reported which can be considered near to human performance. We argue that it is crucial that more rigorous analysis of the factors that contribute to the model's performance be applied to discover where the underlying limitations are and what our future research direction should be. Our investigation in this paper reports on variations of two widely used feature types, part of speech (POS) tags and character-level orthographic features, and makes a comparison of how these variations influence performance. We base our experiments on a proven state-of-the-art model, support vector machines using a high quality subset of 100 annotated MEDLINE abstracts. Experiments reveal that the best performing features are orthographic features with F score of 72.6. Although the Brill tagger trained in-domain on the GENIA v3.02p POS corpus gives the best overall performance of any POS tagger, at an F score of 68.6, this is still significantly below the orthographic features. In combination these two features types appear to interfere with each other and degrade performance slightly to an F score of 72.3.
Affiliation(s)
- Nigel Collier
- National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan.
|
34
|
Müller HM, Kenny EE, Sternberg PW. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2004; 2:e309. [PMID: 15383839 PMCID: PMC517822 DOI: 10.1371/journal.pbio.0020309] [Citation(s) in RCA: 415] [Impact Index Per Article: 20.8] [Received: 11/17/2003] [Accepted: 07/19/2004] [Indexed: 11/19/2022] Open
Abstract
We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. 
Textpresso is a useful curation tool, as well as a search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org. With the increasing availability of full-text scientific papers online, new tools, such as Textpresso, will help to extract information and knowledge from research literature.
Affiliation(s)
- Hans-Michael Müller
- Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California, United States of America
- Eimear E Kenny
- Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California, United States of America
- Paul W Sternberg
- Division of Biology and Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California, United States of America
|
35
|
Pan H, Zuo L, Choudhary V, Zhang Z, Leow SH, Chong FT, Huang Y, Ong VWS, Mohanty B, Tan SL, Krishnan SPT, Bajic VB. Dragon TF Association Miner: a system for exploring transcription factor associations through text-mining. Nucleic Acids Res 2004; 32:W230-4. [PMID: 15215386 PMCID: PMC441622 DOI: 10.1093/nar/gkh484] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022] Open
Abstract
We present Dragon TF Association Miner (DTFAM), a system for text-mining of PubMed documents for potential functional association of transcription factors (TFs) with terms from Gene Ontology (GO) and with diseases. DTFAM has been trained and tested in the selection of relevant documents on a manually curated dataset containing >3000 PubMed abstracts relevant to transcription control. On our test data the system achieves sensitivity of 80% with specificity of 82%. DTFAM provides comprehensive tabular and graphical reports linking terms to relevant sets of documents. These documents are color-coded for easier inspection. DTFAM complements the existing biological resources by collecting, assessing, extracting and presenting associations that can reveal some of the not so easily observable connections among the entities found which could explain the functions of TFs and help decipher parts of gene transcriptional regulatory networks. DTFAM summarizes information from a large volume of documents saving time and making analysis simpler for individual users. DTFAM is freely available for academic and non-profit users at http://research.i2r.a-star.edu.sg/DRAGON/TFAM/.
Affiliation(s)
- Hong Pan
- Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
|
36
|
Rebholz-Schuhmann D, Marcel S, Albert S, Tolle R, Casari G, Kirsch H. Automatic extraction of mutations from Medline and cross-validation with OMIM. Nucleic Acids Res 2004; 32:135-42. [PMID: 14704350 PMCID: PMC373272 DOI: 10.1093/nar/gkh162] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.5] [Indexed: 11/13/2022] Open
Abstract
Mutations help us to understand the molecular origins of diseases. Researchers, therefore, both publish and seek disease-relevant mutations in public databases and in scientific literature, e.g. Medline. The retrieval tends to be time-consuming and incomplete. Automated screening of the literature is more efficient. We developed extraction methods (called MEMA) that scan Medline abstracts for mutations. MEMA identified 24,351 singleton mutations in conjunction with a HUGO gene name out of 16,728 abstracts. From a sample of 100 abstracts, we estimated the recall for identifying mutation-gene pairs at 35%, with a precision of 93%. Recall for the mutation detection alone was >67% with a precision rate of >96%. This shows that our system produces reliable data. The subset consisting of protein sequence mutations (PSMs) from MEMA was compared to the entries in OMIM (20,503 entries versus 6699, respectively). We found 1826 PSM-gene pairs to be in common to both datasets (cross-validated). This is 27% of all PSM-gene pairs in OMIM and 91% of those pairs from OMIM which co-occur in at least one Medline abstract. We conclude that Medline covers a large portion of the mutations known to OMIM. Another large portion could be artificially produced mutations from mutagenesis experiments. Access to the database of extracted mutation-gene pairs is available through the web pages of the EBI (refer to http://www.ebi.ac.uk/rebholz/index.html).
|
37
|
Shah PK, Perez-Iratxeta C, Bork P, Andrade MA. Information extraction from full text scientific articles: where are the keywords? BMC Bioinformatics 2003; 4:20. [PMID: 12775220 PMCID: PMC166134 DOI: 10.1186/1471-2105-4-20] [Citation(s) in RCA: 114] [Impact Index Per Article: 5.4] [Received: 03/05/2003] [Accepted: 05/29/2003] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND: To date, many of the methods for extracting biological information from scientific articles are restricted to the abstract of the article. However, full text articles in electronic version, which offer larger sources of data, are currently available. Several questions arise as to whether the effort of scanning full text articles is worthwhile, or whether the information that can be extracted from the different sections of an article can be relevant. RESULTS: In this work we addressed those questions, showing that the keyword content of the different sections of a standard scientific article (abstract, introduction, methods, results, and discussion) is very heterogeneous. CONCLUSIONS: Although the abstract contains the best ratio of keywords per total of words, other sections of the article may be a better source of biologically relevant data.
Affiliation(s)
- Parantu K Shah
- Biocomputing, European Molecular Biology Laboratory, Heidelberg, Germany
- Department of Bioinformatics, Max Delbrück Center for Molecular Medicine, Berlin-Buch, Germany
- Carolina Perez-Iratxeta
- Biocomputing, European Molecular Biology Laboratory, Heidelberg, Germany
- Department of Bioinformatics, Max Delbrück Center for Molecular Medicine, Berlin-Buch, Germany
- Peer Bork
- Biocomputing, European Molecular Biology Laboratory, Heidelberg, Germany
- Department of Bioinformatics, Max Delbrück Center for Molecular Medicine, Berlin-Buch, Germany
- Miguel A Andrade
- Biocomputing, European Molecular Biology Laboratory, Heidelberg, Germany
- Department of Bioinformatics, Max Delbrück Center for Molecular Medicine, Berlin-Buch, Germany
- Present address: Bioinformatics group, Ottawa Health Research Institute, Ottawa, Canada
|
38
|
Ponomarenko J, Merkulova T, Orlova G, Fokin O, Gorshkov E, Ponomarenko M. Mining DNA sequences to predict sites which mutations cause genetic diseases. Knowl Based Syst 2002. [DOI: 10.1016/s0950-7051(01)00144-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 10/17/2022]
|
39
|
Rabow AA, Shoemaker RH, Sausville EA, Covell DG. Mining the National Cancer Institute's tumor-screening database: identification of compounds with similar cellular activities. J Med Chem 2002; 45:818-40. [PMID: 11831894 DOI: 10.1021/jm010385b] [Citation(s) in RCA: 108] [Impact Index Per Article: 4.9] [Indexed: 11/28/2022]
Abstract
In an effort to enhance access to information available in the National Cancer Institute's (NCI) anticancer drug-screening database, a new suite of Internet accessible (http://spheroid.ncifcrf.gov) computational tools has been assembled for self-organizing map-based (SOM) cluster analysis and data visualization. A range of analysis questions were initially addressed to evaluate improvements in SOM cluster quality based on the data-conditioning procedures of Z-score normalization, capping, and treatment of missing data as well as completeness of drug cell-screening data. These studies established a foundation for SOM cluster analysis of the complete set of NCI's publicly available antitumor drug-screening data. This analysis identified relationships between chemotypes of screened agents and their effect on four major classes of cellular activities: mitosis, nucleic acid synthesis, membrane transport and integrity, and phosphatase- and kinase-mediated cell cycle regulation. Validations of these cellular activities, obtained from literature sources, found (i) strong evidence supporting within cluster memberships and shared cellular activity, (ii) indications of compound selectivity between various types of cellular activity, and (iii) strengths and weaknesses of the NCI's antitumor drug screen data for assigning compounds to these classes of cellular activity. Subsequent analyses of averaged responses within these tumor panel types find a strong dependence on chemotype for coherence among cellular response patterns. The advantages of a global analysis of the complete screening data set are discussed.
Affiliation(s)
- Alfred A Rabow
- Science Applications International Corporation and Developmental Therapeutics Program, DCTD, National Cancer Institute/NIH, Frederick, MD 21702, USA.
|
40
|
Mellor JC, Yanai I, Clodfelter KH, Mintseris J, DeLisi C. Predictome: a database of putative functional links between proteins. Nucleic Acids Res 2002; 30:306-9. [PMID: 11752322 PMCID: PMC99135 DOI: 10.1093/nar/30.1.306] [Citation(s) in RCA: 107] [Impact Index Per Article: 4.9] [Indexed: 11/14/2022] Open
Abstract
The current deluge of genomic sequences has spawned the creation of tools capable of making sense of the data. Computational and high-throughput experimental methods for generating links between proteins have recently been emerging. These methods effectively act as hypothesis machines, allowing researchers to screen large sets of data to detect interesting patterns that can then be studied in greater detail. Although the potential use of these putative links in predicting gene function has been demonstrated, a central repository for all such links for many genomes would maximize their usefulness. Here we present Predictome, a database of predicted links between the proteins of 44 genomes based on the implementation of three computational methods--chromosomal proximity, phylogenetic profiling and domain fusion--and large-scale experimental screenings of protein-protein interaction data. The combination of data from various predictive methods in one database allows for their comparison with each other, as well as visualization of their correlation with known pathway information. As a repository for such data, Predictome is an ongoing resource for the community, providing functional relationships among proteins as new genomic data emerges. Predictome is available at http://predictome.bu.edu.
Affiliation(s)
- Joseph C Mellor
- Bioinformatics Graduate Program and Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
|
41
|
Abstract
The most frequent access to the MEDLINE database of scientific abstracts is by keyword search. However, this is often not sufficient: although the user might find all the useful abstracts, these are buried among hundreds that are irrelevant. The exploratory tool XplorMed has been developed to analyse the result of any MEDLINE query. It suggests main groups of related topics and documents, sparing the user the need to read all abstracts.
Affiliation(s)
- C Perez-Iratxeta
- European Molecular Biology Laboratory, Meyerhofstr. 1, 69012 Heidelberg, Germany.
|
42
|
Kolesov G, Mewes HW, Frishman D. SNAPping up functionally related genes based on context information: a colinearity-free approach. J Mol Biol 2001; 311:639-56. [PMID: 11518521 DOI: 10.1006/jmbi.2001.4701] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.3] [Indexed: 11/22/2022]
Abstract
We describe a computational approach for finding genes that are functionally related but do not possess any noticeable sequence similarity. Our method, which we call SNAP (similarity-neighborhood approach), reveals the conservation of gene order on bacterial chromosomes based on both cross-genome comparison and context information. The novel feature of this method is that it does not rely on detection of conserved colinear gene strings. Instead, we introduce the notion of a similarity-neighborhood graph (SN-graph), which is constructed from the chains of similarity and neighborhood relationships between orthologous genes in different genomes and adjacent genes in the same genome, respectively. An SN-cycle is defined as a closed path on the SN-graph and is postulated to preferentially join functionally related gene products that participate in the same biochemical or regulatory process. We demonstrate the substantial non-randomness and functional significance of SN-cycles derived from real genome data and estimate the prediction accuracy of SNAP in assigning broad function to uncharacterized proteins. Examples of practical application of SNAP for improving the quality of genome annotation are described.
Affiliation(s)
- G Kolesov
- GSF - National Research Center for Environment and Health, Institute for Bioinformatics, Ingolstädter Landstrasse 1, Neuherberg, 85764, Germany
|
43
|
|
44
|
Jenssen TK, Laegreid A, Komorowski J, Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 2001; 28:21-8. [PMID: 11326270 DOI: 10.1038/ng0501-21] [Citation(s) in RCA: 382] [Impact Index Per Article: 16.6] [Indexed: 11/09/2022]
Abstract
We have carried out automated extraction of explicit and implicit biomedical knowledge from publicly available gene and text databases to create a gene-to-gene co-citation network for 13,712 named human genes by automated analysis of titles and abstracts in over 10 million MEDLINE records. The associations between genes have been annotated by linking genes to terms from the medical subject heading (MeSH) index and terms from the gene ontology (GO) database. The extracted database and accompanying web tools for gene-expression analysis have collectively been named 'PubGene'. We validated the extracted networks by three large-scale experiments showing that co-occurrence reflects biologically meaningful relationships, thus providing an approach to extract and structure known biology. We validated the applicability of the tools by analyzing two publicly available microarray data sets.
Affiliation(s)
- T K Jenssen
- Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
|
45
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2001. [PMCID: PMC2447185 DOI: 10.1002/cfg.55] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022] Open
|
46
|
Data mining. Nat Biotechnol 2000. [DOI: 10.1038/80073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
|