Reference Citation Analysis

For: Andrade MA, Bork P. Automated extraction of information in molecular biology. FEBS Lett 2000;476:12-7. [PMID: 10878241 DOI: 10.1016/s0014-5793(00)01661-6] [Citation(s) in RCA: 63] [Impact Index Per Article: 2.6] [Indexed: 10/18/2022]

Cited by Other Article(s)

Rey CA, Danguilan JL, Mendoza KP, Remolona MF. Transformer-based approach to variable typing. Heliyon 2023;9:e20505. [PMID: 37842594 PMCID: PMC10568320 DOI: 10.1016/j.heliyon.2023.e20505] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 09/19/2023] [Accepted: 09/27/2023] [Indexed: 10/17/2023]

De Chiara F, Ferret-Miñana A, Ramón-Azcón J. The Synergy between Organ-on-a-Chip and Artificial Intelligence for the Study of NAFLD: From Basic Science to Clinical Research. Biomedicines 2021;9:248. [PMID: 33801289 PMCID: PMC7999375 DOI: 10.3390/biomedicines9030248] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/29/2021] [Revised: 02/20/2021] [Accepted: 02/25/2021] [Indexed: 12/15/2022]

Piereck B, Oliveira-Lima M, Benko-Iseppon AM, Diehl S, Schneider R, Brasileiro-Vidal AC, Barbosa-Silva A. LAITOR4HPC: A text mining pipeline based on HPC for building interaction networks. BMC Bioinformatics 2020;21:365. [PMID: 32838742 PMCID: PMC7447576 DOI: 10.1186/s12859-020-03620-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/14/2019] [Accepted: 06/19/2020] [Indexed: 11/11/2022]

Abstract

Background

The number of published full-text articles has increased dramatically. Text mining tools are an essential approach to building biological networks, updating databases and providing annotation for new pathways. PESCADOR is an online web server based on the LAITOR and NLProt text mining tools, which retrieves protein-protein co-occurrences in a tabular format and adds a network schema. Here we present an HPC-oriented version of PESCADOR's native text mining tool, renamed LAITOR4HPC, which aims to process an unlimited number of abstracts in a short time to enrich available networks, build new ones and possibly highlight whether fields of research have been exhaustively studied.

Results

By taking advantage of parallel computing HPC infrastructure, the full collection of MEDLINE abstracts available until June 2017 was analyzed in a shorter period (6 days) than the original online implementation would need (an estimated 2 years for the same data). Additionally, three case studies were presented to illustrate LAITOR4HPC usage possibilities. The first case study targeted soybean and was used to retrieve an overview of published co-occurrences in a single organism, retrieving 15,788 proteins in 7,894 co-occurrences. In the second case study, a target gene family was searched in many organisms, by analyzing 15 species under biotic stress. Most co-occurrences involved Arabidopsis thaliana and Zea mays. The third case study concerned the construction and enrichment of an available pathway. Choosing A. thaliana for further analysis, the defensin pathway was enriched, showing additional signaling and regulation molecules, and how they respond to each other in the modulation of this complex plant defense response.

Conclusions

LAITOR4HPC can be used for an efficient text mining based construction of biological networks derived from big data sources, such as MEDLINE abstracts. Time consumption and data input limitations will depend on the available resources at the HPC facility. LAITOR4HPC enables enough flexibility for different approaches and data amounts targeted to an organism, a subject, or a specific pathway. Additionally, it can deliver comprehensive results where interactions are classified into four types, according to their reliability.
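The abstract above does not describe how LAITOR4HPC distributes its work, so the following is only a minimal sketch of the general idea: split the abstracts into chunks, count protein co-occurrences per chunk in parallel, then merge the partial counts. The `PROTEINS` dictionary, the function names, and the simple token matching are all assumptions made for illustration (the real pipeline relies on NLProt tagging, and threads stand in here for HPC workers).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Hypothetical protein dictionary; a real pipeline would use a tagger instead.
PROTEINS = {"pdf1.2", "jar1", "coi1"}

def count_chunk(abstracts):
    """Count unordered protein pairs that share an abstract within one chunk."""
    counts = Counter()
    for text in abstracts:
        tokens = {tok.strip(".,;()").lower() for tok in text.split()}
        found = sorted(tokens & PROTEINS)
        counts.update(combinations(found, 2))
    return counts

def count_cooccurrences(abstracts, n_chunks=2):
    """Split the abstracts into chunks, count each chunk in parallel, merge."""
    chunks = [abstracts[i::n_chunks] for i in range(n_chunks)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        for partial in pool.map(count_chunk, chunks):
            total += partial
    return total

abstracts = [
    "COI1 is required for PDF1.2 induction.",
    "JAR1 and COI1 act in jasmonate signaling.",
    "PDF1.2 expression depends on COI1.",
]
pairs = count_cooccurrences(abstracts)
for pair, n in pairs.most_common():
    print(pair, n)
```

Because each chunk is counted independently and the merge is a plain sum, the same shape would carry over to separate processes or cluster jobs.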

Frequent pattern discovery with tri-partition alphabets. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2018.04.013] [Citation(s) in RCA: 61] [Impact Index Per Article: 15.3] [Indexed: 11/21/2022]

Buchkremer R, Demund A, Ebener S, Gampfer F, Jagering D, Jurgens A, Klenke S, Krimpmann D, Schmank J, Spiekermann M, Wahlers M, Wiepke M. The Application of Artificial Intelligence Technologies as a Substitute for Reading and to Support and Enhance the Authoring of Scientific Review Articles. IEEE Access 2019;7:65263-65276. [DOI: 10.1109/access.2019.2917719] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 08/29/2023]

Past, current and future trends in enterprise architecture—A view beyond the horizon. Comput Ind 2018. [DOI: 10.1016/j.compind.2018.03.006] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Indexed: 11/18/2022]

Rezaeian M, Montazeri H, Loonen R. Science foresight using life-cycle analysis, text mining and clustering: A case study on natural ventilation. Technological Forecasting and Social Change 2017;118:270-280. [PMID: 32287406 PMCID: PMC7126682 DOI: 10.1016/j.techfore.2017.02.027] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 04/18/2015] [Revised: 02/09/2017] [Accepted: 02/21/2017] [Indexed: 06/11/2023]

Radom M, Rybarczyk A, Kottmann R, Formanowicz P, Szachniuk M, Glöckner FO, Rebholz-Schuhmann D, Błażewicz J. Poseidon: An information retrieval and extraction system for metagenomic marine science. Ecol Inform 2012. [DOI: 10.1016/j.ecoinf.2012.07.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/28/2022]

Ponomarenko J, Orlova G, Merkulova T, Vasiliev G, Ponomarenko M. Mining genome variation to associate genetic disease with mutation alterations and ortho/paralogous polimorphysms in transcription factor binding site. Int J Artif Intell T 2011. [DOI: 10.1142/s0218213005002284] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]

Fontaine JF, Priller F, Barbosa-Silva A, Andrade-Navarro MA. Génie: literature-based gene prioritization at multi genomic scale. Nucleic Acids Res 2011;39:W455-61. [PMID: 21609954 PMCID: PMC3125729 DOI: 10.1093/nar/gkr246] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.4] [Indexed: 11/20/2022]

Frijters R, van Vugt M, Smeets R, van Schaik R, de Vlieg J, Alkema W. Literature mining for the discovery of hidden connections between drugs, genes and diseases. PLoS Comput Biol 2010;6. [PMID: 20885778 PMCID: PMC2944780 DOI: 10.1371/journal.pcbi.1000943] [Citation(s) in RCA: 120] [Impact Index Per Article: 8.6] [Received: 03/15/2010] [Accepted: 08/26/2010] [Indexed: 01/19/2023]

Abstract

The scientific literature represents a rich source for retrieval of knowledge on associations between biomedical concepts such as genes, diseases and cellular processes. A commonly used method to establish relationships between biomedical concepts from literature is co-occurrence. Apart from its use in knowledge retrieval, the co-occurrence method is also well-suited to discover new, hidden relationships between biomedical concepts following a simple ABC-principle, in which A and C have no direct relationship, but are connected via shared B-intermediates. In this paper we describe CoPub Discovery, a tool that mines the literature for new relationships between biomedical concepts. Statistical analysis using ROC curves showed that CoPub Discovery performed well over a wide range of settings and keyword thesauri. We subsequently used CoPub Discovery to search for new relationships between genes, drugs, pathways and diseases. Several of the newly found relationships were validated using independent literature sources. In addition, new predicted relationships between compounds and cell proliferation were validated and confirmed experimentally in an in vitro cell proliferation assay. The results show that CoPub Discovery is able to identify novel associations between genes, drugs, pathways and diseases that have a high probability of being biologically valid. This makes CoPub Discovery a useful tool to unravel the mechanisms behind disease, to find novel drug targets, or to find novel applications for existing drugs.

The biomedical literature is an important source of knowledge on the function of genes and on the mechanisms by which these genes regulate cellular processes. Several text mining approaches have been developed to leverage this rich source of information by automatically extracting associations between concepts such as genes, diseases and drugs from a large body of text. Here, we describe a new method that extracts novel, not yet recognized associations between genes, diseases, drugs and cellular processes from the biomedical literature. Our method is built on the assumption that even if two concepts do not have a direct connection in literature, they may be functionally related if they are both connected to an overlapping set of concepts. Using this approach we predicted several novel connections between genes, diseases, drugs and pathways. Our results imply that our method is able to predict novel relationships from literature and, most importantly, that these newly identified relationships are biologically relevant. Our method can aid the drug discovery process where it can be used to find novel drug targets, increase insight in mode of action of a drug or find novel applications for known drugs.
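The ABC-principle described above can be sketched in a few lines. This is an illustrative toy, not CoPub Discovery's method: the concept names are invented, and candidates are ranked simply by the number of shared B-intermediates, whereas the published tool uses its own statistical scoring.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(documents):
    """Map each concept to the set of concepts it shares a document with."""
    neighbors = defaultdict(set)
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors

def hidden_links(neighbors, a):
    """Return {C: set of shared B} for every C that never co-occurs with A."""
    candidates = defaultdict(set)
    for b in neighbors[a]:
        for c in neighbors[b]:
            if c != a and c not in neighbors[a]:
                candidates[c].add(b)
    return dict(candidates)

# Hypothetical documents, each reduced to the concepts it mentions.
docs = [
    {"drugX", "geneB1"},
    {"drugX", "geneB2"},
    {"geneB1", "diseaseC"},
    {"geneB2", "diseaseC"},
    {"geneB2", "diseaseD"},
]
neighbors = build_cooccurrence(docs)
links = hidden_links(neighbors, "drugX")

# Rank candidate C concepts by how many B-intermediates support them.
ranked = sorted(links.items(), key=lambda kv: (-len(kv[1]), kv[0]))
for concept, intermediates in ranked:
    print(concept, sorted(intermediates))
```

Here `drugX` never appears alongside `diseaseC`, yet both are linked to `geneB1` and `geneB2`, so `diseaseC` surfaces as the best-supported hidden relationship.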

Blaschke C, Hoffmann R, Oliveros JC, Valencia A. Extracting information automatically from biological literature. Comp Funct Genomics 2010;2:310-3. [PMID: 18629239 PMCID: PMC2448400 DOI: 10.1002/cfg.102] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 07/04/2001] [Accepted: 07/27/2001] [Indexed: 11/13/2022]

Barbosa-Silva A, Soldatos TG, Magalhães ILF, Pavlopoulos GA, Fontaine JF, Andrade-Navarro MA, Schneider R, Ortega JM. LAITOR--Literature Assistant for Identification of Terms co-Occurrences and Relationships. BMC Bioinformatics 2010;11:70. [PMID: 20122157 PMCID: PMC3098111 DOI: 10.1186/1471-2105-11-70] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Received: 08/07/2009] [Accepted: 02/01/2010] [Indexed: 11/10/2022]

Frenz CM, Frenz DA. The Application of Regular Expression-Based Pattern Matching to Profiling the Developmental Factors that Contribute to the Development of the Inner Ear. Advances in Experimental Medicine and Biology 2010;680:165-71. [DOI: 10.1007/978-1-4419-5913-3_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]

David J, Irizarry KJL. Using the PubMatrix literature-mining resource to accelerate student-centered learning in a veterinary problem-based learning curriculum. Journal of Veterinary Medical Education 2009;36:202-208. [PMID: 19625669 DOI: 10.3138/jvme.36.2.202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/28/2023]

Yoo IH, Song M. Biomedical Ontologies and Text Mining for Biomedicine and Healthcare: A Survey. J Comput Sci Eng 2008. [DOI: 10.5626/jcse.2008.2.2.109] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/03/2022]

Fang YC, Huang HC, Juan HF. MeInfoText: associated gene methylation and cancer information from text mining. BMC Bioinformatics 2008;9:22. [PMID: 18194557 PMCID: PMC2258285 DOI: 10.1186/1471-2105-9-22] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Received: 05/22/2007] [Accepted: 01/14/2008] [Indexed: 12/02/2022]

The ‘Open Discovery’ Challenge. 2008. [DOI: 10.1007/978-3-540-68690-3_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]

Zhou D, He Y. Extracting interactions between proteins from the literature. J Biomed Inform 2007;41:393-407. [PMID: 18207462 DOI: 10.1016/j.jbi.2007.11.008] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.5] [Received: 03/13/2007] [Revised: 11/21/2007] [Accepted: 11/28/2007] [Indexed: 11/29/2022]

Frenz CM. Deafness mutation mining using regular expression based pattern matching. BMC Med Inform Decis Mak 2007;7:32. [PMID: 17961241 PMCID: PMC2180167 DOI: 10.1186/1472-6947-7-32] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 06/14/2007] [Accepted: 10/25/2007] [Indexed: 11/16/2022]

Hammamieh R, Chakraborty N, Wang Y, Laing M, Liu Z, Mulligan J, Jett M. GeneCite: a stand-alone open source tool for high-throughput literature and pathway mining. OMICS: A Journal of Integrative Biology 2007;11:143-51. [PMID: 17594234 DOI: 10.1089/omi.2007.4322] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]

Faccioli P, Ciceri GP, Provero P, Stanca AM, Morcia C, Terzi V. A combined strategy of "in silico" transcriptome analysis and web search engine optimization allows an agile identification of reference genes suitable for normalization in gene expression studies. Plant Molecular Biology 2007;63:679-88. [PMID: 17143578 DOI: 10.1007/s11103-006-9116-9] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.2] [Received: 09/19/2006] [Accepted: 11/12/2006] [Indexed: 05/12/2023]

Chen D, Müller HM, Sternberg PW. Automatic document classification of biological literature. BMC Bioinformatics 2006;7:370. [PMID: 16893465 PMCID: PMC1559726 DOI: 10.1186/1471-2105-7-370] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Received: 03/28/2006] [Accepted: 08/07/2006] [Indexed: 12/02/2022]

Jensen LJ, Saric J, Bork P. Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet 2006;7:119-29. [PMID: 16418747 DOI: 10.1038/nrg1768] [Citation(s) in RCA: 356] [Impact Index Per Article: 19.8] [Indexed: 11/08/2022]

Goetz T, von der Lieth CW. PubFinder: a tool for improving retrieval rate of relevant PubMed abstracts. Nucleic Acids Res 2005;33:W774-8. [PMID: 15980583 PMCID: PMC1160190 DOI: 10.1093/nar/gki429] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.6] [Indexed: 11/13/2022]

Bajic VB, Veronika M, Veladandi PS, Meka A, Heng MW, Rajaraman K, Pan H, Swarup S. Dragon Plant Biology Explorer. A text-mining tool for integrating associations between genetic and biochemical entities with genome annotation and biochemical terms lists. Plant Physiology 2005;138:1914-25. [PMID: 16172098 PMCID: PMC1183383 DOI: 10.1104/pp.105.060863] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 05/04/2023]

Abstract

We introduce a tool for text mining, Dragon Plant Biology Explorer (DPBE) that integrates information on Arabidopsis (Arabidopsis thaliana) genes with their functions, based on gene ontologies and biochemical entity vocabularies, and presents the associations as interactive networks. The associations are based on (1) user-provided PubMed abstracts; (2) a list of Arabidopsis genes compiled by The Arabidopsis Information Resource; (3) user-defined combinations of four vocabulary lists based on the ones developed by the general, plant, and Arabidopsis GO consortia; and (4) three lists developed here based on metabolic pathways, enzymes, and metabolites derived from AraCyc, BRENDA, and other metabolism databases. We demonstrate how various combinations can be applied to fields of (1) gene function and gene interaction analyses, (2) plant development, (3) biochemistry and metabolism, and (4) pharmacology of bioactive compounds. Furthermore, we show the suitability of DPBE for systems approaches by integration with "omics" platform outputs. Using a list of abiotic stress-related genes identified by microarray experiments, we show how this tool can be used to rapidly build an information base on the previously reported relationships. This tool complements the existing biological resources for systems biology by identifying potentially novel associations using text analysis between cellular entities based on genome annotation terms. Thus, it allows researchers to efficiently summarize existing information for a group of genes or pathways, so as to make better informed choices for designing validation experiments. Last, DPBE can be helpful for beginning researchers and graduate students to summarize vast information in an unfamiliar area. DPBE is freely available for academic and nonprofit users at http://research.i2r.a-star.edu.sg/DRAGON/ME2/.

Shah PK, Jensen LJ, Boué S, Bork P. Extraction of transcript diversity from scientific literature. PLoS Comput Biol 2005;1:e10. [PMID: 16103899 PMCID: PMC1183516 DOI: 10.1371/journal.pcbi.0010010] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Received: 02/01/2005] [Accepted: 05/21/2005] [Indexed: 11/26/2022]

Abstract

Transcript diversity generated by alternative splicing and associated mechanisms contributes heavily to the functional complexity of biological systems. The numerous examples of the mechanisms and functional implications of these events are scattered throughout the scientific literature. Thus, it is crucial to have a tool that can automatically extract the relevant facts and collect them in a knowledge base that can aid the interpretation of data from high-throughput methods. We have developed and applied a composite text-mining method for extracting information on transcript diversity from the entire MEDLINE database in order to create a database of genes with alternative transcripts. It contains information on tissue specificity, number of isoforms, causative mechanisms, functional implications, and experimental methods used for detection. We have mined this resource to identify 959 instances of tissue-specific splicing. Our results in combination with those from EST-based methods suggest that alternative splicing is the preferred mechanism for generating transcript diversity in the nervous system. We provide new annotations for 1,860 genes with the potential for generating transcript diversity. We assign the MeSH term “alternative splicing” to 1,536 additional abstracts in the MEDLINE database and suggest new MeSH terms for other events. We have successfully extracted information about transcript diversity and semiautomatically generated a database, LSAT, that can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression. LSAT (Literature Support for Alternative Transcripts) is publicly available at http://www.bork.embl.de/LSAT/.

Given the functional complexity of higher eukaryotes, the relatively small number of genes in the human and other mammalian genomes came as a surprise to the scientific community. Later it was discovered that the majority of genes are subject to alternative splicing (“cutting and pasting”) or associated mechanisms that ultimately increase the diversity of transcripts that code for proteins. Studies exploring transcript diversity are currently dominated by high-throughput experiments and computational methods; however, the quality of such data should be assessed against a reliable reference set based on single-gene studies. Unfortunately, the latter type of information is scattered throughout the scientific literature. The authors have thus developed a computational approach for extracting information on alternative transcripts from MEDLINE abstracts and used it to create a database, LSAT. LSAT (Literature Support for Alternative Transcripts) provides information for more than 4,000 genes from about 14,000 abstracts. This database can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression based on single-gene studies, which we show agrees well with EST-based studies (these studies involve tissue-specific splicing detected by the analysis of libraries of expressed sequence tags [ESTs]). These results indicate that mechanisms like alternative splicing, alternative promoters, and alternative polyadenylation work in concert to generate and regulate transcript diversity. More generally, information extraction of complex biological process seems feasible and can also complement large-scale data generation in other areas to assign functions to genes.

Schijvenaars BJA, Mons B, Weeber M, Schuemie MJ, van Mulligen EM, Wain HM, Kors JA. Thesaurus-based disambiguation of gene symbols. BMC Bioinformatics 2005;6:149. [PMID: 15958172 PMCID: PMC1183190 DOI: 10.1186/1471-2105-6-149] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.5] [Received: 11/22/2004] [Accepted: 06/16/2005] [Indexed: 11/28/2022]

Couto FM, Silva MJ, Coutinho PM. Finding genomic ontology terms in text using evidence content. BMC Bioinformatics 2005;6 Suppl 1:S21. [PMID: 15960834 PMCID: PMC1869014 DOI: 10.1186/1471-2105-6-s1-s21] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Indexed: 11/10/2022]

Suomela BP, Andrade MA. Ranking the whole MEDLINE database according to a large training set using text indexing. BMC Bioinformatics 2005;6:75. [PMID: 15790421 PMCID: PMC1274266 DOI: 10.1186/1471-2105-6-75] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.5] [Received: 09/08/2004] [Accepted: 03/24/2005] [Indexed: 12/03/2022]

Abstract

Background

The MEDLINE database contains over 12 million references to scientific literature, with about 3/4 of recent articles including an abstract of the publication. Retrieval of entries using queries with keywords is useful for human users that need to obtain small selections. However, particular analyses of the literature or database developments may need the complete ranking of all the references in the MEDLINE database as to their relevance to a topic of interest. This report describes a method that does this ranking using the differences in word content between MEDLINE entries related to a topic and the whole of MEDLINE, in a computational time appropriate for an article search query engine.

Results

We tested the capabilities of our system to retrieve MEDLINE references which are relevant to the subject of stem cells. We took advantage of the existing annotation of references with terms from the MeSH hierarchical vocabulary (Medical Subject Headings, developed at the National Library of Medicine). A training set of 81,416 references was constructed by selecting entries annotated with the MeSH term stem cells or some child in its sub tree. Frequencies of all nouns, verbs, and adjectives in the training set were computed and the ratios of word frequencies in the training set to those in the entire MEDLINE were used to score references. Self-consistency of the algorithm, benchmarked with a test set containing the training set and an equal number of references randomly selected from MEDLINE was better using nouns (79%) than adjectives (73%) or verbs (70%). The evaluation of the system with 6,923 references not used for training, containing 204 articles relevant to stem cells according to a human expert, indicated a recall of 65% for a precision of 65%.
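The scoring idea in the Results paragraph above (ratios of word frequencies in a topic training set to those in the whole collection) can be sketched as follows. The abstract does not give the exact formula, so the mean log-ratio, the smoothing floor, the whitespace tokenizer, and the tiny corpora are all assumptions for illustration; the published system also restricts scoring to particular parts of speech, which this sketch omits.

```python
import math
from collections import Counter

def relative_freq(docs):
    """Relative frequency of each lowercase token across a list of texts."""
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score(text, train_freq, background_freq, floor=1e-6):
    """Mean log ratio of training to background frequency (assumed formula)."""
    words = [w for w in text.lower().split() if w in background_freq]
    if not words:
        return 0.0
    ratios = (train_freq.get(w, floor) / background_freq[w] for w in words)
    return sum(math.log(r) for r in ratios) / len(words)

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# Hypothetical corpora: a topic training set and the full background.
training = ["stem cells differentiate", "stem cell renewal"]
background = training + ["protein binding assay", "kinase protein structure"]
train_freq = relative_freq(training)
background_freq = relative_freq(background)

on_topic = score("stem cell niche", train_freq, background_freq)
off_topic = score("protein kinase assay", train_freq, background_freq)
print(on_topic > off_topic)
print(precision_recall({1, 2, 3, 4}, {2, 4, 6}))
```

Sorting every reference by such a score yields a complete ranking, and choosing a score cutoff fixes the retrieved set on which precision and recall are then measured.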

Conclusion

This strategy appears to be useful for predicting the relevance of MEDLINE references to a given concept. The method is simple and can be used with any user-defined training set. Choice of the part of speech of the words used for classification has important effects on performance. Lists of words, scripts, and additional information are available from the web address .

Tiffin N, Kelso JF, Powell AR, Pan H, Bajic VB, Hide WA. Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Res 2005;33:1544-52. [PMID: 15767279 PMCID: PMC1065256 DOI: 10.1093/nar/gki296] [Citation(s) in RCA: 143] [Impact Index Per Article: 7.5] [Indexed: 11/12/2022]

Chang JT, Altman RB. Extracting and characterizing gene-drug relationships from the literature. Pharmacogenetics 2005;14:577-86. [PMID: 15475731 DOI: 10.1097/00008571-200409000-00002] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.9] [Indexed: 11/26/2022]

Collier N, Takeuchi K. Comparison of character-level and part of speech features for name recognition in biomedical texts. J Biomed Inform 2004;37:423-35. [PMID: 15542016 DOI: 10.1016/j.jbi.2004.08.008] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Received: 07/23/2004] [Indexed: 10/26/2022]

Abstract

The immense volume of data now available from experiments in molecular biology has led to an explosion in reported results, most of which are available only in unstructured text format. For this reason there has been great interest in text mining to aid fact extraction, document screening, citation analysis, and linkage with large gene and gene-product databases. In particular, there has been intensive investigation of the named entity (NE) task as a core technology in all of these tasks, driven by the availability of high-volume training sets such as the GENIA v3.02 corpus. Despite such large training sets, accuracy for biology NE has proven to be consistently far below the high levels of performance in the news domain, where F scores above 90 are commonly reported and can be considered near human performance. We argue that more rigorous analysis of the factors that contribute to a model's performance is crucial to discover where the underlying limitations are and what our future research direction should be. Our investigation in this paper reports on variations of two widely used feature types, part of speech (POS) tags and character-level orthographic features, and compares how these variations influence performance. We base our experiments on a proven state-of-the-art model, support vector machines, using a high-quality subset of 100 annotated MEDLINE abstracts. Experiments reveal that the best-performing features are orthographic features, with an F score of 72.6. Although the Brill tagger trained in-domain on the GENIA v3.02p POS corpus gives the best overall performance of any POS tagger, at an F score of 68.6, this is still significantly below the orthographic features. In combination, these two feature types appear to interfere with each other and degrade performance slightly, to an F score of 72.3.

Müller HM, Kenny EE, Sternberg PW. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2004;2:e309. [PMID: 15383839 PMCID: PMC517822 DOI: 10.1371/journal.pbio.0020309] [Citation(s) in RCA: 415] [Impact Index Per Article: 20.8] [Received: 11/17/2003] [Accepted: 07/19/2004] [Indexed: 11/19/2022]

Abstract

We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. Textpresso is a useful curation tool, as well as search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org.

With the increasing availability of full-text scientific papers online, new tools, such as Textpresso, will help to extract information and knowledge from research literature.

Pan H, Zuo L, Choudhary V, Zhang Z, Leow SH, Chong FT, Huang Y, Ong VWS, Mohanty B, Tan SL, Krishnan SPT, Bajic VB. Dragon TF Association Miner: a system for exploring transcription factor associations through text-mining. Nucleic Acids Res 2004;32:W230-4. [PMID: 15215386 PMCID: PMC441622 DOI: 10.1093/nar/gkh484] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022] Open

Rebholz-Schuhmann D, Marcel S, Albert S, Tolle R, Casari G, Kirsch H. Automatic extraction of mutations from Medline and cross-validation with OMIM. Nucleic Acids Res 2004;32:135-42. [PMID: 14704350 PMCID: PMC373272 DOI: 10.1093/nar/gkh162] [Citation(s) in RCA: 69] [Impact Index Per Article: 3.5] [Indexed: 11/13/2022] Open

Shah PK, Perez-Iratxeta C, Bork P, Andrade MA. Information extraction from full text scientific articles: where are the keywords? BMC Bioinformatics 2003;4:20. [PMID: 12775220 PMCID: PMC166134 DOI: 10.1186/1471-2105-4-20] [Citation(s) in RCA: 114] [Impact Index Per Article: 5.4] [Received: 03/05/2003] [Accepted: 05/29/2003] [Indexed: 11/12/2022] Open

Ponomarenko J, Merkulova T, Orlova G, Fokin O, Gorshkov E, Ponomarenko M. Mining DNA sequences to predict sites which mutations cause genetic diseases. Knowl Based Syst 2002. [DOI: 10.1016/s0950-7051(01)00144-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.2] [Indexed: 10/17/2022]

Rabow AA, Shoemaker RH, Sausville EA, Covell DG. Mining the National Cancer Institute's tumor-screening database: identification of compounds with similar cellular activities. J Med Chem 2002;45:818-40. [PMID: 11831894 DOI: 10.1021/jm010385b] [Citation(s) in RCA: 108] [Impact Index Per Article: 4.9] [Indexed: 11/28/2022]

Mellor JC, Yanai I, Clodfelter KH, Mintseris J, DeLisi C. Predictome: a database of putative functional links between proteins. Nucleic Acids Res 2002;30:306-9. [PMID: 11752322 PMCID: PMC99135 DOI: 10.1093/nar/30.1.306] [Citation(s) in RCA: 107] [Impact Index Per Article: 4.9] [Indexed: 11/14/2022] Open

Perez-Iratxeta C, Bork P, Andrade MA. XplorMed: a tool for exploring MEDLINE abstracts. Trends Biochem Sci 2001;26:573-5. [PMID: 11551795 DOI: 10.1016/s0968-0004(01)01926-0] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.0] [Indexed: 11/16/2022]

Kolesov G, Mewes HW, Frishman D. SNAPping up functionally related genes based on context information: a colinearity-free approach. J Mol Biol 2001;311:639-56. [PMID: 11518521 DOI: 10.1006/jmbi.2001.4701] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.3] [Indexed: 11/22/2022]

Bioinformatics and data knowledge: the new frontiers for nutrition and foods. Trends Food Sci Technol 2001. [DOI: 10.1016/s0924-2244(01)00089-9] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.0] [Indexed: 11/20/2022]

Jenssen TK, Laegreid A, Komorowski J, Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet 2001;28:21-8. [PMID: 11326270 DOI: 10.1038/ng0501-21] [Citation(s) in RCA: 382] [Impact Index Per Article: 16.6] [Indexed: 11/09/2022]

Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2001. [PMCID: PMC2447185 DOI: 10.1002/cfg.55] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022] Open

Data mining. Nat Biotechnol 2000. [DOI: 10.1038/80073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]