1
Fu L, Weng Z, Zhang J, Xie H, Cao Y. MMBERT: a unified framework for biomedical named entity recognition. Med Biol Eng Comput 2024; 62:327-341. [PMID: 37833517 DOI: 10.1007/s11517-023-02934-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2023] [Accepted: 09/07/2023] [Indexed: 10/15/2023]
Abstract
Named entity recognition (NER) is an important task in natural language processing (NLP). In recent years, NER has attracted much attention in the biomedical field. However, biomedical NER is more difficult than general-domain NER because biomedical NER datasets are scarce and biomedical named entities are complex and rare. In this paper, we propose a Transformer-based framework (MMBERT) to address these problems. To address the scarcity of biomedical NER datasets, we introduce ERNIE-Health, a new Chinese language representation model pre-trained on large-scale biomedical text corpora. To handle the complexity and rarity of biomedical named entities, we use BERT and CW-LSTM structures to obtain a joint feature vector of word-pair relations. In addition, we design a multi-granularity 2D convolution to refine the relationships and representations between word pairs. Finally, we design a convolutional neural network (CNN) structure and a co-predictor to improve the model's generalization capability and prediction accuracy. We conducted extensive experiments on three benchmark datasets, and the results show that our model achieves the best results among the compared baseline models.
Affiliation(s)
- Lei Fu
  - College of Electromechanical and Information Engineering, PuTian University, PuTian, 351100, Fujian Province, China
- Zuquan Weng
  - College of Biological Science and Engineering, Fuzhou University, Fuzhou, 350000, Fujian Province, China
  - The Centre for Big Data Research in Burns and Trauma, College of Mathematics and Computer Science, Fuzhou University, Fuzhou, 350000, Fujian Province, China
- Jiheng Zhang
  - College of Biological Science and Engineering, Fuzhou University, Fuzhou, 350000, Fujian Province, China
  - The Centre for Big Data Research in Burns and Trauma, College of Mathematics and Computer Science, Fuzhou University, Fuzhou, 350000, Fujian Province, China
- Haihe Xie
  - College of Electromechanical and Information Engineering, PuTian University, PuTian, 351100, Fujian Province, China
- Yiqing Cao
  - College of Electromechanical and Information Engineering, PuTian University, PuTian, 351100, Fujian Province, China

2
Groza T, Wu H, Dinger ME, Danis D, Hilton C, Bagley A, Davids JR, Luo L, Lu Z, Robinson PN. Term-BLAST-like alignment tool for concept recognition in noisy clinical texts. Bioinformatics 2023; 39:btad716. [PMID: 38001031 PMCID: PMC10710372 DOI: 10.1093/bioinformatics/btad716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Revised: 10/20/2023] [Accepted: 11/23/2023] [Indexed: 11/26/2023] Open
Abstract
MOTIVATION Methods for concept recognition (CR) in clinical texts have largely been tested on abstracts or articles from the medical literature. However, texts from electronic health records (EHRs) frequently contain spelling errors, abbreviations, and other nonstandard ways of representing clinical concepts. RESULTS Here, we present a method inspired by the BLAST algorithm for biosequence alignment that screens texts for potential matches on the basis of matching k-mer counts and scores candidates based on conformance to typical patterns of spelling errors derived from 2.9 million clinical notes. Our method, the Term-BLAST-like alignment tool (TBLAT) leverages a gold standard corpus for typographical errors to implement a sequence alignment-inspired method for efficient entity linkage. We present a comprehensive experimental comparison of TBLAT with five widely used tools. Experimental results show an increase of 10% in recall on scientific publications and 20% increase in recall on EHR records (when compared against the next best method), hence supporting a significant enhancement of the entity linking task. The method can be used stand-alone or as a complement to existing approaches. AVAILABILITY AND IMPLEMENTATION Fenominal is a Java library that implements TBLAT for named CR of Human Phenotype Ontology terms and is available at https://github.com/monarch-initiative/fenominal under the GNU General Public License v3.0.
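The k-mer screening step described in this abstract can be illustrated with a minimal sketch. It is not the TBLAT or Fenominal implementation: the function names, the choice of k = 3, and the `min_shared` threshold are assumptions made for illustration, and the spelling-error scoring stage is omitted.

```python
from collections import Counter


def kmers(text: str, k: int = 3) -> Counter:
    """Count overlapping character k-mers in a lowercased string."""
    s = text.lower()
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def shared_kmers(a: str, b: str, k: int = 3) -> int:
    """Number of k-mers two strings share (multiset intersection)."""
    return sum((kmers(a, k) & kmers(b, k)).values())


def screen(mention: str, terms: list[str], k: int = 3,
           min_shared: int = 3) -> list[tuple[str, int]]:
    """Keep ontology terms sharing at least `min_shared` k-mers with the
    mention, best match first. The threshold is a hypothetical value."""
    scored = [(t, shared_kmers(mention, t, k)) for t in terms]
    hits = [(t, n) for t, n in scored if n >= min_shared]
    return sorted(hits, key=lambda pair: -pair[1])


# A misspelled mention still shares most of its k-mers with the right term.
print(screen("seizurs", ["seizure", "scoliosis"]))  # [('seizure', 4)]
```

Counting shared k-mers is cheap, so it can act as a filter before any more expensive alignment or scoring is applied to the surviving candidates.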
Affiliation(s)
- Tudor Groza
  - Rare Care Centre, Perth Children’s Hospital, Nedlands, WA 6009, Australia
  - Genetics and Rare Diseases Program, Telethon Kids Institute, Nedlands, WA 6009, Australia
- Honghan Wu
  - Institute of Health Informatics, University College London, London WC1E 6BT, United Kingdom
- Marcel E Dinger
  - Pryzm Health, Sydney, NSW 2089, Australia
  - School of Life and Environmental Sciences, Faculty of Science, University of Sydney, NSW 2006, Australia
- Daniel Danis
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, United States
- Coleman Hilton
  - Shriners Children’s Corporate Headquarters, Tampa, FL 33607, United States
- Anita Bagley
  - Shriners Children's Northern California, Sacramento, CA 95817, United States
- Jon R Davids
  - Shriners Children's Northern California, Sacramento, CA 95817, United States
- Ling Luo
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States
- Peter N Robinson
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, United States
  - Institute for Systems Genomics, University of Connecticut, Farmington, CT 06032, United States

3
Tsujimura T, Miwa M, Sasaki Y. Large-scale neural biomedical entity linking with layer overwriting. J Biomed Inform 2023:104433. [PMID: 37385326 DOI: 10.1016/j.jbi.2023.104433] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 01/16/2023] [Revised: 06/10/2023] [Accepted: 06/19/2023] [Indexed: 07/01/2023]
Abstract
MOTIVATION Entity linking is the task of linking entity mentions to the database entries corresponding to the entity mentions. Entity linking enables the treatment of superficially different but semantically identical mentions as the same entity. Since millions of concepts are listed in biomedical databases, selecting the correct database entry for each targeted entity is challenging. Simple string matching between the word and each synonym in biomedical databases is insufficient to handle a wide variety of variants of biomedical entities appearing in the biomedical literature. Recent progress in neural approaches is promising for entity linking. Still, existing neural methods require sufficient data, which is difficult to prepare in biomedical entity linking that deals with millions of biomedical concepts. Therefore, we need to develop a new neural method to train entity-linking models over the sparse training data covering a very limited part of the biomedical concepts. RESULTS We have devised a pure neural model that classifies biomedical entity mentions into millions of biomedical concepts. The classifier employs (1) the layer overwriting that breaks through the performance ceiling during training, (2) training data augmentation using database entries that compensate for the problem of insufficient training data, and (3) the cosine similarity-based loss function that helps distinguish the millions of biomedical concepts. Our system using the proposed classifier was ranked first in the official run of the National NLP Clinical Challenges (n2c2) 2019 Track 3, which targeted linking medical/clinical entity mentions to 434,056 Concept Unique Identifier (CUI) entries. We also applied our system to the MedMentions dataset, which has 3.2M candidate concepts. Experimental results confirmed the same advantages of our proposed method. 
We further evaluated our system on the NLM-CHEM corpus with 350K candidate concepts, and our system achieved a new state-of-the-art performance on the corpus. AVAILABILITY https://github.com/tti-coin/bio-linking Contact: makoto.miwa@toyota-ti.ac.jp.
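The abstract does not give the exact form of the cosine similarity-based loss, so the following is only a sketch of one common variant: scaled cosine similarities between a mention vector and each concept vector are used as logits for a softmax cross-entropy. The function names and the `scale` value are assumptions, not details taken from the paper.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_softmax_loss(mention: list[float], concepts: list[list[float]],
                        gold: int, scale: float = 10.0) -> float:
    """Cross-entropy over softmax(scale * cosine(mention, concept_i)).

    `scale` is a hypothetical temperature; cosine values lie in [-1, 1],
    so some scaling is usually needed for the softmax to become sharp.
    """
    logits = [scale * cosine(mention, c) for c in concepts]
    top = max(logits)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - logits[gold]


concepts = [[1.0, 0.0], [0.0, 1.0]]
print(cosine_softmax_loss([1.0, 0.0], concepts, gold=0))  # near 0
print(cosine_softmax_loss([1.0, 0.0], concepts, gold=1))  # near 10
```

Because only the direction of each vector matters, concepts with few or no training mentions are not penalized for having small-norm embeddings, which is one plausible reason to prefer a cosine-based score when the concept inventory is very large.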
Affiliation(s)
- Tomoki Tsujimura
  - Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, 468-8511, Aichi, Japan
- Makoto Miwa
  - Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, 468-8511, Aichi, Japan
- Yutaka Sasaki
  - Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, 468-8511, Aichi, Japan

4
Tharmakulasingam M, Gardner B, La Ragione R, Fernando A. Rectified Classifier Chains for Prediction of Antibiotic Resistance From Multi-Labelled Data With Missing Labels. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:625-636. [PMID: 35130168 DOI: 10.1109/tcbb.2022.3148577] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
Predicting Antimicrobial Resistance (AMR) from genomic data has important implications for human and animal healthcare, especially given its potential for more rapid diagnostics and informed treatment choices. With recent advances in sequencing technologies, applying machine learning techniques to AMR prediction has shown promising results. Despite this, the literature lacks methodologies suitable for multi-drug AMR prediction, especially where samples with missing labels exist. To address this shortcoming, we introduce a Rectified Classifier Chain (RCC) method for predicting multi-drug resistance. This RCC method was tested using annotated features of genomic sequences and compared with similar multi-label classification methodologies. We found that applying the eXtreme Gradient Boosting (XGBoost) base model to our RCC model outperformed the second-best model, an XGBoost-based binary relevance model, by 3.3% in Hamming accuracy and 7.8% in F1-score. Additionally, we note that machine learning models applied to AMR prediction in the literature are typically unsuitable for identifying biomarkers informative of their decisions; in this study, we show that biomarkers contributing to AMR prediction can also be identified using the proposed RCC method. We expect this can facilitate genome annotation and pave the way towards identifying new biomarkers indicative of AMR.

5
Zheng X, Du H, Luo X, Tong F, Song W, Zhao D. BioByGANS: biomedical named entity recognition by fusing contextual and syntactic features through graph attention network in node classification framework. BMC Bioinformatics 2022; 23:501. [PMID: 36418937 PMCID: PMC9682683 DOI: 10.1186/s12859-022-05051-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/05/2022] [Accepted: 11/10/2022] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND Automatic and accurate recognition of various biomedical named entities from literature is an important task of biomedical text mining, which is the foundation of extracting biomedical knowledge from unstructured texts into structured formats. Using the sequence labeling framework and deep neural networks to implement biomedical named entity recognition (BioNER) is a common method at present. However, this method often underutilizes syntactic features such as the dependencies and topology of sentences. Integrating semantic and syntactic features into the BioNER model is therefore an urgent problem. RESULTS In this paper, we propose a novel biomedical named entity recognition model, named BioByGANS (BioBERT/SpaCy-Graph Attention Network-Softmax), which uses a graph to model the dependencies and topology of a sentence and formulates the BioNER task as a node classification problem. This formulation can introduce more topological features of language and is no longer concerned only with the distance between words in the sequence. First, we use periods to segment sentences and spaces and symbols to segment words. Second, contextual features are encoded by BioBERT, and syntactic features such as parts of speech, dependencies and topology are preprocessed by SpaCy. A graph attention network is then used to generate a fused representation that considers both the contextual and the syntactic features. Last, a softmax function is used to calculate the probabilities and get the results. We conduct experiments on 8 benchmark datasets, and our proposed model outperforms existing BioNER state-of-the-art methods on the BC2GM, JNLPBA, BC4CHEMD, BC5CDR-chem, BC5CDR-disease, NCBI-disease, Species-800, and LINNAEUS datasets, and achieves F1-scores of 85.15%, 78.16%, 92.97%, 94.74%, 87.74%, 91.57%, 75.01%, 90.99%, respectively.
CONCLUSION The experimental results on 8 biomedical benchmark datasets demonstrate the effectiveness of our model, and indicate that formulating the BioNER task into a node classification problem and combining syntactic features into the graph attention networks can significantly improve model performance.
Affiliation(s)
- Xiangwen Zheng
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Haijian Du
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Xiaowei Luo
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Fan Tong
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Wei Song
  - Beijing MedPeer Information Technology Co., Ltd, Beijing, 102300, China
- Dongsheng Zhao
  - Academy of Military Medical Sciences, Beijing, 100039, China

6
Allaoui H, Rached N, Marrakchi N, Cherif A, Mosbah A, Messadi E. In Silico Study of the Mechanisms Underlying the Action of the Snake Natriuretic-Like Peptide Lebetin 2 during Cardiac Ischemia. Toxins (Basel) 2022; 14. [PMID: 36422961 DOI: 10.3390/toxins14110787] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/18/2022] [Revised: 11/07/2022] [Accepted: 11/09/2022] [Indexed: 11/16/2022] Open
Abstract
Lebetin 2 (L2), a natriuretic-like peptide (NP), exerts potent cardioprotection in myocardial infarction (MI), with stronger effects than B-type natriuretic peptide (BNP). To determine the molecular mechanisms underlying its cardioprotection effect, we used molecular modeling, molecular docking and molecular dynamics (MD) simulation to describe the binding mode, key interaction residues as well as mechanistic insights into L2 interaction with NP receptors (NPRs). L2 binding affinity was determined for human, rat, mouse and chicken NPRs, and the stability of receptor-ligand complexes ascertained during 100 ns-long MD simulations. We found that L2 exhibited higher affinity for all human NPRs compared to BNP, with a rank preference for NPR-A > NPR-C > NPR-B. Moreover, L2 affinity for human NPR-A and NPR-C was higher in other species. Both docking and MD studies revealed that the NPR-C-L2 interaction was stronger in all species compared to BNP. Due to its higher affinity to human receptors, L2 could be used as a therapeutic approach in MI patients. Moreover, the stronger interaction of L2 with NPR-C could highlight a new L2 signaling pathway that would explain its additional effects during cardiac ischemia. Thus, L2 is a promising candidate for drug design toward novel compounds with high potency, affinity and stability.

7
Lücking A, Driller C, Stoeckel M, Abrami G, Pachzelt A, Mehler A. Multiple annotation for biodiversity: developing an annotation framework among biology, linguistics and text technology. Lang Resour Eval 2021. [DOI: 10.1007/s10579-021-09553-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Biodiversity information is contained in countless digitized and unprocessed scholarly texts. Although automated extraction of these data has been gaining momentum for years, there are still innumerable text sources that are poorly accessible and require a more advanced range of methods to extract relevant information. To improve the access to semantic biodiversity information, we have launched the BIOfid project (www.biofid.de) and have developed a portal to access the semantics of German language biodiversity texts, mainly from the 19th and 20th century. However, to make such a portal work, a couple of methods had to be developed or adapted first. In particular, text-technological information extraction methods were needed, which extract the required information from the texts. Such methods draw on machine learning techniques, which in turn are trained by learning data. To this end, among others, we gathered the bio text corpus, which is a cooperatively built resource, developed by biologists, text technologists, and linguists. A special feature of bio is its multiple annotation approach, which takes into account both general and biology-specific classifications, and by this means goes beyond previous, typically taxon- or ontology-driven proper name detection. We describe the design decisions and the genuine Annotation Hub Framework underlying the bio annotations and present agreement results. The tools used to create the annotations are introduced, and the use of the data in the semantic portal is described. Finally, some general lessons, in particular with multiple annotation projects, are drawn.

8
Johnson NA, Smith CH. Novel Molecular Resources to Facilitate Future Genetics Research on Freshwater Mussels (Bivalvia: Unionidae). Data 2020; 5:65. [DOI: 10.3390/data5030065] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022] Open
Abstract
Molecular data have been an integral tool in the resolution of the evolutionary relationships and systematics of freshwater mussels, despite the limited number of nuclear markers available for Sanger sequencing. To facilitate future studies, we evaluated the phylogenetic informativeness of loci from the recently published anchored hybrid enrichment (AHE) probe set Unioverse and developed novel Sanger primer sets to amplify two protein-coding nuclear loci with high net phylogenetic informativeness scores: fem-1 homolog C (FEM1) and UbiA prenyltransferase domain-containing protein 1 (UbiA). We report the methods used for marker development, along with the primer sequences and optimized PCR and thermal cycling conditions. To demonstrate the utility of these markers, we provide haplotype networks, DNA alignments, and summary statistics regarding the sequence variation for the two protein-coding nuclear loci (FEM1 and UbiA). Additionally, we compare the DNA sequence variation of FEM1 and UbiA to three loci commonly used in freshwater mussel genetic studies: the mitochondrial genes cytochrome c oxidase subunit 1 (CO1) and NADH dehydrogenase subunit 1 (ND1), and the nuclear internal transcribed spacer 1 (ITS1). All five loci distinguish among the three focal species (Potamilus fragilis, Potamilus inflatus, and Potamilus purpuratus), and the sequence variation was highest for ND1, followed by CO1, ITS1, UbiA, and FEM1, respectively. The newly developed Sanger PCR primers and methodologies for extracting additional loci from AHE probe sets have great potential to facilitate molecular investigations targeting supraspecific relationships in freshwater mussels, but may be of limited utility at shallow taxonomic scales.

9
Wang Y, Xu J, Kong L, Liu T, Yi L, Wang H, Huang WE, Zheng C. Raman-deuterium isotope probing to study metabolic activities of single bacterial cells in human intestinal microbiota. Microb Biotechnol 2019; 13:572-583. [PMID: 31821744 PMCID: PMC7017835 DOI: 10.1111/1751-7915.13519] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 09/16/2019] [Accepted: 11/15/2019] [Indexed: 12/22/2022] Open
Abstract
Human intestinal microbiota is important to host health and is associated with various diseases. It is a challenge to identify the functions and metabolic activity of microorganisms at the single-cell level in the gut microbial community. In this study, we applied Raman microspectroscopy and deuterium isotope probing (Raman-DIP) to quantitatively measure the metabolic activities of intestinal bacteria from two individuals and analysed lipids and phenylalanine metabolic pathways of functional microorganisms in situ. After anaerobically incubating the human faeces with heavy water (D2O), D2O with specific substrates (glucose, tyrosine, tryptophan and oleic acid) and deuterated glucose, the C-D band in single-cell Raman spectra appeared in some bacteria in faeces, due to the Raman shift from the C-H band. This Raman shift was used to indicate the general metabolic activity and the activities in response to the specific substrates. In the two individuals' intestinal microbiota, the structures of the microbial communities were different and the general metabolic activities were 76 ± 1.0% and 30 ± 2.0%. We found that glucose, but not tyrosine, tryptophan and oleic acid, significantly stimulated metabolic activity of the intestinal bacteria. We also demonstrated that the bacteria within microbiota preferably used glucose to synthesize fatty acids in the faecal environment, whilst they used glucose to synthesize phenylalanine in a laboratory growth environment (e.g. LB medium). Our work provides a useful approach for investigating the metabolic activity in situ and revealing different pathways of human intestinal microbiota at the single-cell level.
Affiliation(s)
- Yi Wang
  - School of Environment, Harbin Institute of Technology, Harbin, 150090, China
  - Guangdong Provincial Key Laboratory of Soil and Groundwater Pollution Control, School of Environmental Science and Engineering, Southern University of Science and Technology, Shenzhen, 518055, China
  - Department of Engineering Science, University of Oxford, Parks Road, Oxford, OX1 3PJ, UK
- Jiabao Xu
  - Department of Engineering Science, University of Oxford, Parks Road, Oxford, OX1 3PJ, UK
- Lingchao Kong
  - School of Environment, Harbin Institute of Technology, Harbin, 150090, China
- Tang Liu
  - Guangdong Provincial Key Laboratory of Soil and Groundwater Pollution Control, School of Environmental Science and Engineering, Southern University of Science and Technology, Shenzhen, 518055, China
- Lingbo Yi
  - Health Time Gene Institute, Shenzhen, 518000, China
- Wei E Huang
  - Department of Engineering Science, University of Oxford, Parks Road, Oxford, OX1 3PJ, UK
- Chunmiao Zheng
  - Guangdong Provincial Key Laboratory of Soil and Groundwater Pollution Control, School of Environmental Science and Engineering, Southern University of Science and Technology, Shenzhen, 518055, China

10
Labbé C, Grima N, Gautier T, Favier B, Byrne JA. Semi-automated fact-checking of nucleotide sequence reagents in biomedical research publications: The Seek & Blastn tool. PLoS One 2019; 14:e0213266. [PMID: 30822319 PMCID: PMC6396917 DOI: 10.1371/journal.pone.0213266] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 07/12/2018] [Accepted: 02/18/2019] [Indexed: 12/14/2022] Open
Abstract
Nucleotide sequence reagents are verifiable experimental reagents in biomedical publications, because their sequence identities can be independently verified and compared with associated text descriptors. We have previously reported that incorrectly identified nucleotide sequence reagents are characteristic of highly similar human gene knockdown studies, some of which have been retracted from the literature on account of possible research fraud. Because of the throughput limitations of manual verification of nucleotide sequences, we developed a semi-automated fact checking tool, Seek & Blastn, to verify the targeting or non-targeting status of published nucleotide sequence reagents. From previously described and unknown corpora of 48 and 155 publications, respectively, Seek & Blastn correctly extracted 304/342 (88.9%) and 1066/1522 (70.0%) nucleotide sequences and a predicted targeting/ non-targeting status. Seek & Blastn correctly predicted the targeting/ non-targeting status of 293/304 (96.4%) and 988/1066 (92.7%) of the correctly extracted nucleotide sequences. A total of 38/39 (97.4%) or 31/79 (39.2%) Seek & Blastn predictions of incorrect nucleotide sequence reagent use were correct in the two literature corpora. Combined Seek & Blastn and manual analyses identified a list of 91 misidentified nucleotide sequence reagents, which could be built upon through future studies. In summary, incorrect nucleotide sequence reagents represent an under-recognized source of error within the biomedical literature, and fact checking tools such as Seek & Blastn may help to identify papers and manuscripts affected by these errors.
Affiliation(s)
- Cyril Labbé
  - Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France
  - * E-mail: (CL); (JAB)
- Natalie Grima
  - Molecular Oncology Laboratory, Children’s Cancer Research Unit, Kids Research, The Children’s Hospital at Westmead, Westmead, New South Wales, Australia
- Thierry Gautier
  - INSERM U1209/ CNRS UMR 5309, Univ. Grenoble Alpes, Grenoble, France
- Bertrand Favier
  - Univ. Grenoble Alpes, Team GREPI, Etablissement Français du Sang, La Tronche, France
- Jennifer A. Byrne
  - Molecular Oncology Laboratory, Children’s Cancer Research Unit, Kids Research, The Children’s Hospital at Westmead, Westmead, New South Wales, Australia
  - Discipline of Child and Adolescent Health, Faculty of Medicine and Health, The University of Sydney, Westmead, New South Wales, Australia
  - * E-mail: (CL); (JAB)

11
Wu H, Lu D, Hyder M, Zhang S, Quinney SK, Desta Z, Li L. DrugMetab: An Integrated Machine Learning and Lexicon Mapping Named Entity Recognition Method for Drug Metabolite. CPT Pharmacometrics Syst Pharmacol 2018; 7:709-717. [PMID: 30033622 PMCID: PMC6263660 DOI: 10.1002/psp4.12340] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 01/22/2018] [Accepted: 06/25/2018] [Indexed: 11/29/2022] Open
Abstract
Drug metabolites (DMs) are critical in pharmacology research areas, such as drug metabolism pathways and drug-drug interactions. However, there is no terminology dictionary containing comprehensive drug metabolite names, and there is no named entity recognition (NER) algorithm focusing on drug metabolite identification. In this article, we developed a novel NER system, DrugMetab, to identify DMs from the PubMed abstracts. DrugMetab utilizes the features characterized from the Part-of-Speech, drug index, and pre/suffix, and determines DMs within context. To evaluate the performance, a gold-standard corpus was manually constructed. In this task, DrugMetab with sequential minimal optimization (SMO) classifier achieves 0.89 precision, 0.77 recall, and 0.83 F-measure in the internal testing set; and 0.86 precision, 0.85 recall, and 0.86 F-measure in the external validation set. We further compared the performance between DrugMetab and whatizitChemical, which was designed for identifying small molecules or chemical entities. DrugMetab outperformed whatizitChemical, which had a lower recall rate of 0.65.
Affiliation(s)
- Heng‐Yi Wu
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, Ohio, USA
- Deshun Lu
  - Center for Computational Biology and Bioinformatics, School of Medicine, Indiana University, Indianapolis, Indiana, USA
- Mustafa Hyder
  - Division of Clinical Pharmacology, School of Medicine, Indiana University, Indianapolis, Indiana, USA
- Shijun Zhang
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, Ohio, USA
- Sara K. Quinney
  - Division of Clinical Pharmacology, School of Medicine, Indiana University, Indianapolis, Indiana, USA
- Zeruesenay Desta
  - Division of Clinical Pharmacology, School of Medicine, Indiana University, Indianapolis, Indiana, USA
- Lang Li
  - Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, Ohio, USA

12
Abstract
As a significant determinant in the development of named entity recognition, phenotypic descriptions are normally presented differently in biomedical literature with the use of complicated semantics. In this paper, a novel approach has been proposed to identify plant phenotypes by adopting word embedding to sentence embedding cascaded approach. We make use of a word embedding method to find high-frequency phenotypes with original sentences used as input in a sentence embedding method. In doing so, a variety of complicated phenotypic expressions can be recognized accurately. Besides, the state-of-the-art word representation models have been compared and among them, skip-gram with negative sampling was selected with the best performance. To evaluate the performance of our approach, we applied it to the dataset composed of 56 748 PubMed abstracts of model organism Arabidopsis thaliana. The experiment results showed that our approach yielded the best performance, as it achieved a 2.588-fold increase in terms of the number of new phenotypic descriptions when compared to the original phenotype ontology.

13
Sumathipala S, Yamada K, Unehara M, Suzuki I. Protein Entity Name Recognition Using Orthographic, Morphological and Proteinhood Features. J Adv Comput Intell Intell Inform 2015. [DOI: 10.20965/jaciii.2015.p0843] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
Abstract
Protein name identification in text is an important and challenging fundamental precursor in biomedical information processing. For example, accurate identification of protein names affects the finding of protein-protein interactions from biomedical literature. In this paper, we present an efficient protein name identification technique based on a rich set of features: orthographic, morphological as well as Proteinhood features which are introduced newly in this study. The method was evaluated on GENIA corpus with the use of different machine learning algorithms. The highest values for precision 92.1%, recall 86.5% and F-measure 89.2% were achieved on Random Forest, while reducing the training and testing time significantly. We studied and showed the impact of the Proteinhood feature in protein identification as well as the effect of tuning the parameters of the machine learning algorithm.
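As a quick consistency check on the figures reported above, the F-measure can be recomputed from precision and recall. The sketch assumes the balanced F1 definition (the harmonic mean of precision and recall); the abstract does not state which F-measure variant was used.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values for Random Forest on the GENIA corpus, in percent.
print(round(f_measure(92.1, 86.5), 1))  # 89.2
```

The recomputed value matches the reported 89.2%, which is consistent with the balanced definition having been used.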
|
14
|
Abstract
A scientist's choice of research problem affects his or her personal career trajectory. Scientists' combined choices affect the direction and efficiency of scientific discovery as a whole. In this paper, we infer preferences that shape problem selection from patterns of published findings and then quantify their efficiency. We represent research problems as links between scientific entities in a knowledge network. We then build a generative model of discovery informed by qualitative research on scientific problem selection. We map salient features from this literature to key network properties: an entity's importance corresponds to its degree centrality, and a problem's difficulty corresponds to the network distance it spans. Drawing on millions of papers and patents published over 30 years, we use this model to infer the typical research strategy used to explore chemical relationships in biomedicine. This strategy generates conservative research choices focused on building up knowledge around important molecules. These choices become more conservative over time. The observed strategy is efficient for initial exploration of the network and supports scientific careers that require steady output, but is inefficient for science as a whole. Through supercomputer experiments on a sample of the network, we study thousands of alternatives and identify strategies much more efficient at exploring mature knowledge networks. We find that increased risk-taking and the publication of experimental failures would substantially improve the speed of discovery. We consider institutional shifts in grant making, evaluation, and publication that would help realize these efficiencies.
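The two network properties named above, degree centrality for an entity's importance and network distance for a problem's difficulty, can be illustrated on a toy graph. This is only a sketch under stated assumptions: the graph `toy`, the helper names, and the normalization by `n - 1` are hypothetical choices for illustration, not the authors' generative model.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly linked to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def network_distance(graph, source, target):
    """Shortest path length by breadth-first search; None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in graph[node]:
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None


# Hypothetical undirected knowledge network stored as adjacency sets.
toy = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
}

importance = degree_centrality(toy)               # "A" is the best-connected entity
difficulty = network_distance(toy, "B", "E")      # a link B-E would span this distance
```

In this reading, a conservative research choice links a high-centrality node to a nearby one, while a riskier choice proposes a link that spans a longer distance.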
|
15
|
Abstract
BACKGROUND Authentic identification of plants is essential for exploiting their medicinal properties as well as to stop the adulteration and malpractices with the trade of the same. OBJECTIVE To identify a herbal powder obtained from a herbalist in the local vicinity of Rajkot, Gujarat, using deoxyribonucleic acid (DNA) barcoding and molecular tools. MATERIALS AND METHODS The DNA was extracted from a herbal powder and selected Cassia species, followed by the polymerase chain reaction (PCR) and sequencing of the rbcL barcode locus. Thereafter the sequences were subjected to National Center for Biotechnology Information (NCBI) basic local alignment search tool (BLAST) analysis, followed by the protein three-dimensional structure determination of the rbcL protein from the herbal powder and Cassia species namely Cassia fistula, Cassia tora and Cassia javanica (sequences obtained in the present study), Cassia roxburghii, and Cassia abbreviata (sequences retrieved from Genbank). Further, the multiple and pairwise structural alignments were carried out in order to identify the herbal powder. RESULTS The nucleotide sequences obtained from the selected species of Cassia were submitted to Genbank (Accession No. JX141397, JX141405, JX141420). The NCBI BLAST analysis of the rbcL protein from the herbal powder showed an equal sequence similarity (with reference to different parameters like E value, maximum identity, total score, query coverage) to C. javanica and C. roxburghii. In order to solve the ambiguities of the BLAST result, a protein structural approach was implemented. The protein homology models obtained in the present study were submitted to the protein model database (PM0079748-PM0079753). The pairwise structural alignment of the herbal powder (as template) and C. javanica and C. roxburghii (as targets individually) revealed a close similarity of the herbal powder with C. javanica.
CONCLUSION A strategy as used here, incorporating the integrated use of DNA barcoding and protein structural analyses could be adopted, as a novel rapid and economic procedure, especially in cases when protein coding loci are considered. SUMMARY Authentic identification of plants is essential for exploiting their medicinal properties as well as to stop the adulteration and malpractices with the trade of the same. A herbal powder was obtained from a herbalist in the local vicinity of Rajkot, Gujarat. An integrated approach using DNA barcoding and structural analyses was carried out to identify the herbal powder. The herbal powder was identified as Cassia javanica L.
Affiliation(s)
- Bhavisha P. Sheth
- Department of Biosciences, Centre for Advanced Studies in Plant Biotechnology and Genetic Engineering, Saurashtra University, Rajkot, Gujarat, India
| | - Vrinda S. Thaker
- Department of Biosciences, Centre for Advanced Studies in Plant Biotechnology and Genetic Engineering, Saurashtra University, Rajkot, Gujarat, India
| |
|
16
|
Blair DR, Wang K, Nestorov S, Evans JA, Rzhetsky A. Quantifying the impact and extent of undocumented biomedical synonymy. PLoS Comput Biol 2014; 10:e1003799. [PMID: 25255227 PMCID: PMC4177665 DOI: 10.1371/journal.pcbi.1003799] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 12/18/2013] [Accepted: 06/26/2014] [Indexed: 12/14/2022] Open
Abstract
Synonymous relationships among biomedical terms are extensively annotated within specialized terminologies, implying that synonymy is important for practical computational applications within this field. It remains unclear, however, whether text mining actually benefits from documented synonymy and whether existing biomedical thesauri provide adequate coverage of these linguistic relationships. In this study, we examine the impact and extent of undocumented synonymy within a very large compendium of biomedical thesauri. First, we demonstrate that missing synonymy has a significant negative impact on named entity normalization, an important problem within the field of biomedical text mining. To estimate the amount of synonymy currently missing from thesauri, we develop a probabilistic model for the construction of synonym terminologies that is capable of handling a wide range of potential biases, and we evaluate its performance using the broader domain of near-synonymy among general English words. Our model predicts that over 90% of these relationships are currently undocumented, a result that we support experimentally through “crowd-sourcing.” Finally, we apply our model to biomedical terminologies and predict that they are missing the vast majority (>90%) of the synonymous relationships they intend to document. Overall, our results expose the dramatic incompleteness of current biomedical thesauri and suggest the need for “next-generation,” high-coverage lexical terminologies. Automated systems that extract and integrate information from the research literature have become common in biomedicine. As the same meaning can be expressed in many distinct but synonymous ways, access to comprehensive thesauri may enable such systems to maximize their performance.
Here, we establish the importance of synonymy for a specific text-mining task (named-entity normalization), and we suggest that current thesauri may be woefully inadequate in their documentation of this linguistic phenomenon. To test this claim, we develop a model for estimating the amount of missing synonymy. We apply our model to both biomedical terminologies and general-English thesauri, predicting massive amounts of missing synonymy for both lexicons. Furthermore, we verify some of our predictions for the latter domain through “crowd-sourcing.” Overall, our work highlights the dramatic incompleteness of current biomedical thesauri, and to mitigate this issue, we propose the creation of “living” terminologies, which would automatically harvest undocumented synonymy and help smart machines enrich biomedicine.
Affiliation(s)
- David R. Blair
- Institute for Genomics and Systems Biology, University of Chicago, Chicago, Illinois, United States of America
- Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, Illinois, United States of America
| | - Kanix Wang
- Institute for Genomics and Systems Biology, University of Chicago, Chicago, Illinois, United States of America
- Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, Illinois, United States of America
| | - Svetlozar Nestorov
- Computation Institute, University of Chicago, Chicago, Illinois, United States of America
| | - James A. Evans
- Computation Institute, University of Chicago, Chicago, Illinois, United States of America
- Department of Sociology, University of Chicago, Chicago, Illinois, United States of America
| | - Andrey Rzhetsky
- Institute for Genomics and Systems Biology, University of Chicago, Chicago, Illinois, United States of America
- Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, Illinois, United States of America
- Computation Institute, University of Chicago, Chicago, Illinois, United States of America
- Departments of Medicine and Human Genetics, University of Chicago, Chicago, Illinois, United States of America
| |
|
17
|
Collier N, Tran MV, Le HQ, Ha QT, Oellrich A, Rebholz-Schuhmann D. Learning to recognize phenotype candidates in the auto-immune literature using SVM re-ranking. PLoS One 2013; 8:e72965. [PMID: 24155869 PMCID: PMC3796529 DOI: 10.1371/journal.pone.0072965] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 02/11/2013] [Accepted: 07/15/2013] [Indexed: 11/19/2022] Open
Abstract
The identification of phenotype descriptions in the scientific literature, case reports and patient records is a rewarding task for bio-medical text mining. Any progress will support knowledge discovery and linkage to other resources. However, because of their wide variation, a number of challenges still remain in terms of their identification and semantic normalisation before they can be fully exploited for research purposes. This paper presents novel techniques for identifying potential complex phenotype mentions by exploiting a hybrid model based on machine learning, rules and dictionary matching. A systematic study is made of how to combine sequence labels from these modules as well as the merits of various ontological resources. We evaluated our approach on a subset of Medline abstracts cited by the Online Mendelian Inheritance in Man database related to auto-immune diseases. Using partial matching, the best micro-averaged F-score for phenotypes and five other entity classes was 79.9%. A best performance of 75.3% was achieved for phenotype candidates using all semantic resources. We observed the advantage of using SVM-based learn-to-rank for sequence label combination over maximum entropy and a priority list approach. The results indicate that the identification of simple entity types such as chemicals and genes is robustly supported by single semantic resources, whereas phenotypes require combinations. Altogether we conclude that our approach coped well with the compositional structure of phenotypes in the auto-immune domain.
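A micro-averaged F-score, as reported above, pools the true positives, false positives and false negatives of all entity classes before computing precision and recall, so frequent classes weigh more than rare ones. The sketch below shows that pooling; the function name and the per-class counts in `toy_counts` are invented for illustration and are not the study's data.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-class dicts with 'tp', 'fp' and 'fn' counts."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two entity classes (not taken from the paper).
toy_counts = {
    "phenotype": {"tp": 30, "fp": 10, "fn": 20},
    "gene": {"tp": 50, "fp": 10, "fn": 0},
}
score = micro_f1(toy_counts)
```

A macro-average would instead compute one F-score per class and take their unweighted mean, which treats a rare class such as phenotypes on equal terms with common ones.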
Affiliation(s)
- Nigel Collier
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, United Kingdom
- National Institute of Informatics, Tokyo, Japan
| | - Mai-vu Tran
- National Institute of Informatics, Tokyo, Japan
- Knowledge Technology Laboratory, University of Engineering and Technology - VNU, Hanoi, Vietnam
| | - Hoang-quynh Le
- National Institute of Informatics, Tokyo, Japan
- Knowledge Technology Laboratory, University of Engineering and Technology - VNU, Hanoi, Vietnam
| | - Quang-Thuy Ha
- Knowledge Technology Laboratory, University of Engineering and Technology - VNU, Hanoi, Vietnam
| | - Anika Oellrich
- Mouse Informatics Group, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
| | - Dietrich Rebholz-Schuhmann
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, United Kingdom
- Department of Computational Linguistics, University of Zurich, Zurich, Switzerland
| |
|
18
|
Dinh D, Tamine L, Boubekeur F. Factors affecting the effectiveness of biomedical document indexing and retrieval based on terminologies. Artif Intell Med 2013; 57:155-67. [DOI: 10.1016/j.artmed.2012.08.006] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 10/09/2011] [Revised: 08/26/2012] [Accepted: 08/30/2012] [Indexed: 11/26/2022]
|
19
|
García-Remesal M, García-Ruiz A, Pérez-Rey D, de la Iglesia D, Maojo V. Using nanoinformatics methods for automatically identifying relevant nanotoxicology entities from the literature. Biomed Res Int 2012; 2013:410294. [PMID: 23509721 PMCID: PMC3591181 DOI: 10.1155/2013/410294] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 05/08/2012] [Revised: 07/03/2012] [Accepted: 07/10/2012] [Indexed: 01/12/2023]
Abstract
Nanoinformatics is an emerging research field that uses informatics techniques to collect, process, store, and retrieve data, information, and knowledge on nanoparticles, nanomaterials, and nanodevices and their potential applications in health care. In this paper, we have focused on the solutions that nanoinformatics can provide to facilitate nanotoxicology research. For this, we have taken a computational approach to automatically recognize and extract nanotoxicology-related entities from the scientific literature. The desired entities belong to four different categories: nanoparticles, routes of exposure, toxic effects, and targets. The entity recognizer was trained using a corpus that we specifically created for this purpose and was validated by two nanomedicine/nanotoxicology experts. We evaluated the performance of our entity recognizer using 10-fold cross-validation. The precisions range from 87.6% (targets) to 93.0% (routes of exposure), while recall values range from 82.6% (routes of exposure) to 87.4% (toxic effects). These results prove the feasibility of using computational approaches to reliably perform different named entity recognition (NER)-dependent tasks, such as for instance augmented reading or semantic searches. This research is a "proof of concept" that can be expanded to stimulate further developments that could assist researchers in managing data, information, and knowledge at the nanolevel, thus accelerating research in nanomedicine.
Affiliation(s)
- Miguel García-Remesal
- Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Boadilla del Monte, 28660 Madrid, Spain.
| |
|
20
|
Thessen AE, Cui H, Mozzherin D. Applications of natural language processing in biodiversity science. Adv Bioinformatics 2012; 2012:391574. [PMID: 22685456 DOI: 10.1155/2012/391574] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Received: 11/04/2011] [Accepted: 02/15/2012] [Indexed: 12/11/2022] Open
Abstract
Centuries of biological knowledge are contained in the massive body of scientific literature, written for human-readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript will briefly discuss the key steps in applying information extraction tools to enhance biodiversity science.
|
21
|
Galvez C, de Moya‐Anegón F. A dictionary‐based approach to normalizing gene names in one domain of knowledge from the biomedical literature. Journal of Documentation 2012. [DOI: 10.1108/00220411211200301] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/17/2022]
|
22
|
NG EYK, TAY LL. STUDY OF BLAST DNA MATCHING TOOLKITS. J MECH MED BIOL 2011. [DOI: 10.1142/s0219519404001090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The beginning of bioinformatics saw the development of algorithms that enabled the storage of nucleic acid and protein sequences in annotated databases, allowing researchers to exchange information about gene and protein sequences easily and quickly. Databases are growing extremely fast, hence it is essential to use the current databases, which are easily available on the Web. This tutorial deals with the concept of DNA matching by using BLAST programs such as BLASTN and MEGABLAST to perform sequence similarity searches and to evaluate their relative effectiveness. The BLAST results are interpreted, and the two algorithms are compared across varying parameters such as word size, query sequence length and gap X drop-off value. It is found that as the word size increases, the computation time for both BLASTN and MEGABLAST decreases. BLASTN is more sensitive than MEGABLAST since it uses a shorter default word size of 11, compared with MEGABLAST's default word size of 28. The search strategy offers a tradeoff between speed and sensitivity. As for BLAST 2 Sequences, MEGABLAST could perform better than BLASTN only for large word sizes greater than or equal to 16 and for longer sequences.
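The link between word size and sensitivity can be seen with exact word matching alone: a seed is only found where query and subject share an uninterrupted run of at least one word length. The sketch below is a simplified illustration of that idea, not BLAST's actual seeding code; the function name and both sequences are made up for the example.

```python
def shared_words(query: str, subject: str, word_size: int) -> int:
    """Count distinct exact words of length word_size found in both sequences."""
    def words(seq):
        return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}
    return len(words(query) & words(subject))


# Hypothetical 40-base subject and a query that differs by one central mismatch.
subject = "ACGTTGCAAGCTTAGGCTAACGGATCCATGCAAGTCCGTA"
query = subject[:20] + "T" + subject[21:]

seeds_w11 = shared_words(query, subject, 11)  # short words fit beside the mismatch
seeds_w28 = shared_words(query, subject, 28)  # every 28-base window spans the mismatch
```

A single substitution in the middle of a 40-base match leaves no identical 28-base word, while 11-base words still match on either side. This is the speed and sensitivity tradeoff the abstract describes: longer words produce fewer seeds to extend but miss more divergent matches.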
Affiliation(s)
- E. Y.-K. NG
- School of Mechanical & Production Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
| | - L. L. TAY
- ST Microelectronics Pte Ltd, 28 Ang Mo Kio Industrial Park 2, Singapore 569508, Singapore
| |
|
23
|
Garten Y, Coulet A, Altman RB. Recent progress in automatically extracting information from the pharmacogenomic literature. Pharmacogenomics 2011; 11:1467-89. [PMID: 21047206 DOI: 10.2217/pgs.10.136] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Indexed: 11/21/2022] Open
Abstract
The biomedical literature holds our understanding of pharmacogenomics, but it is dispersed across many journals. In order to integrate our knowledge, connect important facts across publications and generate new hypotheses we must organize and encode the contents of the literature. By creating databases of structured pharmacogenomic knowledge, we can make the value of the literature much greater than the sum of the individual reports. We can, for example, generate candidate gene lists or interpret surprising hits in genome-wide association studies. Text mining automatically adds structure to the unstructured knowledge embedded in millions of publications, and recent years have seen a surge in work on biomedical text mining, some specific to pharmacogenomics literature. These methods enable extraction of specific types of information and can also provide answers to general, systemic queries. In this article, we describe the main tasks of text mining in the context of pharmacogenomics, summarize recent applications and anticipate the next phase of text mining applications.
Affiliation(s)
- Yael Garten
- Biomedical Informatics, Stanford University, Stanford, CA 94305, USA
| |
|
24
|
DeSantis TZ, Keller K, Karaoz U, Alekseyenko AV, Singh NNS, Brodie EL, Pei Z, Andersen GL, Larsen N. Simrank: Rapid and sensitive general-purpose k-mer search tool. BMC Ecol 2011; 11:11. [PMID: 21524302 PMCID: PMC3097142 DOI: 10.1186/1472-6785-11-11] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Received: 07/09/2010] [Accepted: 04/27/2011] [Indexed: 02/01/2023] Open
Abstract
Background Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project (http://nihroadmap.nih.gov/hmp). Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available. Results Here we present a stand-alone utility, Simrank, which allows users to rapidly identify the database strings most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-language datasets found Simrank 10X to 928X faster depending on the dataset. Conclusions Simrank provides molecular ecologists with a high-throughput, open source choice for comparing large sequence sets to find similarity.
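The general idea of a k-mer similarity search, ranking database strings by how many of the query's k-mers they contain, can be sketched in a few lines. This is an illustrative toy only and makes no claim about how Simrank is implemented; the function names, the scoring by fraction of shared query k-mers, and the example strings are all assumptions.

```python
def kmers(text: str, k: int) -> set:
    """All distinct substrings of length k."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def rank_by_shared_kmers(query: str, database: dict, k: int = 7) -> list:
    """Return (score, name) pairs sorted by the fraction of query k-mers shared."""
    query_kmers = kmers(query, k)
    scored = []
    for name, seq in database.items():
        shared = len(query_kmers & kmers(seq, k))
        score = shared / len(query_kmers) if query_kmers else 0.0
        scored.append((score, name))
    return sorted(scored, reverse=True)


# Hypothetical miniature database; k is kept small because the strings are short.
database = {"a": "GATTACAGG", "b": "CCGATTCC", "c": "TTTTTTTT"}
ranking = rank_by_shared_kmers("GATTACA", database, k=3)
```

Because the comparison works on plain substrings, the same routine applies to DNA, RNA, protein or natural-language text, which is consistent with the range of datasets mentioned in the abstract.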
Affiliation(s)
- Todd Z DeSantis
- Ecology Department, Lawrence Berkeley National Laboratory, Berkeley, USA.
| |
|
25
|
Ruiz JC, D'Afonseca V, Silva A, Ali A, Pinto AC, Santos AR, Rocha AAMC, Lopes DO, Dorella FA, Pacheco LGC, Costa MP, Turk MZ, Seyffert N, Moraes PMRO, Soares SC, Almeida SS, Castro TLP, Abreu VAC, Trost E, Baumbach J, Tauch A, Schneider MPC, McCulloch J, Cerdeira LT, Ramos RTJ, Zerlotini A, Dominitini A, Resende DM, Coser EM, Oliveira LM, Pedrosa AL, Vieira CU, Guimarães CT, Bartholomeu DC, Oliveira DM, Santos FR, Rabelo ÉM, Lobo FP, Franco GR, Costa AF, Castro IM, Dias SRC, Ferro JA, Ortega JM, Paiva LV, Goulart LR, Almeida JF, Ferro MIT, Carneiro NP, Falcão PRK, Grynberg P, Teixeira SMR, Brommonschenkel S, Oliveira SC, Meyer R, Moore RJ, Miyoshi A, Oliveira GC, Azevedo V. Evidence for reductive genome evolution and lateral acquisition of virulence functions in two Corynebacterium pseudotuberculosis strains. PLoS One 2011; 6:e18551. [PMID: 21533164 PMCID: PMC3078919 DOI: 10.1371/journal.pone.0018551] [Citation(s) in RCA: 69] [Impact Index Per Article: 5.3] [Received: 11/29/2010] [Accepted: 03/11/2011] [Indexed: 02/02/2023] Open
Abstract
Background Corynebacterium pseudotuberculosis, a Gram-positive, facultative intracellular pathogen, is the etiologic agent of the disease known as caseous lymphadenitis (CL). CL mainly affects small ruminants, such as goats and sheep; it also causes infections in humans, though rarely. This species is distributed worldwide, but it has the most serious economic impact in Oceania, Africa and South America. Although C. pseudotuberculosis causes major health and productivity problems for livestock, little is known about the molecular basis of its pathogenicity. Methodology and Findings We characterized two C. pseudotuberculosis genomes (Cp1002, isolated from goats; and CpC231, isolated from sheep). Analysis of the predicted genomes showed high similarity in genomic architecture, gene content and genetic order. When C. pseudotuberculosis was compared with other Corynebacterium species, it became evident that this pathogenic species has lost numerous genes, resulting in one of the smallest genomes in the genus. Other differences that could be part of the adaptation to pathogenicity include a lower GC content, of about 52%, and a reduced gene repertoire. The C. pseudotuberculosis genome also includes seven putative pathogenicity islands, which contain several classical virulence factors, including genes for fimbrial subunits, adhesion factors, iron uptake and secreted toxins. Additionally, all of the virulence factors in the islands have characteristics that indicate horizontal transfer. Conclusions These particular genome characteristics of C. pseudotuberculosis, as well as its acquired virulence factors in pathogenicity islands, provide evidence of its lifestyle and of the pathogenicity pathways used by this pathogen in the infection process. All genomes cited in this study are available in the NCBI Genbank database (http://www.ncbi.nlm.nih.gov/genbank/) under accession numbers CP001809 and CP001829.
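GC content, one of the genome properties compared above, is the fraction of guanine and cytosine bases among all unambiguous bases in a sequence. A minimal sketch follows; the function name and the choice to ignore ambiguous symbols such as `N` are assumptions for illustration, and the example strings are not real genomic data.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C among the A, C, G and T bases of a sequence."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


# Toy sequence: four of its eight bases are G or C.
example = gc_content("GGCCAATT")
```

Applied to a complete genome sequence, the same ratio expressed as a percentage gives the kind of figure quoted in the abstract (about 52%).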
Affiliation(s)
- Jerônimo C. Ruiz
- Research Center René Rachou, Oswaldo Cruz Foundation, Belo Horizonte, Minas Gerais, Brazil
| | - Vívian D'Afonseca
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Artur Silva
- Department of Genetics, Federal University of Pará, Belém, Pará, Brazil
| | - Amjad Ali
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Anne C. Pinto
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Anderson R. Santos
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Aryanne A. M. C. Rocha
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Débora O. Lopes
- Health Sciences Center, Federal University of São João Del Rei, Divinópolis, Minas Gerais, Brazil
| | - Fernanda A. Dorella
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Luis G. C. Pacheco
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Department of Biointeraction Sciences, Federal University of Bahia, Salvador, Bahia, Brazil
| | - Marcília P. Costa
- Department of Veterinary Medicine, State University of Ceará, Fortaleza, Ceará, Brazil
| | - Meritxell Z. Turk
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Núbia Seyffert
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Pablo M. R. O. Moraes
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Siomar C. Soares
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Sintia S. Almeida
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Thiago L. P. Castro
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Vinicius A. C. Abreu
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
| | - Eva Trost
- Department of Genetics, University of Bielefeld, CeBiTech, Bielefeld, Nordrhein-Westfalen, Germany
| | - Jan Baumbach
- Department of Computer Science, Max-Planck-Institut für Informatik, Saarbrücken, Saarland, Germany
| | - Andreas Tauch
- Department of Genetics, University of Bielefeld, CeBiTech, Bielefeld, Nordrhein-Westfalen, Germany
| | | | - John McCulloch
- Department of Genetics, Federal University of Pará, Belém, Pará, Brazil
| | | | | | - Adhemar Zerlotini
- Research Center René Rachou, Oswaldo Cruz Foundation, Belo Horizonte, Minas Gerais, Brazil
| | - Anderson Dominitini
- Research Center René Rachou, Oswaldo Cruz Foundation, Belo Horizonte, Minas Gerais, Brazil
| | - Daniela M. Resende
- Research Center René Rachou, Oswaldo Cruz Foundation, Belo Horizonte, Minas Gerais, Brazil
- Department of Pharmaceutical Sciences, Federal University of Ouro Preto, Ouro Preto, Minas Gerais, Brazil
| | - Elisângela M. Coser
- Research Center René Rachou, Oswaldo Cruz Foundation, Belo Horizonte, Minas Gerais, Brazil
| | - Luciana M. Oliveira
- Department of Physics, Federal University of Ouro Preto, Ouro Preto, Minas Gerais, Brazil
| | - André L. Pedrosa
- Department of Pharmaceutical Sciences, Federal University of Ouro Preto, Ouro Preto, Minas Gerais, Brazil
- Department of Biological Sciences, Federal University of Triangulo Mineiro, Uberaba, Minas Gerais, Brazil
| | - Carlos U. Vieira
- Department of Genetics and Biochemistry, Federal University of Uberlândia, Uberlândia, Minas Gerais, Brazil
- Cláudia T. Guimarães
- Brazilian Agricultural Research Corporation (EMBRAPA), Sete Lagoas, Minas Gerais, Brazil
- Daniela C. Bartholomeu
- Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Diana M. Oliveira
- Department of Veterinary Medicine, State University of Ceará, Fortaleza, Ceará, Brazil
- Fabrício R. Santos
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Élida Mara Rabelo
- Department of Parasitology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Francisco P. Lobo
- Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Glória R. Franco
- Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Ana Flávia Costa
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Ieso M. Castro
- Department of Pharmacy, Federal University of Ouro Preto, Ouro Preto, Minas Gerais, Brazil
- Sílvia Regina Costa Dias
- Department of Parasitology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Jesus A. Ferro
- Department of Technology, State University of São Paulo, Jaboticabal, São Paulo, Brazil
- José Miguel Ortega
- Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Luciano V. Paiva
- Department of Chemistry, Federal University of Lavras, Lavras, Minas Gerais, Brazil
- Luiz R. Goulart
- Department of Genetics and Biochemistry, Federal University of Uberlândia, Uberlândia, Minas Gerais, Brazil
- Juliana Franco Almeida
- Department of Genetics and Biochemistry, Federal University of Uberlândia, Uberlândia, Minas Gerais, Brazil
- Maria Inês T. Ferro
- Department of Technology, State University of São Paulo, Jaboticabal, São Paulo, Brazil
- Newton P. Carneiro
- Brazilian Agricultural Research Corporation (EMBRAPA), Sete Lagoas, Minas Gerais, Brazil
- Paula R. K. Falcão
- Brazilian Agricultural Research Corporation (EMBRAPA), Campinas, São Paulo, Brazil
- Priscila Grynberg
- Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Santuza M. R. Teixeira
- Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Sérgio Brommonschenkel
- Department of Plant Pathology, Federal University of Viçosa, Viçosa, Minas Gerais, Brazil
- Sérgio C. Oliveira
- Department of Biochemistry and Immunology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Roberto Meyer
- Department of Biointeraction Sciences, Federal University of Bahia, Salvador, Bahia, Brazil
- Anderson Miyoshi
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Guilherme C. Oliveira
- Research Center René Rachou, Oswaldo Cruz Foundation, Belo Horizonte, Minas Gerais, Brazil
- Center of Excellence in Bioinformatics, National Institute of Science and Technology, Research Center René Rachou, Oswaldo Cruz Foundation, Belo Horizonte, Minas Gerais, Brazil
- Vasco Azevedo
- Department of General Biology, Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
|
26
|
|
27
|
|
28
|
Hudson DM, Mattatall NR, Uribe E, Richards RC, Gong H, Ewart KV. Cystine-mediated oligomerization of the Atlantic salmon serum C-type lectin. Biochim Biophys Acta 2011; 1814:283-9. [PMID: 21109028 DOI: 10.1016/j.bbapap.2010.11.004] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 06/16/2010] [Revised: 11/08/2010] [Accepted: 11/10/2010] [Indexed: 11/20/2022]
Abstract
The Atlantic salmon (Salmo salar) serum lectin (SSL) is a C-type lectin that binds to bacteria including salmon pathogens. SSL has been shown to be oligomeric in salmon serum and it displays a stoichiometric band-laddering pattern when analyzed by SDS-PAGE under non-reducing conditions. In this study, a model was generated for SSL isoform 2 in silico in order to identify cysteines that are available to form intermolecular disulfide bonds facilitating oligomerization. Then, recombinant SSL was expressed in E. coli and mutants were produced at positions Cys72 and Cys149. The SSL preparations were purified by metal-affinity chromatography and shown to be functional by carbohydrate-affinity chromatography. The recombinant SSL formed oligomers, which were evident by non-reducing covalent cross-linking and non-reducing SDS-PAGE; however, the band patterns were different for the mutants, with the maximal and predominant multimer sizes distinct from the wild-type recombinant lectin. Further examination of oligomerization by size exclusion chromatography revealed a subunit number from 35 to at least 110 for the wild-type recombinant SSL and subunit numbers below 9 for each mutant SSL oligomer. Thus, both cysteines were found to contribute to oligomerization of SSL.
|
29
|
Cukrowska B, Motyl I, Kozáková H, Schwarzer M, Górecki RK, Klewicka E, Śliżewska K, Libudzisz Z. Probiotic Lactobacillus strains: in vitro and in vivo studies. Folia Microbiol (Praha) 2010; 54:533-7. [DOI: 10.1007/s12223-009-0077-7] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Received: 07/16/2009] [Revised: 11/19/2009] [Indexed: 11/29/2022]
|
30
|
|
31
|
Tatar S, Cicekli I. Two learning approaches for protein name extraction. J Biomed Inform 2009; 42:1046-55. [DOI: 10.1016/j.jbi.2009.05.004] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 03/04/2009] [Accepted: 05/07/2009] [Indexed: 10/20/2022]
|
32
|
Bobby P, Balaji S, Sathyanath V, Eapen SJ. JUZBOX: a web server for extracting biomedical words from the protein sequence. Bioinformation 2009; 4:179-81. [PMID: 20461154 PMCID: PMC2859571 DOI: 10.6026/97320630004179] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2009] [Revised: 07/31/2009] [Accepted: 09/11/2009] [Indexed: 11/25/2022] Open
Abstract
The recognition of gene/protein names in the literature is one of the pivotal steps in processing the biological literature for information extraction or data mining. We have compiled a lexicon of biomedical words (conserved patterns/potential motifs) composed only of the 20 letters that denote amino acids. The remaining 6 letters of the English alphabet (B, J, O, U, X, Z) are treated as invalid amino acid characters (in our context). For convenience, we have jumbled these 6 letters into the term 'JUZBOX', and words containing these characters were filtered out of the biomedical lexicon. Undoubtedly, the generation of biomedical words from protein sequences using JUZBOX has applications specific to functional annotation.
Affiliation(s)
- Paul Bobby
- Indian Institute of Spices Research, Calicut, Kerala, India
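The letter filter described in the JUZBOX abstract above can be illustrated with a minimal sketch. It assumes a plain list of candidate words as input; the function names `is_amino_acid_word` and `filter_lexicon` are hypothetical and are not taken from the JUZBOX server itself.

```python
# Letters that the abstract lists as invalid amino acid characters.
INVALID_LETTERS = set("BJOUXZ")


def is_amino_acid_word(word: str) -> bool:
    """Return True if `word` is alphabetic and avoids the six invalid letters."""
    upper = word.upper()
    return upper.isalpha() and upper.isascii() and not (set(upper) & INVALID_LETTERS)


def filter_lexicon(words):
    """Keep only the words that could be spelled with amino acid one-letter codes."""
    return [w for w in words if is_amino_acid_word(w)]


# Example with made-up input words: BOX and ZINC contain invalid letters.
print(filter_lexicon(["KINASE", "ACTIN", "BOX", "LIGAND", "ZINC"]))
# prints ['KINASE', 'ACTIN', 'LIGAND']
```

A word that passes this filter can then be searched for as a literal substring of a protein sequence, which is presumably the step the server builds on.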
|
33
|
Abstract
We present a tunable, machine vision-based strategy for automated annotation of virtual small molecule databases. The proposed strategy is based on the use of a machine vision-based tool for extracting structure diagrams in research articles and converting them into connection tables, a virtual "Chemical Expert" system for screening the converted structures based on the adjustable levels of estimated conversion accuracy, and a fragment-based measure for calculating intermolecular similarity. For annotation, calculated chemical similarity between the converted structures and entries in a virtual small molecule database is used to establish the links. The overall annotation performances can be tuned by adjusting the cutoff threshold of the estimated conversion accuracy. We perform an annotation test which attempts to link 121 journal articles registered in PubMed to entries in PubChem which is the largest, publicly accessible chemical database. Two cases of tests are performed, and their results are compared to see how the overall annotation performances are affected by the different threshold levels of the estimated accuracy of the converted structure. Our work demonstrates that over 45% of the articles could have true positive links to entries in the PubChem database with promising recall and precision rates in both tests. Furthermore, we illustrate that the Chemical Expert system which can screen converted structures based on the adjustable levels of estimated conversion accuracy is a key factor impacting the overall annotation performance. We propose that this machine vision-based strategy can be incorporated with the text-mining approach to facilitate extraction of contextual scientific knowledge about a chemical structure, from the scientific literature.
Affiliation(s)
- Jungkap Park
- Department of Mechanical Engineering, University of Michigan, Ann Arbor, Michigan 48109
- Gus R. Rosania
- Department of Pharmaceutical Sciences, University of Michigan, Ann Arbor, Michigan 48109
- Kazuhiro Saitou
- Department of Mechanical Engineering, University of Michigan, Ann Arbor, Michigan 48109
|
34
|
Huang KC, Geller J, Halper M, Perl Y, Xu J. Using WordNet synonym substitution to enhance UMLS source integration. Artif Intell Med 2009; 46:97-109. [PMID: 19117739 PMCID: PMC2755556 DOI: 10.1016/j.artmed.2008.11.008] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 05/27/2008] [Revised: 08/15/2008] [Accepted: 11/09/2008] [Indexed: 11/21/2022]
Abstract
OBJECTIVE Synonym-substitution algorithms have been developed for the purpose of matching source vocabulary terms with existing Unified Medical Language System (UMLS) terms during the integration process. A drawback is the possible explosion in the number of newly generated (potential) synonyms, which can tax computational and expert review resources. Experiments are run using a synonym-substitution approach based on WordNet to see how constraining two methodological parameters, namely, "maximum number of substitutions per term" and "maximum term length," affects performance. Our hypothesis is that these values can be constrained rather tightly--thus greatly speeding up the methodology--without a marked decline in the additional matches produced. Furthermore, we investigate whether a limitation on only the first of the two parameters is sufficient to achieve the same results. METHODS A four-stage synonym-substitution methodology using WordNet is presented. A group of experiments is carried out in which the two methodological parameters "maximum number of substitutions per term" and "maximum term length" are varied. The purpose is to examine their effect on the growth in the number of potential synonyms generated and the associated loss of results. The experiments are based on the re-integration of the "Minimal Standard Terminology" (MST) into the UMLS. Synonym-substitution matches found to be inconsistent with the current content of the UMLS and thus deemed to be incorrect are further manually scrutinized as an audit of the original integration of the MST. RESULTS An increase of 11% in the number of "MST term/UMLS term" matches was achieved using the synonym-substitution methodology. Importantly, this result prevailed when tight threshold values (such as a maximum of two synonym substitutions per term) were imposed on the parameters. 
Furthermore, it was found that limiting only the "maximum number of substitutions per term" parameter was sufficient to obtain the performance enhancement. During the additional audit phase, a number of the reported mismatches were actually seen to be correct, representing an additional 10% increase in the number of matches obtained. CONCLUSION A synonym-substitution methodology that utilizes WordNet is a useful automated aid in UMLS source integration. Experiments showed that there was a significant speed-up but no degradation in match results when the methodology's "maximum number of substitutions per term" parameter was relatively tightly constrained. The methodology also helped to discover errors in the MST's original integration, and improve the quality of the UMLS's conceptual content.
Affiliation(s)
- Kuo-Chuan Huang
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102-1982, USA.
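The effect of the two parameters discussed in the abstract above can be seen in a small sketch of synonym substitution. This is not the authors' four-stage methodology: a hand-made `synonyms` dictionary stands in for WordNet, "maximum term length" is assumed to be counted in words, and the function name `synonym_variants` is hypothetical.

```python
from itertools import combinations, product


def synonym_variants(term, synonyms, max_substitutions=2, max_term_length=5):
    """Generate variants of `term` by replacing up to `max_substitutions` words.

    `synonyms` maps a lowercase word to a list of replacement words.
    Terms longer than `max_term_length` words are skipped entirely.
    """
    words = term.lower().split()
    if len(words) > max_term_length:
        return set()
    # Positions of the words for which at least one synonym is known.
    replaceable = [i for i, w in enumerate(words) if w in synonyms]
    variants = set()
    for k in range(1, max_substitutions + 1):
        for positions in combinations(replaceable, k):
            choices = [synonyms[words[i]] for i in positions]
            for picks in product(*choices):
                new_words = list(words)
                for i, replacement in zip(positions, picks):
                    new_words[i] = replacement
                variants.add(" ".join(new_words))
    return variants


# Toy synonym table (an assumption for illustration, not WordNet output).
toy_synonyms = {"kidney": ["renal"], "stone": ["calculus"]}
print(sorted(synonym_variants("kidney stone", toy_synonyms, max_substitutions=1)))
# prints ['kidney calculus', 'renal stone']
```

Because the number of candidates grows with every extra position that may be substituted, capping `max_substitutions` at a small value keeps the candidate set small, which is the trade-off the abstract reports on.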
|
35
|
|
36
|
Abstract
Background One of the difficulties in mapping biomedical named entities, e.g. genes, proteins, chemicals and diseases, to their concept identifiers stems from the potential variability of the terms. Soft string matching is a possible solution to the problem, but its inherent heavy computational cost discourages its use when the dictionaries are large or when real time processing is required. A less computationally demanding approach is to normalize the terms by using heuristic rules, which enables us to look up a dictionary in a constant time regardless of its size. The development of good heuristic rules, however, requires extensive knowledge of the terminology in question and thus is the bottleneck of the normalization approach. Results We present a novel framework for discovering a list of normalization rules from a dictionary in a fully automated manner. The rules are discovered in such a way that they minimize the ambiguity and variability of the terms in the dictionary. We evaluated our algorithm using two large dictionaries: a human gene/protein name dictionary built from BioThesaurus and a disease name dictionary built from UMLS. Conclusions The experimental results showed that automatically discovered rules can perform comparably to carefully crafted heuristic rules in term mapping tasks, and the computational overhead of rule application is small enough that a very fast implementation is possible. This work will help improve the performance of term-concept mapping tasks in biomedical information extraction especially when good normalization heuristics for the target terminology are not fully known.
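The normalization approach described in the abstract above, in which terms are rewritten by rules and then looked up in a hash table, can be sketched as follows. The two rules in `RULES` are hand-written placeholders chosen for illustration; the paper's contribution is to discover such rules automatically, which this sketch does not attempt. The concept identifier `GENE:1` and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Placeholder normalization rules: (pattern, replacement), applied in order.
RULES = [
    (re.compile(r"[-_/]"), " "),  # treat common separators as spaces
    (re.compile(r"\s+"), " "),    # collapse runs of whitespace
]


def normalize(term):
    """Lowercase `term` and apply every rule in order."""
    text = term.lower()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


def build_index(dictionary):
    """Map each normalized name to the set of concept identifiers that use it."""
    index = defaultdict(set)
    for concept_id, names in dictionary.items():
        for name in names:
            index[normalize(name)].add(concept_id)
    return index


def lookup(index, mention):
    """Average constant-time lookup of a mention after normalization."""
    return index.get(normalize(mention), set())


toy_dictionary = {"GENE:1": ["NF-kappa B", "NFKB1"]}
toy_index = build_index(toy_dictionary)
print(lookup(toy_index, "nf kappa  b"))
# prints {'GENE:1'}
```

The lookup cost depends on the length of the mention and the number of rules, not on the size of the dictionary, which is the property the abstract contrasts with soft string matching.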
|
37
|
|
38
|
Chae JM, Oh HB, Choi SE, Cha CH, Kim MH, Jung SY. [Development of a system for extracting the information of candidate tumor markers reported in biomedical literatures]. Korean J Lab Med 2008; 28:79-87. [PMID: 18309259 DOI: 10.3343/kjlm.2008.28.1.79] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Since the human genome project was completed in 2003, there have been numerous reports on cancer and related markers. This study aimed to develop a system that automatically extracts information on the relationship between cancer and tumor markers from the biomedical literature. METHODS Named entities of tumor markers were recognized by both a dictionary-based method and a support vector machine, a machine learning technique. Named entities of cancers were recognized by the MeSH dictionary. RESULTS Relational and filtering keywords were selected after annotating 160 abstracts from PubMed. Relational information was extracted only when one of the relational keywords was in an appropriate position along the parse tree of a sentence with both tumor marker and disease entities. The performance of the system developed in this study was evaluated with another set of 77 abstracts. With the relational and filtering keywords used in the system, precision was 94.38% and recall was 66.14%, while without the expert knowledge precision was 49.16% and recall was 69.29%. CONCLUSIONS We developed a system that can extract relational information between a tumor and its markers by incorporating expert knowledge into the system. A system exploiting expert knowledge in this way could serve as a reference when developing information extraction systems in other medical fields.
Affiliation(s)
- Jeong Min Chae
- Department of Computer Science Education, Korea University, Seoul, Korea
|
39
|
Furlong LI, Dach H, Hofmann-Apitius M, Sanz F. OSIRISv1.2: a named entity recognition system for sequence variants of genes in biomedical literature. BMC Bioinformatics 2008; 9:84. [PMID: 18251998 PMCID: PMC2277400 DOI: 10.1186/1471-2105-9-84] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Received: 06/05/2007] [Accepted: 02/05/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Single Nucleotide Polymorphisms, among other types of sequence variants, constitute key elements in genetic epidemiology and pharmacogenomics. While sequence data about genetic variation is found in databases such as dbSNP, clues about the functional and phenotypic consequences of the variations are generally found in biomedical literature. The identification of the relevant documents and the extraction of the information from them are hampered by the large size of literature databases and the lack of widely accepted standard notation for biomedical entities. Thus, automatic systems for the identification of citations of allelic variants of genes in biomedical texts are required. RESULTS Our group has previously reported the development of OSIRIS, a system aimed at the retrieval of literature about allelic variants of genes (http://ibi.imim.es/osirisform.html). Here we describe the development of a new version of OSIRIS (OSIRISv1.2, http://ibi.imim.es/OSIRISv1.2.html) which incorporates a new entity recognition module and is built on top of a local mirror of the MEDLINE collection and HgenetInfoDB: a database that collects data on human gene sequence variations. The new entity recognition module is based on a pattern-based search algorithm for the identification of variation terms in the texts and their mapping to dbSNP identifiers. The performance of OSIRISv1.2 was evaluated on a manually annotated corpus, resulting in 99% precision, 82% recall, and an F-score of 0.89. As an example, the application of the system for collecting literature citations for the allelic variants of genes related to the diseases intracranial aneurysm and breast cancer is presented. CONCLUSION OSIRISv1.2 can be used to link literature references to dbSNP database entries with high accuracy, and therefore is suitable for collecting current knowledge on gene sequence variations and supporting the functional annotation of variation databases. 
The application of OSIRISv1.2 in combination with controlled vocabularies like MeSH provides a way to identify associations of biomedical interest, such as those that relate SNPs with diseases.
Affiliation(s)
- Laura I Furlong
- Research Unit on Biomedical Informatics (GRIB), IMIM, UPF, PRBB, c/ Dr. Aiguader 88, E-08003 Barcelona, Spain.
|
40
|
Soanes KH, Ewart KV, Mattatall NR. Recombinant production and characterization of the carbohydrate recognition domain from Atlantic salmon C-type lectin receptor C (SCLRC). Protein Expr Purif 2008; 59:38-46. [PMID: 18272393 DOI: 10.1016/j.pep.2008.01.001] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 10/30/2007] [Revised: 01/08/2008] [Accepted: 01/09/2008] [Indexed: 11/20/2022]
Abstract
The Atlantic salmon C-type lectin receptor C (SCLRC) locus encodes a potential oligomeric type II receptor. C-type lectins recognize carbohydrates in a Ca(2+)-dependent manner through structurally conserved, yet functionally diverse, C-type lectin-like domains (CTLDs). Many conserved amino acids in animal CTLDs are present in SCLRC, with the notable exception of an asparagine crucially involved in Ca(2+)- and carbohydrate-binding, which is tyrosine in SCLRC. SCLRC also contains six cysteines that form three disulfide bonds. Although SCLRC was originally identified as an up-regulated transcript responding to Aeromonas salmonicida infection, the biological role of this protein is still unknown. To study the structure and ligand binding properties of SCLRC, we created a homology model of the 17kDa CTLD and produced it as an affinity-tagged protein in the periplasm of Escherichia coli by co-expression of proteins that facilitate disulfide bond formation. The recombinant form of SCLRC was characterized by a protease protection assay, a solid-phase carbohydrate-binding assay, and frontal affinity chromatography. On the basis of this characterization, we classify SCLRC as a C-type lectin that binds to mannose and its derivatives.
|
41
|
Quiñones KD, Su H, Marshall B, Eggers S, Chen H. User-centered evaluation of Arizona BioPathway: an information extraction, integration, and visualization system. IEEE Trans Inf Technol Biomed 2007; 11:527-36. [PMID: 17912969 DOI: 10.1109/titb.2006.889706] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/08/2022]
Abstract
Explosive growth in biomedical research has made automated information extraction, knowledge integration, and visualization increasingly important and critically needed. The Arizona BioPathway (ABP) system extracts and displays biological regulatory pathway information from the abstracts of journal articles. This study uses relations extracted from more than 200 PubMed abstracts presented in a tabular and graphical user interface with built-in search and aggregation functionality. This paper presents a task-centered assessment of the usefulness and usability of the ABP system focusing on its relation aggregation and visualization functionalities. Results suggest that our graph-based visualization is more efficient in supporting pathway analysis tasks and is perceived as more useful and easier to use as compared to a text-based literature-viewing method. Relation aggregation significantly contributes to knowledge-acquisition efficiency. Together, the graphic and tabular views in the ABP Visualizer provide a flexible and effective interface for pathway relation browsing and analysis. Our study contributes to pathway-related research and biological information extraction by assessing the value of a multiview, relation-based interface that supports user-controlled exploration of pathway information across multiple granularities.
Affiliation(s)
- Karin D Quiñones
- Department of Management Information Systems, University of Arizona, Tucson, AZ 85721, USA.
|
42
|
Tsuruoka Y, McNaught J, Tsujii J, Ananiadou S. Learning string similarity measures for gene/protein name dictionary look-up using logistic regression. Bioinformatics 2007; 23:2768-74. [PMID: 17698493 DOI: 10.1093/bioinformatics/btm393] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.4] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION One of the bottlenecks of biomedical data integration is variation of terms. Exact string matching often fails to associate a name with its biological concept, i.e. ID or accession number in the database, due to seemingly small differences of names. Soft string matching potentially enables us to find the relevant ID by considering the similarity between the names. However, the accuracy of soft matching highly depends on the similarity measure employed. RESULTS We used logistic regression for learning a string similarity measure from a dictionary. Experiments using several large-scale gene/protein name dictionaries showed that the logistic regression-based similarity measure outperforms existing similarity measures in dictionary look-up tasks. AVAILABILITY A dictionary look-up system using the similarity measures described in this article is available at http://text0.mib.man.ac.uk/software/mldic/.
Affiliation(s)
- Yoshimasa Tsuruoka
- School of Computer Science, The University of Manchester, Manchester, UK.
|
43
|
Abstract
Processing text from scientific literature has become a necessity due to the burgeoning amounts of information that are fast becoming available, stemming from advances in electronic information technology. We created a program, NeuroText ( http://senselab.med.yale.edu/textmine/neurotext.pl ), designed specifically to extract information relevant to neuroscience-specific databases, NeuronDB and CellPropDB ( http://senselab.med.yale.edu/senselab/ ), housed at the Yale University School of Medicine. NeuroText extracts relevant information from the neuroscience literature in a two-step process: each step parses text at a different level of granularity. NeuroText uses an expert-mediated knowledge base and combines the techniques of indexing, contextual parsing, semantic and lexical parsing, and supervised and non-supervised learning to extract information. The constraints, metadata elements, and rules for information extraction are stored in the knowledge base. NeuroText was created as a pilot project to process 3 years of publications in Journal of Neuroscience and was subsequently tested on 40,000 PubMed abstracts. We also present here a template for creating a domain-nonspecific knowledge base that, when linked to a text-processing tool like NeuroText, can be used to extract knowledge in other fields of research.
Affiliation(s)
- Chiquito J Crasto
- Yale Center for Medical Informatics and Department of Neurobiology, Yale University School of Medicine, New Haven, CT, USA
|
44
|
Tulipano PK, Tao Y, Millar WS, Zanzonico P, Kolbert K, Xu H, Yu H, Chen L, Lussier YA, Friedman C. Natural language processing and visualization in the molecular imaging domain. J Biomed Inform 2006; 40:270-81. [PMID: 17084109 DOI: 10.1016/j.jbi.2006.08.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 12/14/2005] [Revised: 08/25/2006] [Accepted: 08/29/2006] [Indexed: 11/16/2022]
Abstract
Molecular imaging is at the crossroads of genomic sciences and medical imaging. Information within the molecular imaging literature could be used to link to genomic and imaging information resources and to organize and index images in a way that is potentially useful to researchers. A number of natural language processing (NLP) systems are available to automatically extract information from genomic literature. One existing NLP system, known as BioMedLEE, automatically extracts biological information consisting of biomolecular substances and phenotypic data. This paper focuses on the adaptation, evaluation, and application of BioMedLEE to the molecular imaging domain. In order to adapt BioMedLEE for this domain, we extend an existing molecular imaging terminology and incorporate it into BioMedLEE. BioMedLEE's performance is assessed with a formal evaluation study. The system's performance, measured as recall and precision, is 0.74 (95% CI: [.70-.76]) and 0.70 (95% CI [.63-.76]), respectively. We adapt a JAVA viewer known as PGviewer for the simultaneous visualization of images with NLP extracted information.
Affiliation(s)
- P Karina Tulipano
- Department of Biomedical Informatics, Columbia University, 622 West 168th Street, Vanderbilt Clinic Floor 5, NY 10032, USA
|
45
|
Abstract
MOTIVATION Abbreviations are an important type of terminology in the biomedical domain. Although several groups have already created databases of biomedical abbreviations, these are either not public, or are not comprehensive, or focus exclusively on acronym-type abbreviations. We have created another abbreviation database, ADAM, which covers commonly used abbreviations and their definitions (or long-forms) within MEDLINE titles and abstracts, including both acronym and non-acronym abbreviations. RESULTS A model of recognizing abbreviations and their long-forms from titles and abstracts of MEDLINE (2006 baseline) was employed. After grouping morphological variants, 59 405 abbreviation/long-form pairs were identified. ADAM shows high precision (97.4%) and includes most of the frequently used abbreviations contained in the Unified Medical Language System (UMLS) Lexicon and the Stanford Abbreviation Database. Conversely, one-third of abbreviations in ADAM are novel insofar as they are not included in either database. About 19% of the novel abbreviations are non-acronym-type and these cover at least seven different types of short-form/long-form pairs. AVAILABILITY A free, public query interface to ADAM is available at http://arrowsmith.psych.uic.edu, and the entire database can be downloaded as a text file.
Affiliation(s)
- Wei Zhou
- Department of Psychiatry and Psychiatric Institute, MC912, University of Illinois at Chicago, Chicago, IL 60612, USA
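As a rough illustration of what extracting abbreviation/long-form pairs involves, the sketch below matches a parenthesized short form against the initials of the words that precede it. This is a deliberately simple heuristic chosen for illustration, not the recognition model used to build ADAM, and it covers only acronym-type abbreviations, whereas ADAM also includes non-acronym ones. The function name `find_acronym_pairs` is hypothetical.

```python
import re

# A short form is assumed to be 2 to 10 letters inside parentheses.
PAREN = re.compile(r"\(([A-Za-z]{2,10})\)")


def find_acronym_pairs(text):
    """Return (short form, long form) pairs where the initials line up exactly."""
    pairs = []
    for match in PAREN.finditer(text):
        short = match.group(1)
        # Words that appear before the opening parenthesis.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(short):]
        initials_match = len(candidate) == len(short) and all(
            word[0].lower() == letter.lower()
            for word, letter in zip(candidate, short)
        )
        if initials_match:
            pairs.append((short, " ".join(candidate)))
    return pairs


sentence = "Terms were taken from the Unified Medical Language System (UMLS) Lexicon."
print(find_acronym_pairs(sentence))
# prints [('UMLS', 'Unified Medical Language System')]
```

Grouping morphological variants and handling short forms whose letters come from inside words would need additional logic that this sketch leaves out.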
|
46
|
Abstract
Motivation The use or study of chemical compounds permeates almost every scientific field and in each of them, the amount of textual information is growing rapidly. There is a need to accurately identify chemical names within text for a number of informatics efforts such as database curation, report summarization, tagging of named entities and keywords, or the development/curation of reference databases. Results A first-order Markov Model (MM) was evaluated for its ability to distinguish chemical names from words, yielding ~93% recall in recognizing chemical terms and ~99% precision in rejecting non-chemical terms on smaller test sets. However, because total false-positive events increase with the number of words analyzed, the scalability of name recognition was measured by processing 13.1 million MEDLINE records. The method yielded precision ranges from 54.7% to 100%, depending upon the cutoff score used, averaging 82.7% for approximately 1.05 million putative chemical terms extracted. Extracted chemical terms were analyzed to estimate the number of spelling variants per term, which correlated with the total number of times the chemical name appeared in MEDLINE. This variability in term construction was found to affect both information retrieval and term mapping when using PubMed and Ovid.
Affiliation(s)
- Jonathan D Wren
- Advanced Center for Genome Technology, Department of Botany and Microbiology, The University of Oklahoma, Norman, Oklahoma 73019, USA.
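The idea of scoring a term with a first-order Markov model, as mentioned in the abstract above, can be sketched with character transitions. The sketch assumes that one model is trained on chemical names and another on ordinary words, and that a term is labelled chemical when the first model gives it a higher log-likelihood. The tiny training lists, the add-one smoothing with an assumed alphabet size of 64, and all names below are illustrative assumptions, not details from the paper.

```python
import math
from collections import defaultdict


def train(names):
    """Count character-to-character transitions, with ^ and $ as boundaries."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        chars = "^" + name.lower() + "$"
        for current, following in zip(chars, chars[1:]):
            counts[current][following] += 1
    return counts


def log_likelihood(model, name, alphabet_size=64):
    """Log-probability of `name` under `model`, using add-one smoothing."""
    chars = "^" + name.lower() + "$"
    total = 0.0
    for current, following in zip(chars, chars[1:]):
        row = model.get(current, {})
        seen = row.get(following, 0)
        total += math.log((seen + 1) / (sum(row.values()) + alphabet_size))
    return total


def looks_chemical(name, chem_model, word_model):
    """Label `name` as chemical if the chemical model explains it better."""
    return log_likelihood(chem_model, name) > log_likelihood(word_model, name)


# Toy training data, far too small for real use.
CHEM_MODEL = train(["methanol", "ethanol", "propanol", "butanol"])
WORD_MODEL = train(["table", "window", "garden", "river"])

print(looks_chemical("pentanol", CHEM_MODEL, WORD_MODEL))  # prints True
print(looks_chemical("winter", CHEM_MODEL, WORD_MODEL))    # prints False
```

Instead of a hard comparison, the difference between the two log-likelihoods could be thresholded, which corresponds to the cutoff score that the abstract says trades precision against the number of extracted terms.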
|
47
|
Abstract
MOTIVATION Recently, several information extraction systems have been developed to retrieve relevant information out of biomedical text. However, these methods represent individual efforts. In this paper, we show that by combining different algorithms and their outcome, the results improve significantly. For this reason, CONAN has been created, a system which combines different programs and their outcome. Its methods include tagging of gene/protein names, finding interaction and mutation data, tagging of biological concepts and linking to MeSH and Gene Ontology terms. RESULTS In this paper, we will present data that show that combining different text-mining algorithms significantly improves the results. Not only is CONAN a full-scale approach that will ultimately cover all of PubMed/MEDLINE, we also show that this universality has no effect on quality: our system performs as well as or better than existing systems. AVAILABILITY The LDD corpus presented is available by request to the author. The system will be available shortly. For information and updates on CONAN please visit http://www.cs.uu.nl/people/rainer/conan.html.
Affiliation(s)
- Rainer Malik
- Universiteit Utrecht, Department of Information and Computing Sciences, Padualaan 14, 3584CH Utrecht, The Netherlands.
|
48
|
Abstract
For the average biologist, hands-on literature mining currently means a keyword search in PubMed. However, methods for extracting biomedical facts from the scientific literature have improved considerably, and the associated tools will probably soon be used in many laboratories to automatically annotate and analyse the growing number of system-wide experimental data sets. Owing to the increasing body of text and the open-access policies of many journals, literature mining is also becoming useful for both hypothesis generation and biological discovery. However, the latter will require the integration of literature and high-throughput data, which should encourage close collaborations between biologists and computational linguists.
Affiliation(s)
- Lars Juhl Jensen
- European Molecular Biology Laboratory, D-69117 Heidelberg, Germany.
|
49
|
Abstract
Nucleoli are plurifunctional nuclear domains involved in the regulation of several major cellular processes such as ribosome biogenesis, the biogenesis of non-ribosomal ribonucleoprotein complexes, cell cycle, and cellular aging. Until recently, the protein content of nucleoli was poorly described. Several proteomic analyses have been undertaken to discover the molecular bases of the biological roles fulfilled by nucleoli. These studies have led to the identification of more than 700 proteins. Extensive bibliographic and bioinformatic analyses allowed the classification of the identified proteins into functional groups and suggested potential functions of 150 human proteins previously uncharacterized. The combination of improvements in mass spectrometry technologies, the characterization of protein complexes, and data mining will assist in furthering our understanding of the role of nucleoli in different physiological and pathological cell states.
Affiliation(s)
- Yohann Couté
- Biomedical Proteomics Research Group, Département de Biologie Structurale et Bioinformatique, Centre Médical Universitaire, 1 Rue Michel Servet, 1211 Geneva 14, Switzerland.
|
50
|
Natarajan J, Berrar D, Hack CJ, Dubitzky W. Knowledge discovery in biology and biotechnology texts: a review of techniques, evaluation strategies, and applications. Crit Rev Biotechnol 2005; 25:31-52. [PMID: 15999851 DOI: 10.1080/07388550590935571] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Indexed: 01/24/2023]
Abstract
Arguably, the richest source of knowledge (as opposed to fact and data collections) about biology and biotechnology is captured in natural-language documents such as technical reports, conference proceedings and research articles. The automatic exploitation of this rich knowledge base for decision making, hypothesis management (generation and testing) and knowledge discovery constitutes a formidable challenge. Recently, a set of technologies collectively referred to as knowledge discovery in text (KDT) has been advocated as a promising approach to tackle this challenge. KDT comprises three main tasks: information retrieval, information extraction and text mining. These tasks are the focus of much recent scientific research and many algorithms have been developed and applied to documents and text in biology and biotechnology. This article introduces the basic concepts of KDT, provides an overview of some of these efforts in the field of bioscience and biotechnology, and presents a framework of commonly used techniques for evaluating KDT methods, tools and systems.
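Evaluation of information extraction output is commonly reported as precision, recall and F1 against a gold-standard annotation. The sketch below shows that calculation; it is an assumption that these are among the measures the review covers, since the abstract does not list them, and the function name and the example entity sets are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for a set of predicted items against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold-standard annotations and system output.
gold = {("BRCA1", "GENE"), ("TP53", "GENE"), ("p.R175H", "MUTATION"), ("EGFR", "GENE")}
predicted = {("BRCA1", "GENE"), ("TP53", "GENE"), ("KRAS", "GENE")}

p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Here two of three predictions are correct and two of four gold entities are found, so precision is 2/3, recall is 1/2 and F1 is 4/7.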
Affiliation(s)
- J Natarajan
- University of Ulster, School of Biomedical Sciences, Bioinformatics Research Group, Coleraine BT52 1SA, Northern Ireland
|