1. Gordils-Valentin L, Ouyang H, Qian L, Hong J, Zhu X. Conjugative type IV secretion systems enable bacterial antagonism that operates independently of plasmid transfer. Commun Biol 2024; 7:499. [PMID: 38664513 PMCID: PMC11045733 DOI: 10.1038/s42003-024-06192-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Accepted: 04/15/2024] [Indexed: 04/28/2024] Open
Abstract
Bacterial cooperation and antagonism mediated by secretion systems are among the ways in which bacteria interact with one another. Here we report the discovery of an antagonistic property of a type IV secretion system (T4SS) sourced from a conjugative plasmid, RP4, using engineering approaches. We scrutinized the genetic determinants and suggested that this antagonistic activity is independent of molecular cargos, while we also elucidated the resistance genes. We further showed that a range of Gram-negative bacteria and a mixed bacterial population can be eliminated by this T4SS-dependent antagonism. Finally, we showed that such an antagonistic property is not limited to T4SS sourced from RP4, rather it can also be observed in a T4SS originated from another conjugative plasmid, namely R388. Our results are the first demonstration of conjugative T4SS-dependent antagonism between Gram-negative bacteria on the genetic level and provide the foundation for future mechanistic studies.
Affiliation(s)
- Lois Gordils-Valentin
  - Department of Chemical Engineering, Texas A&M University, College Station, 77843, TX, US
  - Interdisciplinary Graduate Program in Genetics & Genomics, Texas A&M University, College Station, 77843, TX, US
- Huanrong Ouyang
  - Department of Chemical Engineering, Texas A&M University, College Station, 77843, TX, US
- Liangyu Qian
  - Department of Chemical Engineering, Texas A&M University, College Station, 77843, TX, US
- Joshua Hong
  - Department of Biology, Texas A&M University, College Station, 77843, TX, US
- Xuejun Zhu
  - Department of Chemical Engineering, Texas A&M University, College Station, 77843, TX, US
  - Interdisciplinary Graduate Program in Genetics & Genomics, Texas A&M University, College Station, 77843, TX, US
2. Wu H, Wang M, Wu J, Francis F, Chang YH, Shavick A, Dong H, Poon MTC, Fitzpatrick N, Levine AP, Slater LT, Handy A, Karwath A, Gkoutos GV, Chelala C, Shah AD, Stewart R, Collier N, Alex B, Whiteley W, Sudlow C, Roberts A, Dobson RJB. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. NPJ Digit Med 2022; 5:186. [PMID: 36544046 PMCID: PMC9770568 DOI: 10.1038/s41746-022-00730-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2022] [Accepted: 11/29/2022] [Indexed: 12/24/2022] Open
Abstract
Much of the knowledge and information needed for enabling high-quality clinical research is stored in free-text format. Natural language processing (NLP) has been used to extract information from these sources at scale for several decades. This paper aims to present a comprehensive review of clinical NLP for the past 15 years in the UK to identify the community, depict its evolution, analyse methodologies and applications, and identify the main barriers. We collect a dataset of clinical NLP projects (n = 94; £ = 41.97 m) funded by UK funders or the European Union's funding programmes. Additionally, we extract details on 9 funders, 137 organisations, 139 persons and 431 research papers. Networks are created from timestamped data interlinking all entities, and network analysis is subsequently applied to generate insights. 431 publications are identified as part of a literature review, of which 107 are eligible for final analysis. Results show, not surprisingly, clinical NLP in the UK has increased substantially in the last 15 years: the total budget in the period of 2019-2022 was 80 times that of 2007-2010. However, the effort is required to deepen areas such as disease (sub-)phenotyping and broaden application domains. There is also a need to improve links between academia and industry and enable deployments in real-world settings for the realisation of clinical NLP's great potential in care delivery. The major barriers include research and development access to hospital data, lack of capable computational resources in the right places, the scarcity of labelled data and barriers to sharing of pretrained models.
Affiliation(s)
- Honghan Wu
  - Institute of Health Informatics, University College London, London, UK
- Minhong Wang
  - Institute of Health Informatics, University College London, London, UK
- Jinge Wu
  - Institute of Health Informatics, University College London, London, UK
  - Usher Institute, University of Edinburgh, Edinburgh, UK
- Farah Francis
  - Usher Institute, University of Edinburgh, Edinburgh, UK
- Yun-Hsuan Chang
  - Institute of Health Informatics, University College London, London, UK
- Alex Shavick
  - Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Hang Dong
  - Usher Institute, University of Edinburgh, Edinburgh, UK
  - Department of Computer Science, University of Oxford, Oxford, UK
- Adam P Levine
  - Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Luke T Slater
  - Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
- Alex Handy
  - Institute of Health Informatics, University College London, London, UK
  - University College London Hospitals NHS Trust, London, UK
- Andreas Karwath
  - Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
- Georgios V Gkoutos
  - Institute of Cancer and Genomics, University of Birmingham, Birmingham, UK
- Claude Chelala
  - Centre for Tumour Biology, Barts Cancer Institute, Queen Mary University of London, London, UK
- Anoop Dinesh Shah
  - Institute of Health Informatics, University College London, London, UK
- Robert Stewart
  - Department of Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience (IoPPN), King's College London, London, UK
  - South London and Maudsley NHS Foundation Trust, London, UK
- Nigel Collier
  - Theoretical and Applied Linguistics, Faculty of Modern & Medieval Languages & Linguistics, University of Cambridge, Cambridge, UK
- Beatrice Alex
  - Edinburgh Futures Institute, University of Edinburgh, Edinburgh, UK
- Cathie Sudlow
  - Usher Institute, University of Edinburgh, Edinburgh, UK
- Angus Roberts
  - Department of Biostatistics & Health Informatics, King's College London, London, UK
- Richard J B Dobson
  - Institute of Health Informatics, University College London, London, UK
  - Department of Biostatistics & Health Informatics, King's College London, London, UK
3. Chaix E, Deléger L, Bossy R, Nédellec C. Text mining tools for extracting information about microbial biodiversity in food. Food Microbiol 2018; 81:63-75. [PMID: 30910089 PMCID: PMC6460834 DOI: 10.1016/j.fm.2018.04.011] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 11/08/2017] [Revised: 03/26/2018] [Accepted: 04/17/2018] [Indexed: 12/20/2022]
Abstract
Information on food microbial diversity is scattered across millions of scientific papers. Researchers need tools to assist their bibliographic search in such large collections. Text mining and knowledge engineering methods are useful to automatically and efficiently find relevant information in Life Science. This work describes how the Alvis text mining platform has been applied to a large collection of PubMed abstracts of scientific papers in the food microbiology domain. The information targeted by our work is microorganisms, their habitats and phenotypes. Two knowledge resources, the NCBI taxonomy and the OntoBiotope ontology were used to detect this information in texts. The result of the text mining process was indexed and is presented through the AlvisIR Food on-line semantic search engine. In this paper, we also show through two illustrative examples the great potential of this new tool to assist in studies on ecological diversity and the origin of microbial presence in food. We present new text-mining tools to extract information in food microbiology. The results of the extraction are available in an on-line semantic search engine. Taxa, habitats, phenotypes and links between them can be queried in PubMed abstracts. Text-mining tools could assist to browse past and recent scientific literature. Two use-cases are presented: fruit microbiota and spore-forming bacteria in food.
Affiliation(s)
- Estelle Chaix
  - MaIAGE, INRA, Université Paris-Saclay, 78350 Jouy-en-Josas, France
- Louise Deléger
  - MaIAGE, INRA, Université Paris-Saclay, 78350 Jouy-en-Josas, France
- Robert Bossy
  - MaIAGE, INRA, Université Paris-Saclay, 78350 Jouy-en-Josas, France
- Claire Nédellec
  - MaIAGE, INRA, Université Paris-Saclay, 78350 Jouy-en-Josas, France
4. Gong L, Yang R, Liu Q, Dong Z, Chen H, Yang G. A Dictionary-Based Approach for Identifying Biomedical Concepts. INT J PATTERN RECOGN 2017. [DOI: 10.1142/s021800141757004x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022]
Abstract
In this research, we provide a dictionary-based approach for identifying biomedical concepts from the literature. The approach first crawls an experimental corpus with E-utilities and builds a concept dictionary. We then develop an algorithm called the Variable-step Window Identification Algorithm (VWIA) for matching biomedical concepts, based on preprocessing, POS tagging and the formation of phrase blocks. The approach can identify embedded biomedical concepts and new concepts, and can therefore identify concepts more completely. The proposed approach obtained a 95.0% F-measure overall on the test dataset. It is thus a promising method for biomedical text mining.
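To make the idea of dictionary matching with a variable-width window concrete, the following is a minimal sketch of greedy longest-match lookup. It is not the authors' VWIA, which also relies on POS tagging and phrase blocks that the abstract does not specify; the function name `find_concepts`, the `max_window` parameter and the toy dictionary are all assumptions made for illustration.

```python
def find_concepts(tokens, dictionary, max_window=4):
    """Greedy longest-match lookup of dictionary phrases in a token list.

    Returns (start, end, phrase) tuples with an exclusive end index.
    Hypothetical helper for illustration; not the published VWIA.
    """
    matches = []
    i = 0
    while i < len(tokens):
        step = 1  # advance one token when nothing matches here
        # Try the widest window first, then shrink it one token at a time.
        for width in range(min(max_window, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + width]).lower()
            if phrase in dictionary:
                matches.append((i, i + width, phrase))
                step = width  # variable step: jump past the matched phrase
                break
        i += step
    return matches


# Toy dictionary and sentence (assumed data, not taken from the paper).
concepts = {"type iv secretion system", "secretion system", "plasmid"}
sentence = "The type IV secretion system is encoded on a plasmid".split()
found = find_concepts(sentence, concepts)
print(found)
```

Because the widest window is tried first, the longer phrase wins over the shorter "secretion system" nested inside it. Reporting nested (embedded) concepts as well, as the abstract mentions, would require keeping the shorter matches instead of jumping past them.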
Affiliation(s)
- Lejun Gong
  - Jiangsu Key Lab of Big Data Security and Intelligent Processing, Jiangsu High Technology Research Key Lab for Wireless Sensor Networks, College of Computer Science & Technology, Nanjing University of Posts and Telecommunications, Nanjing, 210003, P. R. China
- Ronggen Yang
  - College of Intelligent Science and Control Engineering, Jinling Institute of Technology, Nanjing, 211169, P. R. China
- Quan Liu
  - Jiangsu Key Lab of Big Data Security and Intelligent Processing, Jiangsu High Technology Research Key Lab for Wireless Sensor Networks, College of Computer Science & Technology, Nanjing University of Posts and Telecommunications, Nanjing, 210003, P. R. China
- Zhenjiang Dong
  - Zhongxing Telecommunication Equipment Corporation, Shenzhen, 518057, P. R. China
- Hong Chen
  - Zhongxing Telecommunication Equipment Corporation, Shenzhen, 518057, P. R. China
- Geng Yang
  - Jiangsu Key Lab of Big Data Security and Intelligent Processing, Jiangsu High Technology Research Key Lab for Wireless Sensor Networks, College of Computer Science & Technology, Nanjing University of Posts and Telecommunications, Nanjing, 210003, P. R. China
5. Gillespie JJ, Kaur SJ, Rahman MS, Rennoll-Bankert K, Sears KT, Beier-Sexton M, Azad AF. Secretome of obligate intracellular Rickettsia. FEMS Microbiol Rev 2014; 39:47-80. [PMID: 25168200 DOI: 10.1111/1574-6976.12084] [Citation(s) in RCA: 75] [Impact Index Per Article: 7.5] [Indexed: 12/22/2022] Open
Abstract
The genus Rickettsia (Alphaproteobacteria, Rickettsiales, Rickettsiaceae) is comprised of obligate intracellular parasites, with virulent species of interest both as causes of emerging infectious diseases and for their potential deployment as bioterrorism agents. Currently, there are no effective commercially available vaccines, with treatment limited primarily to tetracycline antibiotics, although others (e.g. josamycin, ciprofloxacin, chloramphenicol, and azithromycin) are also effective. Much of the recent research geared toward understanding mechanisms underlying rickettsial pathogenicity has centered on characterization of secreted proteins that directly engage eukaryotic cells. Herein, we review all aspects of the Rickettsia secretome, including six secretion systems, 19 characterized secretory proteins, and potential moonlighting proteins identified on surfaces of multiple Rickettsia species. Employing bioinformatics and phylogenomics, we present novel structural and functional insight on each secretion system. Unexpectedly, our investigation revealed that the majority of characterized secretory proteins have not been assigned to their cognate secretion pathways. Furthermore, for most secretion pathways, the requisite signal sequences mediating translocation are poorly understood. As a blueprint for all known routes of protein translocation into host cells, this resource will assist research aimed at uniting characterized secreted proteins with their apposite secretion pathways. Furthermore, our work will help in the identification of novel secreted proteins involved in rickettsial 'life on the inside'.
Affiliation(s)
- Joseph J Gillespie
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD, USA
- Simran J Kaur
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD, USA
- M Sayeedur Rahman
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD, USA
- Kristen Rennoll-Bankert
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD, USA
- Khandra T Sears
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD, USA
- Magda Beier-Sexton
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD, USA
- Abdu F Azad
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD, USA
6. Bhatty M, Laverde Gomez JA, Christie PJ. The expanding bacterial type IV secretion lexicon. Res Microbiol 2013; 164:620-39. [PMID: 23542405 DOI: 10.1016/j.resmic.2013.03.012] [Citation(s) in RCA: 121] [Impact Index Per Article: 11.0] [Revised: 12/01/2012] [Accepted: 02/05/2013] [Indexed: 02/06/2023]
Abstract
The bacterial type IV secretion systems (T4SSs) comprise a biologically diverse group of translocation systems functioning to deliver DNA or protein substrates from donor to target cells generally by a mechanism dependent on establishment of direct cell-to-cell contact. Members of one T4SS subfamily, the conjugation systems, mediate the widespread and rapid dissemination of antibiotic resistance and virulence traits among bacterial pathogens. Members of a second subfamily, the effector translocators, are used by often medically-important pathogens to deliver effector proteins to eukaryotic target cells during the course of infection. Here we summarize our current understanding of the structural and functional diversity of T4SSs and of the evolutionary processes shaping this diversity. We compare mechanistic and architectural features of T4SSs from Gram-negative and -positive species. Finally, we introduce the concept of the 'minimized' T4SSs; these are systems composed of a conserved set of 5-6 subunits that are distributed among many Gram-positive and some Gram-negative species.
Affiliation(s)
- Minny Bhatty
  - Department of Microbiology and Molecular Genetics, University of Texas Medical School at Houston, 6431 Fannin, Houston, TX 77030, USA
7. Kim S, Kim W, Wei CH, Lu Z, Wilbur WJ. Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2012; 2012:bas042. [PMID: 23160415 PMCID: PMC3500521 DOI: 10.1093/database/bas042] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 11/17/2022]
Abstract
The Comparative Toxicogenomics Database (CTD) contains manually curated literature that describes chemical–gene interactions, chemical–disease relationships and gene–disease relationships. Finding articles containing this information is the first and an important step to assist manual curation efficiency. However, the complex nature of named entities and their relationships make it challenging to choose relevant articles. In this article, we introduce a machine learning framework for prioritizing CTD-relevant articles based on our prior system for the protein–protein interaction article classification task in BioCreative III. To address new challenges in the CTD task, we explore a new entity identification method for genes, chemicals and diseases. In addition, latent topics are analyzed and used as a feature type to overcome the small size of the training set. Applied to the BioCreative 2012 Triage dataset, our method achieved 0.8030 mean average precision (MAP) in the official runs, resulting in the top MAP system among participants. Integrated with PubTator, a Web interface for annotating biomedical literature, the proposed system also received a positive review from the CTD curation team.
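For readers unfamiliar with the mean average precision (MAP) figure quoted in this abstract, the sketch below shows how the metric is commonly computed from ranked relevance labels. It is a generic illustration, not the BioCreative evaluation script; the function names and the toy rankings are assumptions, and it assumes every relevant article appears somewhere in each ranked list.

```python
def average_precision(ranked_labels):
    """Average precision for one ranked list of 0/1 relevance labels."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two toy queries (assumed data): 1 = relevant article, 0 = not relevant.
toy_rankings = [[1, 0, 1, 0], [0, 1, 1]]
toy_map = mean_average_precision(toy_rankings)
print(round(toy_map, 4))
```

A triage system scores higher on this metric when it places the relevant articles nearer the top of each list, which is the property that matters when curators read only the first few results.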
Affiliation(s)
- Sun Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
8. Pyysalo S, Ohta T, Rak R, Sullivan D, Mao C, Wang C, Sobral B, Tsujii J, Ananiadou S. Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011. BMC Bioinformatics 2012; 13 Suppl 11:S2. [PMID: 22759456 PMCID: PMC3384257 DOI: 10.1186/1471-2105-13-s11-s2] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.7] [Indexed: 12/12/2022] Open
Abstract
We present the preparation, resources, results and analysis of three tasks of the BioNLP Shared Task 2011: the main tasks on Infectious Diseases (ID) and Epigenetics and Post-translational Modifications (EPI), and the supporting task on Entity Relations (REL). The two main tasks represent extensions of the event extraction model introduced in the BioNLP Shared Task 2009 (ST'09) to two new areas of biomedical scientific literature, each motivated by the needs of specific biocuration tasks. The ID task concerns the molecular mechanisms of infection, virulence and resistance, focusing in particular on the functions of a class of signaling systems that are ubiquitous in bacteria. The EPI task is dedicated to the extraction of statements regarding chemical modifications of DNA and proteins, with particular emphasis on changes relating to the epigenetic control of gene expression. By contrast to these two application-oriented main tasks, the REL task seeks to support extraction in general by separating challenges relating to part-of relations into a subproblem that can be addressed by independent systems. Seven groups participated in each of the two main tasks and four groups in the supporting task. The participating systems indicated advances in the capability of event extraction methods and demonstrated generalization in many aspects: from abstracts to full texts, from previously considered subdomains to new ones, and from the ST'09 extraction targets to other entities and events. The highest performance achieved in the supporting task REL, 58% F-score, is broadly comparable with levels reported for other relation extraction tasks. For the ID task, the highest-performing system achieved 56% F-score, comparable to the state-of-the-art performance at the established ST'09 task. 
In the EPI task, the best result was 53% F-score for the full set of extraction targets and 69% F-score for a reduced set of core extraction targets, approaching a level of performance sufficient for user-facing applications. In this study, we extend on previously reported results and perform further analyses of the outputs of the participating systems. We place specific emphasis on aspects of system performance relating to real-world applicability, considering alternate evaluation metrics and performing additional manual analysis of system outputs. We further demonstrate that the strengths of extraction systems can be combined to improve on the performance achieved by any system in isolation. The manually annotated corpora, supporting resources, and evaluation tools for all tasks are available from http://www.bionlp-st.org and the tasks continue as open challenges for all interested parties.
Affiliation(s)
- Sampo Pyysalo
  - School of Computer Science, University of Manchester, Manchester, UK
  - National Centre for Text Mining, University of Manchester, Manchester, UK
- Tomoko Ohta
  - Department of Computer Science, University of Tokyo, Tokyo, Japan
- Rafal Rak
  - School of Computer Science, University of Manchester, Manchester, UK
  - National Centre for Text Mining, University of Manchester, Manchester, UK
- Dan Sullivan
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
- Chunhong Mao
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
- Chunxia Wang
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
- Bruno Sobral
  - Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, Virginia, USA
- Sophia Ananiadou
  - School of Computer Science, University of Manchester, Manchester, UK
  - National Centre for Text Mining, University of Manchester, Manchester, UK
9. Combined SVM-CRFs for biological named entity recognition with maximal bidirectional squeezing. PLoS One 2012; 7:e39230. [PMID: 22745720 PMCID: PMC3383748 DOI: 10.1371/journal.pone.0039230] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 03/19/2012] [Accepted: 05/21/2012] [Indexed: 11/25/2022] Open
Abstract
Biological named entity recognition, the identification of biological terms in text, is essential for biomedical information extraction. Machine learning-based approaches have been widely applied in this area. However, the recognition performance of current approaches could still be improved. Our novel approach is to combine support vector machines (SVMs) and conditional random fields (CRFs), which can complement and facilitate each other. During the hybrid process, we use SVM to separate biological terms from non-biological terms, before we use CRFs to determine the types of biological terms, which makes full use of the power of SVM as a binary-class classifier and the data-labeling capacity of CRFs. We then merge the results of SVM and CRFs. To remove any inconsistencies that might result from the merging, we develop a useful algorithm and apply two rules. To ensure biological terms with a maximum length are identified, we propose a maximal bidirectional squeezing approach that finds the longest term. We also add a positive gain to rare events to reinforce their probability and avoid bias. Our approach will also gradually extend the context so more contextual information can be included. We examined the performance of four approaches with GENIA corpus and JNLPBA04 data. The combination of SVM and CRFs improved performance. The macro-precision, macro-recall, and macro-F1 of the SVM-CRFs hybrid approach surpassed conventional SVM and CRFs. After applying the new algorithms, the macro-F1 reached 91.67% with the GENIA corpus and 84.04% with the JNLPBA04 data.
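The macro-averaged scores reported in this abstract can be illustrated with a short worked example. The sketch below averages per-class precision, recall and F1 with equal weight per entity type; the abstract does not say which macro-F1 convention the authors used, so this choice, the function name `macro_scores` and the toy counts are assumptions.

```python
def macro_scores(counts):
    """Macro-averaged precision, recall and F1.

    `counts` maps an entity type to (true positives, false positives,
    false negatives). Each type contributes equally to the average.
    """
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy per-type counts (assumed data, not results from the paper).
toy_counts = {"protein": (8, 2, 2), "DNA": (3, 1, 3)}
macro_p, macro_r, macro_f1 = macro_scores(toy_counts)
print(round(macro_p, 3), round(macro_r, 3), round(macro_f1, 3))
```

Macro-averaging gives a rare entity type the same weight as a frequent one, so weak performance on rare types lowers the score more than it would under micro-averaging.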
10. PATRIC: the comprehensive bacterial bioinformatics resource with a focus on human pathogenic species. Infect Immun 2011; 79:4286-98. [PMID: 21896772 DOI: 10.1128/iai.00207-11] [Citation(s) in RCA: 201] [Impact Index Per Article: 15.5] [Indexed: 12/13/2022] Open
Abstract
Funded by the National Institute of Allergy and Infectious Diseases, the Pathosystems Resource Integration Center (PATRIC) is a genomics-centric relational database and bioinformatics resource designed to assist scientists in infectious-disease research. Specifically, PATRIC provides scientists with (i) a comprehensive bacterial genomics database, (ii) a plethora of associated data relevant to genomic analysis, and (iii) an extensive suite of computational tools and platforms for bioinformatics analysis. While the primary aim of PATRIC is to advance the knowledge underlying the biology of human pathogens, all publicly available genome-scale data for bacteria are compiled and continually updated, thereby enabling comparative analyses to reveal the basis for differences between infectious free-living and commensal species. Herein we summarize the major features available at PATRIC, dividing the resources into two major categories: (i) organisms, genomes, and comparative genomics and (ii) recurrent integration of community-derived associated data. Additionally, we present two experimental designs typical of bacterial genomics research and report on the execution of both projects using only PATRIC data and tools. These applications encompass a broad range of the data and analysis tools available, illustrating practical uses of PATRIC for the biologist. Finally, a summary of PATRIC's outreach activities, collaborative endeavors, and future research directions is provided.
11. Automatic extraction of microorganisms and their habitats from free text using text mining workflows. J Integr Bioinform 2011. [DOI: 10.1515/jib-2011-184] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022] Open
Abstract
Summary: In this paper we illustrate the usage of text mining workflows to automatically extract instances of microorganisms and their habitats from free text; these entries can then be curated and added to different databases. To this end, we use a Conditional Random Field (CRF) based classifier, as part of the workflows, to extract the mention of microorganisms, habitats and the inter-relation between organisms and their habitats. Results indicate a good performance for extraction of microorganisms and the relation extraction aspects of the task (with a precision of over 80%), while habitat recognition is only moderate (a precision of about 65%). We also conjecture that pdf-to-text conversion can be quite noisy and this implicitly affects any sentence-based relation extraction algorithms.