1
|
Alvarez L, Liu K, Ishida C, Sánchez-Pérez M, Wuchty S, Makse HA. Symmetries in metabolic networks of Escherichia coli. PNAS NEXUS 2025; 4:pgaf080. [PMID: 40144777 PMCID: PMC11937945 DOI: 10.1093/pnasnexus/pgaf080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/25/2024] [Accepted: 02/08/2025] [Indexed: 03/28/2025]
Abstract
The over-representation of motifs was previously considered a viable definition of building blocks in biological networks. Here, we construct an alternative definition based on invariance properties of enzymes in metabolic networks of Escherichia coli. In particular, we consider input trees of each enzyme that bundle all metabolic reactions where information is transmitted. Isomorphisms of such input trees point to symmetric enzymes grouped in "fibers" of the metabolic network that process equivalent dynamics. Such groups of enzymes constitute an alternative concept of building blocks which can be systematically classified into topological types of input trees according to their complexity. In contrast to motifs and modules, enzymes in such fibers are not necessarily mutually connected but still can be functionally related. Our analysis finds novel varieties of building blocks that capture such symmetries in hitherto unknown "composite Fibonacci" fibers. Lending credence to their significance as fundamental building blocks, we observe that enzymes in fibers are functionally more homogeneous than their network motif and module counterparts, suggesting that fibers point to a novel way of building blocks that capture metabolic functionality on a topological level.
Collapse
Affiliation(s)
- Luis Alvarez
- Levich Institute and Department of Physics, City College of New York, New York, NY 10031, USA
| | - Kuang Liu
- Levich Institute and Department of Physics, City College of New York, New York, NY 10031, USA
| | - Cecilia Ishida
- Faculty of Medicine and Biomedical Sciences, Autonomous University of Chihuahua, Chihuahua 31125, Mexico
| | - Mishael Sánchez-Pérez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca 62210, Mexico
| | - Stefan Wuchty
- Department of Computer Science, University of Miami, Coral Gables, FL 33146, USA
- Department of Biology, University of Miami, Coral Gables, FL 33146, USA
- Institute of Data Science and Computing, University of Miami, Coral Gables, FL 33146, USA
- Sylvester Comprehensive Cancer Center, University of Miami, Miami, FL 33136, USA
| | - Hernán A Makse
- Levich Institute and Department of Physics, City College of New York, New York, NY 10031, USA
| |
Collapse
|
2
|
Choudhary D, Foster KR, Uphoff S. The master regulator OxyR orchestrates bacterial oxidative stress response genes in space and time. Cell Syst 2024; 15:1033-1045.e6. [PMID: 39541985 DOI: 10.1016/j.cels.2024.10.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2024] [Revised: 07/10/2024] [Accepted: 10/21/2024] [Indexed: 11/17/2024]
Abstract
Bacteria employ diverse gene regulatory networks to survive stress, but deciphering the underlying logic of these complex networks has proved challenging. Here, we use time-resolved single-cell imaging to explore the functioning of the E. coli regulatory response to oxidative stress. We observe diverse gene expression dynamics within the network. However, by controlling for stress-induced growth-rate changes, we show that these patterns involve just three classes of regulation: downregulated genes, upregulated pulsatile genes, and gradually upregulated genes. The two upregulated classes are distinguished by differences in the binding of the transcription factor, OxyR, and appear to play distinct roles during stress protection. Pulsatile genes activate transiently in a few cells for initial protection of a group of cells, whereas gradually upregulated genes induce evenly, generating a lasting protection involving many cells. Our study shows how bacterial populations use simple regulatory principles to coordinate stress responses in space and time. A record of this paper's transparent peer review process is included in the supplemental information.
Collapse
Affiliation(s)
- Divya Choudhary
- Department of Biochemistry, University of Oxford, Oxford, UK
| | - Kevin R Foster
- Department of Biochemistry, University of Oxford, Oxford, UK; Department of Biology, University of Oxford, Oxford, UK; Sir William Dunn School of Pathology, South Parks Road, Oxford OX1 3RE, UK.
| | - Stephan Uphoff
- Department of Biochemistry, University of Oxford, Oxford, UK.
| |
Collapse
|
3
|
Varela-Vega A, Posada-Reyes AB, Méndez-Cruz CF. Automatic extraction of transcriptional regulatory interactions of bacteria from biomedical literature using a BERT-based approach. Database (Oxford) 2024; 2024:baae094. [PMID: 39213391 PMCID: PMC11363960 DOI: 10.1093/database/baae094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2024] [Revised: 08/09/2024] [Accepted: 08/14/2024] [Indexed: 09/04/2024]
Abstract
Transcriptional regulatory networks (TRNs) give a global view of the regulatory mechanisms of bacteria to respond to environmental signals. These networks are published in biological databases as a valuable resource for experimental and bioinformatics researchers. Despite the efforts to publish TRNs of diverse bacteria, many of them still lack one and many of the existing TRNs are incomplete. In addition, the manual extraction of information from biomedical literature ("literature curation") has been the traditional way to extract these networks, despite this being demanding and time-consuming. Recently, language models based on pretrained transformers have been used to extract relevant knowledge from biomedical literature. Moreover, the benefit of fine-tuning a large pretrained model with new limited data for a specific task ("transfer learning") opens roads to address new problems of biomedical information extraction. Here, to alleviate this lack of knowledge and assist literature curation, we present a new approach based on the Bidirectional Transformer for Language Understanding (BERT) architecture to classify transcriptional regulatory interactions of bacteria as a first step to extract TRNs from literature. The approach achieved a significant performance in a test dataset of sentences of Escherichia coli (F1-Score: 0.8685, Matthew's correlation coefficient: 0.8163). The examination of model predictions revealed that the model learned different ways to express the regulatory interaction. The approach was evaluated to extract a TRN of Salmonella using 264 complete articles. The evaluation showed that the approach was able to accurately extract 82% of the network and that it was able to extract interactions absent in curation data. To the best of our knowledge, the present study is the first effort to obtain a BERT-based approach to extract this specific kind of interaction. This approach is a starting point to address the limitations of reconstructing TRNs of bacteria and diseases of biological interest. Database URL: https://github.com/laigen-unam/BERT-trn-extraction.
Collapse
Affiliation(s)
- Alfredo Varela-Vega
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, UNAM, Av. Universidad S/N Col. Chamilpa, Cuernavaca, Morelos 62210, México
| | - Ali-Berenice Posada-Reyes
- Laboratorio de Microbiología, Inmunología y Salud Pública, Facultad de Estudios Superiores Cuautitlán, UNAM, Carretera Cuautitlán-Teoloyucan Km. 2.5, Xhala, Cuautitlán Izcalli, Estado de México 54714, México
| | - Carlos-Francisco Méndez-Cruz
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, UNAM, Av. Universidad S/N Col. Chamilpa, Cuernavaca, Morelos 62210, México
| |
Collapse
|
4
|
Santos-Zavaleta A, Salgado H, Gama-Castro S, Sánchez-Pérez M, Gómez-Romero L, Ledezma-Tejeida D, García-Sotelo JS, Alquicira-Hernández K, Muñiz-Rascado LJ, Peña-Loredo P, Ishida-Gutiérrez C, Velázquez-Ramírez DA, Del Moral-Chávez V, Bonavides-Martínez C, Méndez-Cruz CF, Galagan J, Collado-Vides J. RegulonDB v 10.5: tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K-12. Nucleic Acids Res 2020; 47:D212-D220. [PMID: 30395280 PMCID: PMC6324031 DOI: 10.1093/nar/gky1077] [Citation(s) in RCA: 234] [Impact Index Per Article: 46.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2018] [Accepted: 10/19/2018] [Indexed: 01/31/2023] Open
Abstract
RegulonDB, first published 20 years ago, is a comprehensive electronic resource about regulation of transcription initiation of Escherichia coli K-12 with decades of knowledge from classic molecular biology experiments, and recently also from high-throughput genomic methodologies. We curated the literature to keep RegulonDB up to date, and initiated curation of ChIP and gSELEX experiments. We estimate that current knowledge describes between 10% and 30% of the expected total number of transcription factor- gene regulatory interactions in E. coli. RegulonDB provides datasets for interactions for which there is no evidence that they affect expression, as well as expression datasets. We developed a proof of concept pipeline to merge binding and expression evidence to identify regulatory interactions. These datasets can be visualized in the RegulonDB JBrowse. We developed the Microbial Conditions Ontology with a controlled vocabulary for the minimal properties to reproduce an experiment, which contributes to integrate data from high throughput and classic literature. At a higher level of integration, we report Genetic Sensory-Response Units for 200 transcription factors, including their regulation at the metabolic level, and include summaries for 70 of them. Finally, we summarize our research with Natural language processing strategies to enhance our biocuration work.
Collapse
Affiliation(s)
- Alberto Santos-Zavaleta
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Heladia Salgado
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Socorro Gama-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Mishael Sánchez-Pérez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Laura Gómez-Romero
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Daniela Ledezma-Tejeida
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | | | - Kevin Alquicira-Hernández
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Luis José Muñiz-Rascado
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Pablo Peña-Loredo
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Cecilia Ishida-Gutiérrez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - David A Velázquez-Ramírez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Víctor Del Moral-Chávez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - César Bonavides-Martínez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | | | - James Galagan
- Department of Biomedical Engineering, Boston University, Boston, MA, USA
| | - Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México.,Department of Biomedical Engineering, Boston University, Boston, MA, USA
| |
Collapse
|
5
|
Méndez-Cruz CF, Gama-Castro S, Mejía-Almonte C, Castillo-Villalba MP, Muñiz-Rascado LJ, Collado-Vides J. First steps in automatic summarization of transcription factor properties for RegulonDB: classification of sentences about structural domains and regulated processes. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018; 2017:4237584. [PMID: 29220462 PMCID: PMC5737074 DOI: 10.1093/database/bax070] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/25/2016] [Accepted: 08/15/2017] [Indexed: 11/17/2022]
Abstract
The RegulonDB (http://regulondb.ccg.unam.mx) team generates manually elaborated summaries about transcription factors (TFs) of Escherichia coli K-12. These texts involve considerable effort, since they summarize a diverse collection of structural, mechanistic and physiological properties of TFs and, due to constant new research, ideally they require frequent updating. In natural language processing, several techniques for automatic summarization have been developed. Therefore, our proposal is to extract, by using those techniques, relevant information about TFs for assisting the curation and elaboration of the manual summaries. Here, we present the results of the automatic classification of sentences about the biological processes regulated by a TF and the information about the structural domains constituting the TF. We tested two classical classifiers, Naïve Bayes and Support Vector Machines (SVMs), with the sentences of the manual summaries as training data. The best classifier was an SVM employing lexical, grammatical, and terminological features (F-score, 0.8689). The sentences of articles analyzed by this classifier were frequently true, but many sentences were set aside (high precision with low recall); consequently, some improvement is required. Nevertheless, automatic summaries of complete articles about five TFs, generated with this classifier, included much of the relevant information of the summaries written by curators (high ROUGE-1 recall). In fact, a manual comparison confirmed that the best summary encompassed 100% of the relevant information. Hence, our empirical results suggest that our proposal is promising for covering more properties of TFs to generate suggested sentences with relevant information to help the curation work without losing quality.
Collapse
Affiliation(s)
- Carlos-Francisco Méndez-Cruz
- Computational Genomics Program, Center for Genomic Sciences, National Autonomous University of Mexico, Av. Universidad, s/n, Colonia Chamilpa, Cuernavaca, Morelos 62100, Mexico
| | - Socorro Gama-Castro
- Computational Genomics Program, Center for Genomic Sciences, National Autonomous University of Mexico, Av. Universidad, s/n, Colonia Chamilpa, Cuernavaca, Morelos 62100, Mexico
| | - Citlalli Mejía-Almonte
- Computational Genomics Program, Center for Genomic Sciences, National Autonomous University of Mexico, Av. Universidad, s/n, Colonia Chamilpa, Cuernavaca, Morelos 62100, Mexico
| | - Marco-Polo Castillo-Villalba
- Computational Genomics Program, Center for Genomic Sciences, National Autonomous University of Mexico, Av. Universidad, s/n, Colonia Chamilpa, Cuernavaca, Morelos 62100, Mexico
| | - Luis-José Muñiz-Rascado
- Computational Genomics Program, Center for Genomic Sciences, National Autonomous University of Mexico, Av. Universidad, s/n, Colonia Chamilpa, Cuernavaca, Morelos 62100, Mexico
| | - Julio Collado-Vides
- Computational Genomics Program, Center for Genomic Sciences, National Autonomous University of Mexico, Av. Universidad, s/n, Colonia Chamilpa, Cuernavaca, Morelos 62100, Mexico
| |
Collapse
|
6
|
Britan A, Cusin I, Hinard V, Mottin L, Pasche E, Gobeill J, Rech de Laval V, Gleizes A, Teixeira D, Michel PA, Ruch P, Gaudet P. Accelerating annotation of articles via automated approaches: evaluation of the neXtA5 curation-support tool by neXtProt. Database (Oxford) 2018; 2018:5255187. [PMID: 30576492 PMCID: PMC6301339 DOI: 10.1093/database/bay129] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2018] [Revised: 10/04/2018] [Accepted: 11/09/2018] [Indexed: 11/14/2022]
Abstract
The development of efficient text-mining tools promises to boost the curation workflow by significantly reducing the time needed to process the literature into biological databases. We have developed a curation support tool, neXtA5, that provides a search engine coupled with an annotation system directly integrated into a biocuration workflow. neXtA5 assists curation with modules optimized for the thevarious curation tasks: document triage, entity recognition and information extraction.Here, we describe the evaluation of neXtA5 by expert curators. We first assessed the annotations of two independent curators to provide a baseline for comparison. To evaluate the performance of neXtA5, we submitted requests and compared the neXtA5 results with the manual curation. The analysis focuses on the usability of neXtA5 to support the curation of two types of data: biological processes (BPs) and diseases (Ds). We evaluated the relevance of the papers proposed as well as the recall and precision of the suggested annotations.The evaluation of document triage by neXtA5 precision showed that both curators agree with neXtA5 for 67 (BP) and 63% (D) of abstracts, while curators agree on accepting or rejecting an abstract ~80% of the time. Hence, the precision of the triage system is satisfactory.For concept extraction, curators approved 35 (BP) and 25% (D) of the neXtA5 annotations. Conversely, neXtA5 successfully annotated up to 36 (BP) and 68% (D) of the terms identified by curators. The user feedback obtained in these tests highlighted the need for improvement in the ranking function of neXtA5 annotations. Therefore, we transformed the information extraction component into an annotation ranking system. This improvement results in a top precision (precision at first rank) of 59 (D) and 63% (BP). These results suggest that when considering only the first extracted entity, the current system achieves a precision comparable with expert biocurators.
Collapse
Affiliation(s)
- Aurore Britan
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
| | - Isabelle Cusin
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
| | - Valérie Hinard
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
| | - Luc Mottin
- Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
- SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
| | - Emilie Pasche
- Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
- SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
| | - Julien Gobeill
- Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
- SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
| | - Valentine Rech de Laval
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
| | - Anne Gleizes
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
| | - Daniel Teixeira
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
| | - Pierre-André Michel
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
| | - Patrick Ruch
- Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
- SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
| | - Pascale Gaudet
- Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
| |
Collapse
|
7
|
Rinaldi F, Lithgow O, Gama-Castro S, Solano H, Lopez A, Muñiz Rascado LJ, Ishida-Gutiérrez C, Méndez-Cruz CF, Collado-Vides J. Strategies towards digital and semi-automated curation in RegulonDB. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017; 2017:3074784. [PMID: 28365731 PMCID: PMC5467564 DOI: 10.1093/database/bax012] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/07/2016] [Accepted: 01/30/2017] [Indexed: 02/03/2023]
Abstract
Experimentally generated biological information needs to be organized and structured in order to become meaningful knowledge. However, the rate at which new information is being published makes manual curation increasingly unable to cope. Devising new curation strategies that leverage upon data mining and text analysis is, therefore, a promising avenue to help life science databases to cope with the deluge of novel information. In this article, we describe the integration of text mining technologies in the curation pipeline of the RegulonDB database, and discuss how the process can enhance the productivity of the curators.
Specifically, a named entity recognition approach is used to pre-annotate terms referring to a set of domain entities which are potentially relevant for the curation process. The annotated documents are presented to the curator, who, thanks to a custom-designed interface, can select sentences containing specific types of entities, thus restricting the amount of text that needs to be inspected. Additionally, a module capable of computing semantic similarity between sentences across the entire collection of articles to be curated is being integrated in the system. We tested the module using three sets of scientific articles and six domain experts. All these improvements are gradually enabling us to obtain a high throughput curation process with the same quality as manual curation.
Collapse
Affiliation(s)
- Fabio Rinaldi
- Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich.,Institute of Computational Linguistics, University of Zurich, Andreasstrasse 15, Zurich 8050, Switzerland
| | - Oscar Lithgow
- Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
| | - Socorro Gama-Castro
- Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
| | - Hilda Solano
- Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
| | - Alejandra Lopez
- Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
| | - Luis José Muñiz Rascado
- Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
| | - Cecilia Ishida-Gutiérrez
- Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
| | - Carlos-Francisco Méndez-Cruz
- Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
| | - Julio Collado-Vides
- Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
| |
Collapse
|
8
|
Balderas-Martínez YI, Rinaldi F, Contreras G, Solano-Lira H, Sánchez-Pérez M, Collado-Vides J, Selman M, Pardo A. Improving biocuration of microRNAs in diseases: a case study in idiopathic pulmonary fibrosis. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017; 2017:3748307. [PMID: 28605770 PMCID: PMC5467562 DOI: 10.1093/database/bax030] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/30/2016] [Accepted: 03/25/2017] [Indexed: 12/24/2022]
Abstract
MicroRNAs (miRNAs) are small and non-coding RNA molecules that inhibit gene expression posttranscriptionally. They play important roles in several biological processes, and in recent years there has been an interest in studying how they are related to the pathogenesis of diseases. Although there are already some databases that contain information for miRNAs and their relation with illnesses, their curation represents a significant challenge due to the amount of information that is being generated every day. In particular, respiratory diseases are poorly documented in databases, despite the fact that they are of increasing concern regarding morbidity, mortality and economic impacts. In this work, we present the results that we obtained in the BioCreative Interactive Track (IAT), using a semiautomatic approach for improving biocuration of miRNAs related to diseases. Our procedures will be useful to complement databases that contain this type of information. We adapted the OntoGene text mining pipeline and the ODIN curation system in a full-text corpus of scientific publications concerning one specific respiratory disease: idiopathic pulmonary fibrosis, the most common and aggressive of the idiopathic interstitial cases of pneumonia. We curated 823 miRNA text snippets and found a total of 246 miRNAs related to this disease based on our semiautomatic approach with the system OntoGene/ODIN. The biocuration throughput improved by a factor of 12 compared with traditional manual biocuration. A significant advantage of our semiautomatic pipeline is that it can be applied to obtain the miRNAs of all the respiratory diseases and offers the possibility to be used for other illnesses. Database URL http://odin.ccg.unam.mx/ODIN/bc2015-miRNA/.
Collapse
Affiliation(s)
- Yalbi Itzel Balderas-Martínez
- Facultad de Ciencias, Departamento Biología Celular, Universidad Nacional Autónoma de México, Ciudad Universitaria, Circuito Exterior s/n, Coyoacán, CP 04510, Ciudad de México, CDMX, México.,CONACYT-INER Ismael Cosío Villegas, Departamento Investigación, Calzada de Tlalpan 4502 Sección XVI, Tlalpan, CP Ciudad de México, CDMX, México
| | - Fabio Rinaldi
- Swiss Institute of Bioinformatics and Institute of Computational Linguistics, University of Zurich, Andreasstrasse 15, CH-8050 Zurich, Switzerland.,Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
| | - Gabriela Contreras
- Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
| | - Hilda Solano-Lira
- Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
| | - Mishael Sánchez-Pérez
- Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
| | - Julio Collado-Vides
- Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
| | - Moisés Selman
- Instituto Nacional de Enfermedades Respiratorias Ismael Cosío Villegas, Dirección de Investigación Calzada de Tlalpan 4502 Sección XVI, Tlalpan, CP Ciudad de México, CDMX, México
| | - Annie Pardo
- Facultad de Ciencias, Departamento Biología Celular, Universidad Nacional Autónoma de México, Ciudad Universitaria, Circuito Exterior s/n, Coyoacán, CP 04510, Ciudad de México, CDMX, México
| |
Collapse
|
9
|
Burns GAPC, Dasigi P, de Waard A, Hovy EH. Automated detection of discourse segment and experimental types from the text of cancer pathway results sections. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw122. [PMID: 27580922 PMCID: PMC5006090 DOI: 10.1093/database/baw122] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/06/2015] [Accepted: 08/04/2016] [Indexed: 12/20/2022]
Abstract
Automated machine-reading biocuration systems typically use sentence-by-sentence information extraction to construct meaning representations for use by curators. This does not directly reflect the typical discourse structure used by scientists to construct an argument from the experimental data available within a article, and is therefore less likely to correspond to representations typically used in biomedical informatics systems (let alone to the mental models that scientists have). In this study, we develop Natural Language Processing methods to locate, extract, and classify the individual passages of text from articles’ Results sections that refer to experimental data. In our domain of interest (molecular biology studies of cancer signal transduction pathways), individual articles may contain as many as 30 small-scale individual experiments describing a variety of findings, upon which authors base their overall research conclusions. Our system automatically classifies discourse segments in these texts into seven categories (fact, hypothesis, problem, goal, method, result, implication) with an F-score of 0.68. These segments describe the essential building blocks of scientific discourse to (i) provide context for each experiment, (ii) report experimental details and (iii) explain the data’s meaning in context. We evaluate our system on text passages from articles that were curated in molecular biology databases (the Pathway Logic Datum repository, the Molecular Interaction MINT and INTACT databases) linking individual experiments in articles to the type of assay used (coprecipitation, phosphorylation, translocation etc.). We use supervised machine learning techniques on text passages containing unambiguous references to experiments to obtain baseline F1 scores of 0.59 for MINT, 0.71 for INTACT and 0.63 for Pathway Logic. Although preliminary, these results support the notion that targeting information extraction methods to experimental results could provide accurate, automated methods for biocuration. We also suggest the need for finer-grained curation of experimental methods used when constructing molecular biology databases
Collapse
Affiliation(s)
- Gully A P C Burns
- Information Sciences Institute, Viterbi School of Engineering, University of Southern California, Marina del Rey, CA 90292, USA
| | - Pradeep Dasigi
- Carnegie Mellon University, Language Technologies Institute, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
| | | | - Eduard H Hovy
- Carnegie Mellon University, Language Technologies Institute, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
| |
Collapse
|
10
|
Rodriguez-Esteban R. Biocuration with insufficient resources and fixed timelines. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2015; 2015:bav116. [PMID: 26708987 PMCID: PMC4691339 DOI: 10.1093/database/bav116] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/11/2015] [Accepted: 11/17/2015] [Indexed: 11/14/2022]
Abstract
Biological curation, or biocuration, is often studied from the perspective of creating and maintaining databases that have the goal of mapping and tracking certain areas of biology. However, much biocuration is, in fact, dedicated to finite and time-limited projects in which insufficient resources demand trade-offs. This typically more ephemeral type of curation is nonetheless of importance in biomedical research. Here, I propose a framework to understand such restricted curation projects from the point of view of return on curation (ROC), value, efficiency and productivity. Moreover, I suggest general strategies to optimize these curation efforts, such as the ‘multiple strategies’ approach, as well as a metric called overhead that can be used in the context of managing curation resources.
Collapse
Affiliation(s)
- Raul Rodriguez-Esteban
- Roche Pharmaceutical Research and Early Development, pRED Informatics, Roche Innovation Center Basel, Basel 4070, Switzerland
| |
Collapse
|
11
|
Gama-Castro S, Salgado H, Santos-Zavaleta A, Ledezma-Tejeida D, Muñiz-Rascado L, García-Sotelo JS, Alquicira-Hernández K, Martínez-Flores I, Pannier L, Castro-Mondragón JA, Medina-Rivera A, Solano-Lira H, Bonavides-Martínez C, Pérez-Rueda E, Alquicira-Hernández S, Porrón-Sotelo L, López-Fuentes A, Hernández-Koutoucheva A, Del Moral-Chávez V, Rinaldi F, Collado-Vides J. RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Res 2015; 44:D133-43. [PMID: 26527724 PMCID: PMC4702833 DOI: 10.1093/nar/gkv1156] [Citation(s) in RCA: 342] [Impact Index Per Article: 34.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2015] [Accepted: 10/19/2015] [Indexed: 01/28/2023] Open
Abstract
RegulonDB (http://regulondb.ccg.unam.mx) is one of the most useful and important resources on bacterial gene regulation,as it integrates the scattered scientific knowledge of the best-characterized organism, Escherichia coli K-12, in a database that organizes large amounts of data. Its electronic format enables researchers to compare their results with the legacy of previous knowledge and supports bioinformatics tools and model building. Here, we summarize our progress with RegulonDB since our last Nucleic Acids Research publication describing RegulonDB, in 2013. In addition to maintaining curation up-to-date, we report a collection of 232 interactions with small RNAs affecting 192 genes, and the complete repertoire of 189 Elementary Genetic Sensory-Response units (GENSOR units), integrating the signal, regulatory interactions, and metabolic pathways they govern. These additions represent major progress to a higher level of understanding of regulated processes. We have updated the computationally predicted transcription factors, which total 304 (184 with experimental evidence and 120 from computational predictions); we updated our position-weight matrices and have included tools for clustering them in evolutionary families. We describe our semiautomatic strategy to accelerate curation, including datasets from high-throughput experiments, a novel coexpression distance to search for ‘neighborhood’ genes to known operons and regulons, and computational developments.
Collapse
Affiliation(s)
- Socorro Gama-Castro
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Heladia Salgado
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Alberto Santos-Zavaleta
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Daniela Ledezma-Tejeida
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Luis Muñiz-Rascado
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Jair Santiago García-Sotelo
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Kevin Alquicira-Hernández
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Irma Martínez-Flores
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Lucia Pannier
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | | | - Alejandra Medina-Rivera
- Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México, Campus Juriquilla, Boulevard Juriquilla 3001, Juriquilla 76230, Santiago de Querétaro, QRO, Mexico
| | - Hilda Solano-Lira
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - César Bonavides-Martínez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Ernesto Pérez-Rueda
- Departamento de Microbiologia Molecular, IBT, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62100, Mexico
| | - Shirley Alquicira-Hernández
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Liliana Porrón-Sotelo
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Alejandra López-Fuentes
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Anastasia Hernández-Koutoucheva
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Víctor Del Moral-Chávez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Binzmühlestrasse 14, CH-8050 Zurich, Switzerland
| | - Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| |
Collapse
|
12
|
Every Site Counts: Submitting Transcription Factor-Binding Site Information through the CollecTF Portal. J Bacteriol 2015; 197:2454-7. [PMID: 26013488 DOI: 10.1128/jb.00031-15] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023] Open
Abstract
Experimentally verified transcription factor-binding sites represent an information-rich and highly applicable data type that aptly summarizes the results of time-consuming experiments and inference processes. Currently, there is no centralized repository for this type of data, which is routinely embedded in articles and extremely hard to mine. CollecTF provides the first standardized resource for submission and deposition of these data into the NCBI RefSeq database, maximizing its accessibility and prompting the community to adopt direct submission policies.
Collapse
|
13
|
Rinaldi F, Clematide S, Marques H, Ellendorff T, Romacker M, Rodriguez-Esteban R. OntoGene web services for biomedical text mining. BMC Bioinformatics 2014; 15 Suppl 14:S6. [PMID: 25472638 PMCID: PMC4255746 DOI: 10.1186/1471-2105-15-s14-s6] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/02/2023] Open
Abstract
Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art platform for text mining (OntoGene) which has been tested in several community-organized evaluation challenges, with top ranked results in several of them.
Collapse
|