1
Ross KE, Bastian FB, Buys M, Cook CE, D’Eustachio P, Harrison M, Hermjakob H, Li D, Lord P, Natale DA, Peters B, Sternberg PW, Su AI, Thakur M, Thomas PD, Bateman A. Perspectives on tracking data reuse across biodata resources. Bioinformatics Advances 2024; 4:vbae057. [PMID: 38721398 PMCID: PMC11076920 DOI: 10.1093/bioadv/vbae057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2024] [Revised: 03/13/2024] [Accepted: 04/11/2024] [Indexed: 06/14/2024]
Abstract
MOTIVATION Data reuse is a common and vital practice in molecular biology and enables the knowledge gathered over recent decades to drive discovery and innovation in the life sciences. Much of this knowledge has been collated into molecular biology databases, such as UniProtKB, and these resources derive enormous value from sharing data among themselves. However, quantifying and documenting this kind of data reuse remains a challenge. RESULTS The article reports on a one-day virtual workshop hosted by the UniProt Consortium in March 2023, attended by representatives from biodata resources, experts in data management, and NIH program managers. Workshop discussions focused on strategies for tracking data reuse, best practices for reusing data, and the challenges associated with data reuse and tracking. Surveys and discussions showed that data reuse is widespread, but critical information for reproducibility is sometimes lacking. Challenges include costs of tracking data reuse, tensions between tracking data and open sharing, restrictive licenses, and difficulties in tracking commercial data use. Recommendations that emerged from the discussion include: development of standardized formats for documenting data reuse, education about the obstacles posed by restrictive licenses, and continued recognition by funding agencies that data management is a critical activity that requires dedicated resources. AVAILABILITY AND IMPLEMENTATION Summaries of survey results are available at: https://docs.google.com/forms/d/1j-VU2ifEKb9C-sW6l3ATB79dgHdRk5v_lESv2hawnso/viewanalytics (survey of data providers) and https://docs.google.com/forms/d/18WbJFutUd7qiZoEzbOytFYXSfWFT61hVce0vjvIwIjk/viewanalytics (survey of users).
Affiliation(s)
- Karen E Ross
  - Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC 20007, United States
- Frederic B Bastian
  - Evolutionary Bioinformatics Group, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland
- Peter D’Eustachio
  - Department of Biochemistry & Molecular Pharmacology, NYU Grossman School of Medicine, New York, NY 10012, United States
- Melissa Harrison
  - Literature Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Henning Hermjakob
  - Molecular Systems, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Donghui Li
  - Chan Zuckerberg Initiative, Redwood City, CA 94063, United States
- Phillip Lord
  - School of Computing, Newcastle University, Newcastle upon Tyne NE4 5TG, United Kingdom
- Darren A Natale
  - Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC 20007, United States
- Bjoern Peters
  - Center for Vaccine Innovation, La Jolla Institute of Immunology, La Jolla, CA 92037, United States
- Paul W Sternberg
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, United States
- Andrew I Su
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, United States
- Matthew Thakur
  - Data Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SA, United Kingdom
- Paul D Thomas
  - Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90089, United States
- Alex Bateman
  - MSCB, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
2
Zhang Q, Zheng W, Song Z, Zhang Q, Yang L, Wu J, Lin J, Xu G, Yu H. Machine Learning Enables Prediction of Pyrrolysyl-tRNA Synthetase Substrate Specificity. ACS Synth Biol 2023; 12:2403-2417. [PMID: 37486975 DOI: 10.1021/acssynbio.3c00225] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 07/26/2023]
Abstract
Knowledge about the substrate scope for a given enzyme is informative for elucidating biochemical pathways and also for expanding applications of the enzyme. However, no general methods are available to accurately predict the substrate specificity of an enzyme. Pyrrolysyl-tRNA synthetase (PylRS) is a powerful tool for incorporating various noncanonical amino acids (NCAAs) into proteins, which enabled us to probe, image, rationally engineer, and evolve protein structure and function. However, the incorporation of a new NCAA typically requires the selection of large libraries of PylRS with randomized mutations at active sites, and this process requires multiple rounds of selection for each new substrate. Therefore, a single aminoacyl-tRNA synthetase with broad substrate promiscuity is ideal to facilitate widespread applications of the genetic NCAA incorporation technique. Herein, machine learning models were developed to predict the substrate specificity of PylRS to accept novel NCAAs that could be incorporated into proteins by three PylRS mutants. The models were built from a training set of 285 unique enzyme-substrate pairs of three PylRS mutants including IFRS, BtaRS, and MFRS against 95 NCAAs. The best BaggingTree (BT) model was then used to virtually screen an NCAA library containing 1474 phenylalanine, tyrosine, tryptophan, and alanine analogues, and 156 NCAAs were predicted to be accepted by at least one of the three PylRS mutants. Then, 27 NCAAs including 24 positive and 3 negative substrates were experimentally tested for their activities, and 20 of the 24 positive substrates showed weak or strong activity and were accepted by at least one PylRS mutant, among which 11 NCAAs had never been reported to be incorporated into proteins before. The three negative substrates did not show any activity. Experimental results suggested that the BT model provides a three-class classification accuracy of 0.69 and a binary classification accuracy of 0.86. This study expanded the substrate scope of three PylRS variants and provided a framework for developing machine learning models to predict substrate specificity of other PylRS variants.
Affiliation(s)
- Qunfeng Zhang
  - Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China
- Wenlong Zheng
  - ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou 311200, Zhejiang, China
- Zhongdi Song
  - Key Laboratory of Pollution Exposure and Health Intervention of Zhejiang Province, Interdisciplinary Research Academy, Zhejiang Shuren University, Hangzhou 310015, China
- Qiang Zhang
  - ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou 311200, Zhejiang, China
  - College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, Zhejiang, China
- Lirong Yang
  - Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China
  - ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou 311200, Zhejiang, China
- Jianping Wu
  - Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China
  - ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou 311200, Zhejiang, China
- Jianping Lin
  - Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China
- Gang Xu
  - Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China
- Haoran Yu
  - Institute of Bioengineering, College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China
  - ZJU-Hangzhou Global Scientific and Technological Innovation Centre, Hangzhou 311200, Zhejiang, China
3
Boguslav MR, Salem NM, White EK, Sullivan KJ, Bada M, Hernandez TL, Leach SM, Hunter LE. Creating an ignorance-base: Exploring known unknowns in the scientific literature. J Biomed Inform 2023; 143:104405. [PMID: 37270143 PMCID: PMC10528083 DOI: 10.1016/j.jbi.2023.104405] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/08/2022] [Revised: 05/18/2023] [Accepted: 05/21/2023] [Indexed: 06/05/2023]
Abstract
BACKGROUND Scientific discovery progresses by exploring new and uncharted territory. More specifically, it advances by a process of transforming unknown unknowns first into known unknowns, and then into knowns. Over the last few decades, researchers have developed many knowledge bases to capture and connect the knowns, which has enabled topic exploration and contextualization of experimental results. But recognizing the unknowns is also critical for finding the most pertinent questions and their answers. Prior work on known unknowns has sought to understand them, annotate them, and automate their identification. However, no knowledge-bases yet exist to capture these unknowns, and little work has focused on how scientists might use them to trace a given topic or experimental result in search of open questions and new avenues for exploration. We show here that a knowledge base of unknowns can be connected to ontologically grounded biomedical knowledge to accelerate research in the field of prenatal nutrition. RESULTS We present the first ignorance-base, a knowledge-base created by combining classifiers to recognize ignorance statements (statements of missing or incomplete knowledge that imply a goal for knowledge) and biomedical concepts over the prenatal nutrition literature. This knowledge-base places biomedical concepts mentioned in the literature in context with the ignorance statements authors have made about them. Using our system, researchers interested in the topic of vitamin D and prenatal health were able to uncover three new avenues for exploration (immune system, respiratory system, and brain development) by searching for concepts enriched in ignorance statements. These were buried among the many standard enriched concepts. Additionally, we used the ignorance-base to enrich concepts connected to a gene list associated with vitamin D and spontaneous preterm birth and found an emerging topic of study (brain development) in an implied field (neuroscience). The researchers could look to the field of neuroscience for potential answers to the ignorance statements. CONCLUSION Our goal is to help students, researchers, funders, and publishers better understand the state of our collective scientific ignorance (known unknowns) in order to help accelerate research through the continued illumination of and focus on the known unknowns and their respective goals for scientific knowledge.
Affiliation(s)
- Mayla R Boguslav
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Nourah M Salem
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Elizabeth K White
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA; Center for Genes, Environment and Health, National Jewish Health, Jackson Street, Denver, 80206, CO, USA
- Katherine J Sullivan
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Michael Bada
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Teri L Hernandez
  - College of Nursing, Department of Medicine/Division of Endocrinology, Metabolism, & Diabetes, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
- Sonia M Leach
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA; Center for Genes, Environment and Health, National Jewish Health, Jackson Street, Denver, 80206, CO, USA
- Lawrence E Hunter
  - Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
4
Joshi P, Banerjee S, Hu X, Khade PM, Friedberg I. GOThresher: a program to remove annotation biases from protein function annotation datasets. Bioinformatics 2023; 39:6998200. [PMID: 36688705 DOI: 10.1093/bioinformatics/btad048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2022] [Revised: 11/30/2022] [Accepted: 01/20/2023] [Indexed: 01/24/2023]
Abstract
MOTIVATION Advances in sequencing technologies have led to a surge in genomic data, although the functions of many gene products coded by these genes remain unknown. While in-depth, targeted experiments that determine the functions of these gene products are crucial and routinely performed, they fail to keep up with the inflow of novel genomic data. In an attempt to address this gap, high-throughput experiments are being conducted in which a large number of genes are investigated in a single study. The annotations generated as a result of these experiments are generally biased towards a small subset of less informative Gene Ontology (GO) terms. Identifying and removing biases from protein function annotation databases is important since biases impact our understanding of protein function by providing a poor picture of the annotation landscape. Additionally, as machine learning methods for predicting protein function are becoming increasingly prevalent, it is essential that they are trained on unbiased datasets. Therefore, it is not only crucial to be aware of biases, but also to judiciously remove them from annotation datasets. RESULTS We introduce GOThresher, a Python tool that identifies and removes biases in function annotations from protein function annotation databases. AVAILABILITY AND IMPLEMENTATION GOThresher is written in Python and released via PyPI https://pypi.org/project/gothresher/ and on the Bioconda Anaconda channel https://anaconda.org/bioconda/gothresher. The source code is hosted on GitHub https://github.com/FriedbergLab/GOThresher and distributed under the GPL 3.0 license. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Parnal Joshi
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA; Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
- Sagnik Banerjee
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA; Department of Statistics, Iowa State University, Ames, IA 50011, USA
- Xiao Hu
  - Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
- Pranav M Khade
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA; Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011, USA
- Iddo Friedberg
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA; Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
5
Goudey B, Geard N, Verspoor K, Zobel J. Propagation, detection and correction of errors using the sequence database network. Brief Bioinform 2022; 23:6764545. [PMID: 36266246 PMCID: PMC9677457 DOI: 10.1093/bib/bbac416] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/10/2022] [Revised: 07/31/2022] [Accepted: 08/28/2022] [Indexed: 12/14/2022]
Abstract
Nucleotide and protein sequences stored in public databases are the cornerstone of many bioinformatics analyses. The records containing these sequences are prone to a wide range of errors, including incorrect functional annotation, sequence contamination and taxonomic misclassification. One source of information that can help to detect errors is the strong interdependency between records. Novel sequences in one database draw their annotations from existing records, may generate new records in multiple other locations and will have varying degrees of similarity with existing records across a range of attributes. A network perspective of these relationships between sequence records, within and across databases, offers new opportunities to detect, or even correct, erroneous entries and more broadly to make inferences about record quality. Here, we describe this novel perspective of sequence database records as a rich network, which we call the sequence database network, and illustrate the opportunities this perspective offers for quantification of database quality and detection of spurious entries. We provide an overview of the relevant databases and describe how the interdependencies between sequence records across these databases can be exploited by network analyses. We review the process of sequence annotation and provide a classification of sources of error, highlighting propagation as a major source. We illustrate the value of a network perspective through three case studies that use network analysis to detect errors, and explore the quality and quantity of critical relationships that would inform such network analyses. This systematic description of a network perspective of sequence database records provides a novel direction to combat the proliferation of errors within these critical bioinformatics resources.
Affiliation(s)
- Benjamin Goudey
  - School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010 (corresponding author)
- Nicholas Geard
  - School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010
- Karin Verspoor
  - School of Computing Technologies, RMIT University Melbourne, Victoria, 3000
- Justin Zobel
  - School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010
6
Boguslav MR, Salem NM, White EK, Leach SM, Hunter LE. Identifying and classifying goals for scientific knowledge. Bioinformatics Advances 2021; 1:vbab012. [PMID: 34661112 PMCID: PMC8508177 DOI: 10.1093/bioadv/vbab012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2021] [Revised: 06/17/2021] [Indexed: 01/26/2023]
Abstract
MOTIVATION Science progresses by posing good questions, yet work in biomedical text mining has not focused on them much. We propose a novel idea for biomedical natural language processing: identifying and characterizing the questions stated in the biomedical literature. Formally, the task is to identify and characterize statements of ignorance, statements where scientific knowledge is missing or incomplete. The creation of such technology could have many significant impacts, from the training of PhD students to ranking publications and prioritizing funding based on particular questions of interest. The work presented here is intended as the first step towards these goals. RESULTS We present a novel ignorance taxonomy driven by the role statements of ignorance play in research, identifying specific goals for future scientific knowledge. Using this taxonomy and reliable annotation guidelines (inter-annotator agreement above 80%), we created a gold standard ignorance corpus of 60 full-text documents from the prenatal nutrition literature with over 10 000 annotations and used it to train classifiers that achieved over 0.80 F1 scores. AVAILABILITY AND IMPLEMENTATION Corpus and source code freely available for download at https://github.com/UCDenver-ccp/Ignorance-Question-Work. The source code is implemented in Python.
Affiliation(s)
- Mayla R Boguslav
  - Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA (to whom correspondence should be addressed)
- Nourah M Salem
  - Health Informatics Program, College of Health Solutions at Arizona State University, Phoenix, AZ 85004, USA
- Elizabeth K White
  - Center for Genes, Environment and Health, National Jewish Health, Denver, CO 80206, USA
- Sonia M Leach
  - Center for Genes, Environment and Health, National Jewish Health, Denver, CO 80206, USA
- Lawrence E Hunter
  - Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
7
Bastian FB, Roux J, Niknejad A, Comte A, Fonseca Costa SS, de Farias TM, Moretti S, Parmentier G, de Laval VR, Rosikiewicz M, Wollbrett J, Echchiki A, Escoriza A, Gharib WH, Gonzales-Porta M, Jarosz Y, Laurenczy B, Moret P, Person E, Roelli P, Sanjeev K, Seppey M, Robinson-Rechavi M. The Bgee suite: integrated curated expression atlas and comparative transcriptomics in animals. Nucleic Acids Res 2021; 49:D831-D847. [PMID: 33037820 PMCID: PMC7778977 DOI: 10.1093/nar/gkaa793] [Citation(s) in RCA: 113] [Impact Index Per Article: 28.3] [Received: 07/14/2020] [Revised: 08/24/2020] [Accepted: 09/15/2020] [Indexed: 01/24/2023]
Abstract
Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced by integrating multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It is based exclusively on curated healthy wild-type expression data (e.g., no gene knock-out, no treatment, no disease), to provide a comparable reference of normal gene expression. Curation includes very large datasets such as GTEx (re-annotation of samples as ‘healthy’ or not) as well as many small ones. Data are integrated and made comparable between species thanks to consistent data annotation and processing, and to calls of presence/absence of expression, along with expression scores. As a result, Bgee is capable of detecting the conditions of expression of any single gene, accommodating any data type and species. Bgee provides several tools for analyses, allowing, e.g., automated comparisons of gene expression patterns within and between species, retrieval of the preferred conditions of expression of any gene, or enrichment analyses of conditions with expression of sets of genes. Bgee release 14.1 includes 29 animal species, and is available at https://bgee.org/ and through its Bioconductor R package BgeeDB.
Affiliation(s)
- Frederic B Bastian
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Julien Roux
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Anne Niknejad
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Aurélie Comte
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Sara S Fonseca Costa
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Tarcisio Mendes de Farias
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Sébastien Moretti
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Gilles Parmentier
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Valentine Rech de Laval
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Marta Rosikiewicz
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Julien Wollbrett
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Amina Echchiki
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Angélique Escoriza
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Walid H Gharib
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Mar Gonzales-Porta
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Yohan Jarosz
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Balazs Laurenczy
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Philippe Moret
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Emilie Person
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Patrick Roelli
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Komal Sanjeev
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Mathieu Seppey
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Marc Robinson-Rechavi
  - Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
8
Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases. Genomics Proteomics & Bioinformatics 2020; 18:91-103. [PMID: 32652120 PMCID: PMC7646089 DOI: 10.1016/j.gpb.2018.11.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 12/08/2017] [Revised: 10/24/2018] [Accepted: 12/14/2018] [Indexed: 11/27/2022]
9
Giglio M, Tauber R, Nadendla S, Munro J, Olley D, Ball S, Mitraka E, Schriml LM, Gaudet P, Hobbs ET, Erill I, Siegele DA, Hu JC, Mungall C, Chibucos MC. ECO, the Evidence & Conclusion Ontology: community standard for evidence information. Nucleic Acids Res 2019; 47:D1186-D1194. [PMID: 30407590 PMCID: PMC6323956 DOI: 10.1093/nar/gky1036] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 09/14/2018] [Accepted: 10/16/2018] [Indexed: 12/03/2022]
Abstract
The Evidence and Conclusion Ontology (ECO) contains terms (classes) that describe types of evidence and assertion methods. ECO terms are used in the process of biocuration to capture the evidence that supports biological assertions (e.g. gene product X has function Y as supported by evidence Z). Capture of this information allows tracking of annotation provenance, establishment of quality control measures and query of evidence. ECO contains over 1500 terms and is in use by many leading biological resources including the Gene Ontology, UniProt and several model organism databases. ECO is continually being expanded and revised based on the needs of the biocuration community. The ontology is freely available for download from GitHub (https://github.com/evidenceontology/) or the project’s website (http://evidenceontology.org/). Users can request new terms or changes to existing terms through the project’s GitHub site. ECO is released into the public domain under CC0 1.0 Universal.
Affiliation(s)
- Michelle Giglio
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Rebecca Tauber
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Suvarna Nadendla
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - James Munro
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Dustin Olley
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Shoshannah Ball
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Elvira Mitraka
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Lynn M Schriml
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| | - Pascale Gaudet
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Elizabeth T Hobbs
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD 21250, USA
| | - Ivan Erill
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD 21250, USA
| | - Deborah A Siegele
- Department of Biology, Texas A&M University, College Station, TX 77840, USA
| | - James C Hu
- Department of Biochemistry and Biophysics, Texas A&M University, College Station, TX 77840, USA
| | - Chris Mungall
- Molecular Ecosystems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Marcus C Chibucos
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
| |
|
10
|
Affiliation(s)
- Stanislav Mazurenko
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
| | - Zbynek Prokop
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
- International Centre for Clinical Research, St. Ann’s Hospital, 602 00 Brno, Czech Republic
| | - Jiri Damborsky
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 625 00 Brno, Czech Republic
- International Centre for Clinical Research, St. Ann’s Hospital, 602 00 Brno, Czech Republic
| |
|
11
|
Lokhande KB, Nagar S, Swamy KV. Molecular interaction studies of Deguelin and its derivatives with Cyclin D1 and Cyclin E in cancer cell signaling pathway: The computational approach. Sci Rep 2019; 9:1778. [PMID: 30741976 PMCID: PMC6370771 DOI: 10.1038/s41598-018-38332-6] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2018] [Accepted: 11/19/2018] [Indexed: 11/09/2022] Open
Abstract
Deguelin is a major active ingredient and principal component in several plants, and it is a potential molecule for targeting proteins of cancer cell signaling pathways. As a complex natural extract, deguelin interacts with various molecular targets to exert its anti-tumor properties at the nanomolar level. It induces cell apoptosis by blocking anti-apoptotic pathways, while inhibiting tumor cell multiplication and malignant transformation through the p27-cyclin E-pRb-E2F1 cell cycle control and HIF-1alpha-VEGF antiangiogenic pathways. In silico studies of deguelin and its derivatives are performed to explore their interactions with Cyclin D1 and Cyclin E and to gain molecular insight into how the derivatives bind these receptors. Deguelin and its derivatives are minimized with Avogadro to achieve a stable conformation. All docking simulations are performed with AutoDock Vina, and virtual screening of docked ligands is carried out based on binding energy and number of hydrogen bonds. Molecular dynamics (MD) simulations of Cyclin D1 and Cyclin E1 are performed for 100 ns, and stable conformations are obtained at 78 ns and 19 ns, respectively. Ligands thus obtained from docking studies may be probable candidates to inhibit cancer cell signaling pathways.
Affiliation(s)
- Kiran Bharat Lokhande
- Bioinformatics Research Laboratory, Dr. D. Y. Patil Biotechnology and Bioinformatics Institute, Dr. D. Y. Patil Vidyapeeth, Pune, 411033, India
| | - Shuchi Nagar
- Bioinformatics Research Laboratory, Dr. D. Y. Patil Biotechnology and Bioinformatics Institute, Dr. D. Y. Patil Vidyapeeth, Pune, 411033, India
| | - K Venkateswara Swamy
- Bioinformatics Research Laboratory, Dr. D. Y. Patil Biotechnology and Bioinformatics Institute, Dr. D. Y. Patil Vidyapeeth, Pune, 411033, India.
| |
|
12
|
Darden L, Kundu K, Pal LR, Moult J. Harnessing formal concepts of biological mechanism to analyze human disease. PLoS Comput Biol 2018; 14:e1006540. [PMID: 30586388 PMCID: PMC6306204 DOI: 10.1371/journal.pcbi.1006540] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022] Open
Abstract
Mechanism is a widely used concept in biology. In 2017, more than 10% of PubMed abstracts used the term. Therefore, searching for and reasoning about mechanisms is fundamental to much of biomedical research, but until now there has been almost no computational infrastructure for this purpose. Recent work in the philosophy of science has explored the central role that the search for mechanistic accounts of biological phenomena plays in biomedical research, providing a conceptual basis for representing and analyzing biological mechanism. The foundational categories for components of mechanisms (entities and activities) guide the development of general, abstract types of biological mechanism parts. Building on that analysis, we have developed a formal framework for describing and representing biological mechanism, MecCog, and applied it to describing mechanisms underlying human genetic disease. Mechanisms are depicted using a graphical notation. Key features are assignment of mechanism components to stages of biological organization and classes; visual representation of uncertainty, ignorance, and ambiguity; and tight integration with literature sources. The MecCog framework facilitates analysis of many aspects of disease mechanism, including the prioritization of future experiments, probing of gene-drug and gene-environment interactions, identification of possible new drug targets, personalized drug choice, analysis of nonlinear interactions between relevant genetic loci, and classification of diseases based on mechanism.
Affiliation(s)
- Lindley Darden
- Department of Philosophy, University of Maryland College Park, College Park, Maryland, United States of America
| | - Kunal Kundu
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland, United States of America
- Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland College Park, College Park, Maryland, United States of America
| | - Lipika R. Pal
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland, United States of America
| | - John Moult
- Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland, United States of America
- Department of Cell Biology and Molecular Genetics, University of Maryland College Park, College Park, Maryland, United States of America
| |
|
13
|
Frias-Soler RC, Villarín Pildaín L, Hotz-Wagenblatt A, Kolibius J, Bairlein F, Wink M. De novo annotation of the transcriptome of the Northern Wheatear (Oenanthe oenanthe). PeerJ 2018; 6:e5860. [PMID: 30498627 PMCID: PMC6251345 DOI: 10.7717/peerj.5860] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2017] [Accepted: 10/02/2018] [Indexed: 11/20/2022] Open
Abstract
We have sequenced a partial transcriptome of the Northern Wheatear (Oenanthe oenanthe), a species with one of the longest migrations on Earth. The transcriptome was constructed de novo using RNA-Seq sequence data from the pooled mRNA of six different tissues: brain, muscle, intestine, liver, adipose tissue and skin. The samples came from nine captive-bred wheatears collected at three different stages of the endogenous autumn migratory period: (1) lean birds prior to the onset of migration, (2) birds during the fattening stage and (3) individuals at their migratory body mass plateau, when they have almost doubled their lean body mass. The sample structure used to build the transcriptome of the Northern Wheatear, in terms of tissue composition and timing, supports future surveys of the regulatory genes involved in the development of the migratory phenotype. Through the pre-migratory period, birds accomplish outstanding physical and behavioural changes that involve all organ systems. Nevertheless, the molecular mechanisms through which birds synchronize and control hyperphagia, fattening, increased restlessness, immunity boosting and the tuning of muscles for such endurance flight are still largely unknown. The use of RNA-Seq has emerged as a powerful tool to analyse complex traits on a broad scale, and we believe it can help to characterize the migratory phenotype of wheatears at an unprecedented level. The primary challenge in conducting quantitative transcriptomic studies in non-model species is the availability of a reference transcriptome, which we have constructed and described in this paper. The cDNA was sequenced by pyrosequencing using the Genome Sequencer Roche GS FLX System, with single paired-end reads of about 400 bp. We estimate the total number of genes at 15,640, of which 67% could be annotated using Turkey and Zebra Finch genomes, or protein sequence information from SwissProt and NCBI databases. With our study, we have made a first step towards understanding the migratory phenotype, in terms of gene expression, of a species that has become a model for studying long-distance bird migration.
Affiliation(s)
- Roberto Carlos Frias-Soler
- Institute of Pharmacy and Molecular Biotechnology, Heidelberg University, Heidelberg, Baden Württemberg, Germany; Institute of Avian Research, Wilhelmshaven, Germany
| | - Lilian Villarín Pildaín
- Institute of Pharmacy and Molecular Biotechnology, Heidelberg University, Heidelberg, Baden Württemberg, Germany
| | - Agnes Hotz-Wagenblatt
- Bioinformatics Group, Core Facility Genomics and Proteomics, German Cancer Research Center, Heidelberg University, Heidelberg, Baden Württemberg, Germany
| | - Jonas Kolibius
- Institute of Pharmacy and Molecular Biotechnology, Heidelberg University, Heidelberg, Baden Württemberg, Germany
| | | | - Michael Wink
- Institute of Pharmacy and Molecular Biotechnology, Heidelberg University, Heidelberg, Baden Württemberg, Germany
| |
|
14
|
Soto AJ, Zerva C, Batista-Navarro R, Ananiadou S. LitPathExplorer: a confidence-based visual text analytics tool for exploring literature-enriched pathway models. Bioinformatics 2018; 34:1389-1397. [PMID: 29228271 DOI: 10.1093/bioinformatics/btx774] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2017] [Accepted: 12/07/2017] [Indexed: 01/25/2023] Open
Abstract
Motivation Pathway models are valuable resources that help us understand the various mechanisms underpinning complex biological processes. Their curation is typically carried out through manual inspection of published scientific literature to find information relevant to a model, which is a laborious and knowledge-intensive task. Furthermore, models curated manually cannot be easily updated and maintained with new evidence extracted from the literature without automated support. Results We have developed LitPathExplorer, a visual text analytics tool that integrates advanced text mining, semi-supervised learning and interactive visualization, to facilitate the exploration and analysis of pathway models using statements (i.e. events) extracted automatically from the literature and organized according to levels of confidence. LitPathExplorer supports pathway modellers and curators alike by: (i) extracting events from the literature that corroborate existing models with evidence; (ii) discovering new events which can update models; and (iii) providing a confidence value for each event that is automatically computed based on linguistic features and article metadata. Our evaluation of event extraction showed a precision of 89% and a recall of 71%. Evaluation of our confidence measure, when used for ranking sampled events, showed an average precision ranging between 61 and 73%, which can be improved to 95% when the user is involved in the semi-supervised learning process. Qualitative evaluation using pair analytics based on the feedback of three domain experts confirmed the utility of our tool within the context of pathway model exploration. Availability and implementation LitPathExplorer is available at http://nactem.ac.uk/LitPathExplorer_BI/. Contact sophia.ananiadou@manchester.ac.uk. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Axel J Soto
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester M1 7DN, UK
| | - Chrysoula Zerva
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester M1 7DN, UK
| | - Riza Batista-Navarro
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester M1 7DN, UK
| | - Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester M1 7DN, UK
| |
|
15
|
Stress-Adaptive Responses Associated with High-Level Carbapenem Resistance in KPC-Producing Klebsiella pneumoniae. J Pathog 2018; 2018:3028290. [PMID: 29657865 PMCID: PMC5883989 DOI: 10.1155/2018/3028290] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2017] [Accepted: 02/13/2018] [Indexed: 01/13/2023] Open
Abstract
Carbapenem-resistant Enterobacteriaceae (CRE) organisms have emerged to become a major global public health threat among antimicrobial resistant bacterial human pathogens. Little is known about how CREs emerge. One characteristic phenotype of CREs is heteroresistance, which is clinically associated with treatment failure in patients given a carbapenem. Through in vitro whole-transcriptome analysis we tracked gene expression over time in two different strains (BR7, BR21) of heteroresistant KPC-producing Klebsiella pneumoniae, first exposed to a bactericidal concentration of imipenem followed by growth in drug-free medium. In both strains, the immediate response was dominated by a shift in expression of genes involved in glycolysis toward those involved in catabolic pathways. This response was followed by global dampening of transcriptional changes involving protein translation, folding and transport, and decreased expression of genes encoding critical junctures of lipopolysaccharide biosynthesis. The emerged high-level carbapenem-resistant BR21 subpopulation had a prophage (IS1) disrupting ompK36 associated with irreversible OmpK36 porin loss. On the other hand, OmpK36 loss in BR7 was reversible. The acquisition of high-level carbapenem resistance by the two heteroresistant strains was associated with distinct and shared stepwise transcriptional programs. Carbapenem heteroresistance may emerge from the most adaptive subpopulation among a population of cells undergoing a complex set of stress-adaptive responses.
|
16
|
Hawkins LK, Warburton ML, Tang J, Tomashek J, Alves Oliveira D, Ogunola OF, Smith JS, Williams WP. Survey of Candidate Genes for Maize Resistance to Infection by Aspergillus flavus and/or Aflatoxin Contamination. Toxins (Basel) 2018; 10:E61. [PMID: 29385107 PMCID: PMC5848162 DOI: 10.3390/toxins10020061] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2017] [Revised: 01/20/2018] [Accepted: 01/24/2018] [Indexed: 12/21/2022] Open
Abstract
Many projects have identified candidate genes for resistance to aflatoxin accumulation or Aspergillus flavus infection and growth in maize using genetic mapping, genomics, transcriptomics and/or proteomics studies. However, only a small percentage of these candidates have been validated in field conditions, and their relative contribution to resistance, if any, is unknown. This study presents a consolidated list of candidate genes identified in past studies or in-house studies, with descriptive data including genetic location, gene annotation, known protein identifiers, and associated pathway information, if known. A candidate gene pipeline to test the phenotypic effect of any maize DNA sequence on aflatoxin accumulation resistance was used in this study to determine any measurable effect on polymorphisms within or linked to the candidate gene sequences, and the results are published here.
Affiliation(s)
- Leigh K Hawkins
- USDA ARS Corn Host Plant Resistance Research Unit, Mississippi State, MS 39762, USA.
| | - Marilyn L Warburton
- USDA ARS Corn Host Plant Resistance Research Unit, Mississippi State, MS 39762, USA.
| | - Juliet Tang
- USDA FS Durability and Wood Protection Research Unit, Starkville, MS 39759, USA.
| | - John Tomashek
- Integrated Micro-Chromatography Systems LLC, Irmo, SC 29063, USA.
| | - Dafne Alves Oliveira
- Department of Biochemistry, Molecular Biology, Entomology and Plant Pathology, Mississippi State University, Starkville, MS 39762 USA.
| | - Oluwaseun F Ogunola
- Department of Plant and Soil Sciences, Mississippi State University, Starkville, MS 39762, USA.
| | - J Spencer Smith
- USDA ARS Corn Host Plant Resistance Research Unit, Mississippi State, MS 39762, USA.
| | - W Paul Williams
- USDA ARS Corn Host Plant Resistance Research Unit, Mississippi State, MS 39762, USA.
| |
|
17
|
Larsen RR, Hastings J. From Affective Science to Psychiatric Disorder: Ontology as a Semantic Bridge. Front Psychiatry 2018; 9:487. [PMID: 30349491 PMCID: PMC6186823 DOI: 10.3389/fpsyt.2018.00487] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/13/2018] [Accepted: 09/18/2018] [Indexed: 12/25/2022] Open
Abstract
Advances in emotion and affective science have yet to translate routinely into psychiatric research and practice. This is unfortunate since emotion and affect are fundamental components of many psychiatric conditions. Rectifying this lack of interdisciplinary integration could thus be a potential avenue for improving psychiatric diagnosis and treatment. In this contribution, we propose and discuss an ontological framework for explicitly capturing the complex interrelations between affective entities and psychiatric disorders, in order to facilitate mapping and integration between affective science and psychiatric diagnostics. We build on and enhance the categorisation of emotion, affect and mood within the previously developed Emotion Ontology, and that of psychiatric disorders in the Mental Disease Ontology. This effort further draws on developments in formal ontology regarding the distinction between normal and abnormal in order to formalize the interconnections. This operational semantic framework is relevant for applications including clarifying psychiatric diagnostic categories, clinical information systems, and the integration and translation of research results across disciplines.
Affiliation(s)
- Rasmus Rosenberg Larsen
- Department of Philosophy and Forensic Science Program, University of Toronto, Mississauga, ON, Canada
| | - Janna Hastings
- Department of Biological Sciences, Babraham Institute, University of Cambridge, Cambridge, United Kingdom
| |
|
18
|
Júnior GAO, Perez BC, Cole JB, Santana MHA, Silveira J, Mazzoni G, Ventura RV, Júnior MLS, Kadarmideen HN, Garrick DJ, Ferraz JBS. Genomic study and Medical Subject Headings enrichment analysis of early pregnancy rate and antral follicle numbers in Nelore heifers. J Anim Sci 2017; 95:4796-4812. [PMID: 29293733 PMCID: PMC6292327 DOI: 10.2527/jas2017.1752] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2017] [Accepted: 08/24/2017] [Indexed: 12/18/2022] Open
Abstract
Zebu animals (Bos indicus) are known to take longer to reach puberty compared with taurine animals (Bos taurus), limiting the supply of animals for harvest or breeding and impacting profitability. Genomic information can be a helpful tool to better understand complex traits and improve genetic gains. In this study, we performed a genomewide association study (GWAS) to identify genetic variants associated with reproductive traits in Nelore beef cattle. Heifer pregnancy (HP) was recorded for 1,267 genotyped animals distributed in 12 contemporary groups (CG) with an average pregnancy rate of 0.35 (±0.01). Disregarding one of these CG, the number of antral follicles (NF) was also collected for 937 of these animals, with an average of 11.53 (±4.43). The animals were organized in CG: 12 and 11 for HP and NF, respectively. Genes in linkage disequilibrium (LD) with the associated variants can be considered in a functional enrichment analysis to identify biological mechanisms involved in fertility. Medical Subject Headings (MeSH) were detected using the MESHR package, allowing the extraction of broad meanings from the gene lists provided by the GWAS. The estimated heritability for HP was 0.28 ± 0.07 and for NF was 0.49 ± 0.09, with the genomic correlation being -0.21 ± 0.29. The average LD between adjacent markers was 0.23 ± 0.01, and GWAS identified genomic windows that accounted for >1% of total genetic variance on chromosomes 5, 14, and 18 for HP and on chromosomes 2, 8, 11, 14, 15, 16, and 22 for NF. The MeSH enrichment analyses revealed significant (P < 0.05) terms associated with HP ("Munc18 Proteins," "Fucose," and "Hemoglobins") and with NF ("Cathepsin B," "Receptors, Neuropeptide," and "Palmitic Acid"). This is the first study in Nelore cattle introducing the concept of MeSH analysis. The genomic analyses contributed to a better understanding of the genetic control of the reproductive traits HP and NF and provide new selection strategies to improve beef production.
Affiliation(s)
| | - B. C. Perez
- Universidade de São Paulo (USP), Pirassununga, SP, Brazil
| | - J. B. Cole
- Animal Genomics and Improvement Laboratory, Agricultural Research Service, USDA, Beltsville, MD 20705-2350
| | | | - J. Silveira
- Universidade de São Paulo (USP), Pirassununga, SP, Brazil
| | - G. Mazzoni
- Department of Veterinary and Animal Sciences, University of Copenhagen, Denmark
- Section of Systems Genomics, Department of Bio and Health Informatics, Technical University of Denmark, Kemitorvet, 2800 Kgs. Lyngby, Denmark
| | - R. V. Ventura
- Beef Improvement Opportunities, Guelph, ON N1K1E5, Canada
- Centre for Genetic Improvement of Livestock, University of Guelph, Guelph, ON N1G2W1, Canada
| | | | - H. N. Kadarmideen
- Section of Systems Genomics, Department of Bio and Health Informatics, Technical University of Denmark, Kemitorvet, 2800 Kgs. Lyngby, Denmark
| | | | | |
|
19
|
Bouadjenek MR, Verspoor K, Zobel J. Automated detection of records in biological sequence databases that are inconsistent with the literature. J Biomed Inform 2017. [PMID: 28624643 DOI: 10.1016/j.jbi.2017.06.015] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]
Abstract
We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as "confident" or "suspicious". Our experiments on the PubMed Central collection show that assessing the coherence between the literature and database records, through our algorithms, is an effective mechanism for assisting curators in performing data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we would expect that there are many more in GenBank that have not yet been identified. By automated comparison with literature they can be identified with a precision of up to 10% and a recall of up to 30%, while strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data and the limited explicitly labelled data that is available. Overall, the obtained results show promise for the development of a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators in identifying inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
Affiliation(s)
- Mohamed Reda Bouadjenek
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.
| | - Karin Verspoor
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.
| | - Justin Zobel
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.
| |
|
20
|
Chibucos MC, Siegele DA, Hu JC, Giglio M. The Evidence and Conclusion Ontology (ECO): Supporting GO Annotations. Methods Mol Biol 2017; 1446:245-259. [PMID: 27812948 DOI: 10.1007/978-1-4939-3743-1_18] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/04/2022]
Abstract
The Evidence and Conclusion Ontology (ECO) is a community resource for describing the various types of evidence that are generated during the course of a scientific study and which are typically used to support assertions made by researchers. ECO describes multiple evidence types, including evidence resulting from experimental (i.e., wet lab) techniques, evidence arising from computational methods, statements made by authors (whether or not supported by evidence), and inferences drawn by researchers curating the literature. In addition to summarizing the evidence that supports a particular assertion, ECO also offers a means to document whether a computer or a human performed the process of making the annotation. Incorporating ECO into an annotation system makes it possible to leverage the structure of the ontology such that associated data can be grouped hierarchically, users can select data associated with particular evidence types, and quality control pipelines can be optimized. Today, over 30 resources, including the Gene Ontology, use the Evidence and Conclusion Ontology to represent both evidence and how annotations are made.
Affiliation(s)
- Marcus C Chibucos
- Department of Microbiology and Immunology, Institute for Genome Sciences, University of Maryland School of Medicine, 801 W. Baltimore Street, Baltimore, MD, 21201, USA.
| | - Deborah A Siegele
- Department of Biology, Texas A&M University, College Station, TX, 77843, USA
| | - James C Hu
- Department of Biochemistry and Biophysics, Texas A&M University and Texas AgriLife Research, College Station, TX, 77843, USA
| | - Michelle Giglio
- Department of Medicine, Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, 21201, USA
| |
|
21
|
Abstract
Two avenues to understanding gene function are complementary and often overlapping: experimental work and computational prediction. While experimental annotation generally produces high-quality annotations, it is low throughput. Conversely, computational annotations have broad coverage, but the quality of annotations may be variable, and therefore evaluating the quality of computational annotations is a critical concern. In this chapter, we provide an overview of strategies to evaluate the quality of computational annotations. First, we discuss why evaluating quality in this setting is not trivial. We highlight the various issues that threaten to bias the evaluation of computational annotations, most of which stem from the incompleteness of biological databases. Second, we discuss solutions that address these issues, for example, targeted selection of new experimental annotations and leveraging the existing experimental annotations.
Affiliation(s)
- Nives Škunca
- Department of Computer Science, ETH Zurich, Universitätstrasse 19, 8092, Zurich, Switzerland.
- SIB Swiss Institute of Bioinformatics, Universitätstr. 19, 8092, Zurich, Switzerland.
- University College London, Street Gower St, WC1E 6BT, London, UK.
| | | | - Martin Steffen
- Department of Biomedical Engineering, Boston University, Boston, MA, USA
- Department of Pathology and Laboratory Medicine, Boston University School of Medicine, Boston, MA, USA
| |
|
22
|
Abstract
In this chapter, we explain how text mining can support the curation of molecular biology databases dealing with protein functions. We also show how curated data can play a disruptive role in the development of text mining methods. We review a decade of efforts to improve the automatic assignment of Gene Ontology (GO) descriptors, the reference ontology for the characterization of genes and gene products. To illustrate the high potential of this approach, we compare the performances of an automatic text categorizer and show a large improvement of +225% in both precision and recall on benchmarked data. We argue that automatic text categorization functions can ultimately be embedded into a Question-Answering (QA) system to answer questions related to protein functions. Because GO descriptors can be relatively long and specific, traditional QA systems cannot answer such questions. A new type of QA system, so-called Deep QA, which uses machine learning methods trained with curated contents, is thus emerging. Finally, future advances of text mining instruments are directly dependent on the availability of high-quality annotated contents at every curation step. Database workflows must start explicitly recording all the data they curate and ideally also some of the data they do not curate.
Affiliation(s)
- Patrick Ruch
- SIB Text Mining, Swiss Institute of Bioinformatics, Geneva, Switzerland.
- BiTeM Group, HES-SO\HEG Genève, 7 route de Drize, CH-1227, Carouge, Switzerland.
| |
|
23
|
Abstract
The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. Identifying the nearest well-annotated homologue of a protein of interest, predicting where misannotation has occurred, and knowing how confident you can be in the annotations assigned to those proteins are all critical. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene product, we focus here on describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.
Affiliation(s)
- Gemma L Holliday
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA.
- Rebecca Davidson
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA.
- Eyal Akiva
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA.
- Patricia C Babbitt
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA.
|
24
|
Abstract
The Gene Ontology (GO) is a formidable resource, but there are several considerations about it that are essential to understand the data and interpret it correctly. The GO is sufficiently simple that it can be used without deep understanding of its structure or how it is developed, which is both a strength and a weakness. In this chapter, we discuss some common misinterpretations of the ontology and the annotations. A better understanding of the pitfalls and the biases in the GO should help users make the most of this very rich resource. We also review some of the misconceptions and misleading assumptions commonly made about GO, including the effect of data incompleteness, the importance of annotation qualifiers, and the transitivity or lack thereof associated with different ontology relations. We also discuss several biases that can confound aggregate analyses such as gene enrichment analyses. For each of these pitfalls and biases, we suggest remedies and best practices.
Affiliation(s)
- Pascale Gaudet
- CALIPHO group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel-Servet, 1211, Geneva 4, Switzerland; Department of Human Protein Sciences, Faculty of Medicine, University of Geneva, 1211, Geneva, Switzerland.
- Christophe Dessimoz
- Department of Genetics, Evolution & Environment, University College London, Gower St, London, WC1E 6BT, UK; Swiss Institute of Bioinformatics, Biophore Building, 1015, Lausanne, Switzerland; Department of Ecology and Evolution, University of Lausanne, Street Biophore, 1015, Lausanne, Switzerland; Center of Integrative Genomics, University of Lausanne, Biophore, 1015, Lausanne, Switzerland; Department of Computer Science, University College London, Gower St, WC1E 6BT, London, UK.
|
25
|
Gene Co-Expression Network Analysis Unraveling Transcriptional Regulation of High-Altitude Adaptation of Tibetan Pig. PLoS One 2016; 11:e0168161. [PMID: 27936142 PMCID: PMC5148111 DOI: 10.1371/journal.pone.0168161] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 08/30/2016] [Accepted: 11/27/2016] [Indexed: 02/08/2023]
Abstract
Tibetan pigs have survived at high altitude for millennia and they have a suite of adaptive features to tolerate the hypoxic environment. However, the molecular mechanisms underlying the regulation of hypoxia-adaptive phenotypes have not been completely elucidated. In this study, we analyzed differentially expressed genes (DEGs) and biological pathways, and constructed co-expression regulation networks, using whole-transcriptome microarrays from lung tissues of Tibetan and Duroc pigs at both high and low altitude. A total of 3,066 DEGs were identified, and this list was over-represented for ontology terms including metabolic process and catalytic activity, and for KEGG pathways including metabolic pathway and PI3K-Akt signaling pathway. The regulatory (RIF) and phenotypic (PIF) impact factor analysis identified several known and several potentially novel regulators of hypoxia adaptation, including: IKBKG, KLF6 and RBPJ (RIF1), SF3B1, EFEMP1, HOXB6 and ATF6 (RIF2). These findings provide new details of the regulatory architecture of hypoxia-adaptive genes and also insight into which genes may undergo epigenetic modification, for further study of high-altitude adaptation.
|
26
|
Zallot R, Harrison KJ, Kolaczkowski B, de Crécy-Lagard V. Functional Annotations of Paralogs: A Blessing and a Curse. Life (Basel) 2016; 6:life6030039. [PMID: 27618105 PMCID: PMC5041015 DOI: 10.3390/life6030039] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 07/01/2016] [Revised: 08/29/2016] [Accepted: 09/02/2016] [Indexed: 12/15/2022]
Abstract
Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines.
Affiliation(s)
- Rémi Zallot
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
- Katherine J Harrison
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
- Bryan Kolaczkowski
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
- Valérie de Crécy-Lagard
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
|
27
|
The SIB Swiss Institute of Bioinformatics' resources: focus on curated databases. Nucleic Acids Res 2015; 44:D27-37. [PMID: 26615188 PMCID: PMC4702916 DOI: 10.1093/nar/gkv1310] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Received: 11/03/2015] [Accepted: 11/09/2015] [Indexed: 12/15/2022]
Abstract
The SIB Swiss Institute of Bioinformatics (www.isb-sib.ch) provides world-class bioinformatics databases, software tools, services and training to the international life science community in academia and industry. These solutions allow life scientists to turn the exponentially growing amount of data into knowledge. Here, we provide an overview of SIB's resources and competence areas, with a strong focus on curated databases and SIB's most popular and widely used resources. In particular, SIB's Bioinformatics resource portal ExPASy features over 150 resources, including UniProtKB/Swiss-Prot, ENZYME, PROSITE, neXtProt, STRING, UniCarbKB, SugarBindDB, SwissRegulon, EPD, arrayMap, Bgee, SWISS-MODEL Repository, OMA, OrthoDB and other databases, which are briefly described in this article.
|