1
Sharma D, Aslam D, Sharma K, Mittal A, Jayaram B. Exon-intron boundary detection made easy by physicochemical properties of DNA. Mol Omics 2025; 21:226-239. [PMID: 40094442 DOI: 10.1039/d4mo00241e] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2025]
Abstract
Genome architecture in eukaryotes exhibits a high degree of complexity. Amidst the numerous intricacies, the existence of genes as non-continuous stretches composed of exons and introns has garnered significant attention and curiosity among researchers. Accurate identification of exon-intron (EI) boundaries is crucial to decipher the molecular biology governing gene expression and regulation. This includes understanding both normal and aberrant splicing, with aberrant splicing referring to the abnormal processing of pre-mRNA that leads to improper inclusion or exclusion of exons or introns. Such splicing events can result in dysfunctional or non-functional proteins, which are often associated with various diseases. The currently employed frameworks for genomic signals, which aim to identify exons and introns within a genomic segment, need to be revised, primarily because of the lack of a robust consensus sequence and the limitations imposed by training on the available experimental datasets. To tackle these challenges and capitalize on the understanding that DNA exhibits function-dependent local physicochemical variations, we present ChemEXIN, a novel method for predicting EI boundaries. The method utilizes a deep-learning (DL) architecture alongside tri- and tetra-nucleotide-based structural and energy features. ChemEXIN outperforms existing methods with notable accuracy and precision. It achieves an accuracy of 92.5% for humans, 79.9% for mice, and 92.0% for worms, along with precision values of 92.0%, 79.6%, and 91.8% for the same organisms, respectively. These results represent a significant advancement in EI boundary annotations, with potential implications for understanding gene expression, regulation, and cellular functions.
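The idea of describing a sequence by trinucleotide-level properties rather than by its letters can be sketched as follows. This is a minimal illustration only, not the ChemEXIN implementation: the property table here is a stand-in (the GC fraction of each trinucleotide), whereas the abstract refers to structural and energy parameters, and the function names `trinucleotide_profile` and `smooth` are hypothetical.

```python
from itertools import product

# Stand-in property table (assumption for illustration): GC fraction of each
# trinucleotide. A real feature set would hold structural/energy parameters.
TRI_PROPERTY = {
    "".join(t): sum(b in "GC" for b in t) / 3.0
    for t in product("ACGT", repeat=3)
}


def trinucleotide_profile(seq):
    """Return one property value per overlapping trinucleotide step."""
    seq = seq.upper()
    # A sequence of length n has n - 2 overlapping trinucleotide steps.
    # Characters outside ACGT (e.g. N) raise KeyError in this sketch.
    return [TRI_PROPERTY[seq[i:i + 3]] for i in range(len(seq) - 2)]


def smooth(values, window):
    """Moving average over `window` consecutive steps."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    profile = trinucleotide_profile("ACGGTAAGT")
    print(profile)
    print(smooth(profile, 3))
```

A numeric profile of this kind, computed in a window around a candidate boundary, is the sort of input a downstream classifier could consume; the window size and the choice of properties are left open here.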
Affiliation(s)
- Dinesh Sharma
- Supercomputing Facility for Bioinformatics & Computational Biology (SCFBio), Kusuma School of Biological Sciences, Indian Institute of Technology (IIT) Delhi, Hauz Khas, New Delhi 110016, India.
- Danish Aslam
- Supercomputing Facility for Bioinformatics & Computational Biology (SCFBio), Kusuma School of Biological Sciences, Indian Institute of Technology (IIT) Delhi, Hauz Khas, New Delhi 110016, India.
- Kopal Sharma
- Supercomputing Facility for Bioinformatics & Computational Biology (SCFBio), Kusuma School of Biological Sciences, Indian Institute of Technology (IIT) Delhi, Hauz Khas, New Delhi 110016, India.
- Aditya Mittal
- Supercomputing Facility for Bioinformatics & Computational Biology (SCFBio), Kusuma School of Biological Sciences, Indian Institute of Technology (IIT) Delhi, Hauz Khas, New Delhi 110016, India.
- B Jayaram
- Supercomputing Facility for Bioinformatics & Computational Biology (SCFBio), Kusuma School of Biological Sciences, Indian Institute of Technology (IIT) Delhi, Hauz Khas, New Delhi 110016, India.
- Department of Chemistry, Indian Institute of Technology (IIT) Delhi, Hauz Khas, New Delhi 110016, India
2
Si Y, Li H, Li X. Difference Analysis Among Six Kinds of Acceptor Splicing Sequences by the Dispersion Features of 6-mer Subsets in Human Genes. BIOLOGY 2025; 14:206. [PMID: 40001974 PMCID: PMC11853274 DOI: 10.3390/biology14020206] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2025] [Revised: 02/07/2025] [Accepted: 02/13/2025] [Indexed: 02/27/2025]
Abstract
Identifying the sequence composition of different splicing modes is a challenge in current research. This study explored the dispersion distributions of 6-mer subsets in human acceptor splicing regions. Without differentiating acceptor splicing modes, obvious differences were observed across the upstream, core, and downstream regions of splicing sites for 16 dispersion distributions. These findings indicate that the dispersion value of each subset can effectively characterize the compositional properties of splicing sequences. When acceptor splicing sequences were classified into common, constitutive, and alternative modes, the differences in dispersion distributions for most of the XY1 6-mer subsets were significant among the three splicing modes. Furthermore, when the alternative splicing mode was classified into normal, exonic, and intronic sub-modes, the differences in dispersion distributions for most of the XY1 6-mer subsets were also significant among the three splicing sub-modes. Our results indicate that dispersion values of XY1 6-mer subsets not only revealed the sequence composition patterns of acceptor splicing regions but also effectively identified the differences in base correlation among various acceptor splicing modes. Our research provides new insights into revealing and predicting different splicing modes.
Affiliation(s)
- Hong Li
- Inner Mongolia Autonomous Region Key Laboratory of Biophysics and Bioinformatics, School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China; (Y.S.); (X.L.)
3
Madrigal G, Minhas BF, Catchen J. Klumpy: A tool to evaluate the integrity of long-read genome assemblies and illusive sequence motifs. Mol Ecol Resour 2025; 25:e13982. [PMID: 38800997 PMCID: PMC11646305 DOI: 10.1111/1755-0998.13982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Accepted: 05/13/2024] [Indexed: 05/29/2024]
Abstract
The improvement and decreasing costs of third-generation sequencing technologies have widened the scope of biological questions researchers can address with de novo genome assemblies. With the increasing number of reference genomes, validating their integrity with minimal overhead is vital for establishing confident results in their applications. Here, we present Klumpy, a tool for detecting and visualizing both misassembled regions in a genome assembly and genetic elements (e.g. genes) of interest in a set of sequences. By leveraging the initial raw reads in combination with their respective genome assembly, we illustrate Klumpy's utility by investigating antifreeze glycoprotein (afgp) loci across two icefishes, by searching for a reported absent gene in the northern snakehead fish, and by scanning the reference genomes of a mudskipper and bumblebee for misassembled regions. In the two former cases, we were able to provide support for the noncanonical placement of an afgp locus in the icefishes and locate the missing snakehead gene. Furthermore, our genome scans were able to identify an unmappable locus in the mudskipper reference genome and identify a putative repetitive element shared among several species of bees.
Affiliation(s)
- Giovanni Madrigal
- Department of Evolution, Ecology, and Behavior, University of Illinois at Urbana‐Champaign, Urbana, Illinois, USA
- Bushra Fazal Minhas
- Informatics Program, University of Illinois at Urbana‐Champaign, Urbana, Illinois, USA
- Julian Catchen
- Department of Evolution, Ecology, and Behavior, University of Illinois at Urbana‐Champaign, Urbana, Illinois, USA
- Informatics Program, University of Illinois at Urbana‐Champaign, Urbana, Illinois, USA
4
Chen M, Li Y, Zhang K, Liu H. Protein coding regions prediction by fusing DNA shape features. N Biotechnol 2024; 80:21-26. [PMID: 38182076 DOI: 10.1016/j.nbt.2023.12.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2023] [Revised: 11/14/2023] [Accepted: 12/23/2023] [Indexed: 01/07/2024]
Abstract
Exons crucial for coding are often hidden within introns, and the two tend to vary greatly in length, which results in deep learning-based protein coding region prediction methods often performing poorly when applied to more structurally complex biological genomes. DNA shape information also plays a role in revealing the underlying logic of gene expression, yet current methods ignore the influence of DNA shape features when distinguishing coding and non-coding regions. We propose a method to predict protein-coding regions using the CNNS-BRNN model, which incorporates DNA shape features and improves the model's ability to distinguish between intronic and exonic features. We use a fusion coding technique that combines DNA shape features and traditional sequence features. Experiments show that this method outperforms the baseline method in metrics such as AUC and F1 by 2.3% and 5.3%, respectively, and the fusion coding method that introduces DNA shape features has a significant improvement in model performance.
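One simple way to picture a fusion coding of sequence and shape features is per-position concatenation, sketched below. This is an assumption about the general idea, not the encoding used in the cited paper: the function name `fuse_features`, the track name `mgw`, and the numeric shape values in the usage example are all hypothetical and supplied by the caller.

```python
# One-hot encoding for the four bases; unknown bases map to all zeros.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def fuse_features(seq, shape_tracks):
    """Concatenate, per position, the one-hot base code with shape values.

    shape_tracks maps a track name to a list of floats, one per position.
    The tracks are caller-supplied (hypothetical values in this sketch).
    """
    seq = seq.upper()
    for name, track in shape_tracks.items():
        if len(track) != len(seq):
            raise ValueError(f"track {name!r} does not match sequence length")
    names = sorted(shape_tracks)  # fixed column order for reproducibility
    return [
        ONE_HOT.get(base, [0, 0, 0, 0]) + [shape_tracks[n][i] for n in names]
        for i, base in enumerate(seq)
    ]


if __name__ == "__main__":
    # Hypothetical shape values, for illustration only.
    print(fuse_features("AC", {"mgw": [5.0, 4.5]}))
```

Each position then carries four sequence columns followed by one column per shape track, which is the kind of matrix a convolutional or recurrent model can take as input.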
Affiliation(s)
- Miao Chen
- Ocean University of China, College of Computer Science and Technology, Qingdao 266100, China
- Yangyang Li
- Ocean University of China, College of Computer Science and Technology, Qingdao 266100, China
- Kun Zhang
- Ocean University of China, College of Computer Science and Technology, Qingdao 266100, China
- Hao Liu
- Ocean University of China, College of Computer Science and Technology, Qingdao 266100, China.
5
Carter EL, Constantinidou C, Alam MT. Applications of genome-scale metabolic models to investigate microbial metabolic adaptations in response to genetic or environmental perturbations. Brief Bioinform 2023; 25:bbad439. [PMID: 38048080 PMCID: PMC10694557 DOI: 10.1093/bib/bbad439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Revised: 09/21/2023] [Accepted: 11/08/2023] [Indexed: 12/05/2023] Open
Abstract
Environmental perturbations are encountered by microorganisms regularly and require metabolic adaptations to ensure an organism can survive in the new conditions. In order to study the mechanisms of metabolic adaptation in such conditions, various experimental and computational approaches have been used. Genome-scale metabolic models (GEMs) are one of the most powerful approaches to study metabolism, providing a platform to study the systems-level adaptations of an organism to different environments, which could otherwise be infeasible experimentally. In this review, we describe the application of GEMs in understanding how microbes reprogram their metabolic system as a result of environmental variation. In particular, we provide the details of metabolic model reconstruction approaches, various algorithms and tools for model simulation, consequences of genetic perturbations, integration of '-omics' datasets for creating context-specific models and their application in studying metabolic adaptation due to the change in environmental conditions.
Affiliation(s)
- Elena Lucy Carter
- Warwick Medical School, University of Warwick, Coventry, CV4 7HL, UK
6
Sharma D, Sharma K, Mishra A, Siwach P, Mittal A, Jayaram B. Molecular dynamics simulation-based trinucleotide and tetranucleotide level structural and energy characterization of the functional units of genomic DNA. Phys Chem Chem Phys 2023; 25:7323-7337. [PMID: 36825435 DOI: 10.1039/d2cp04820e] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/12/2023]
Abstract
Genomes of most organisms on earth are written in a universal language of life, made up of four units - adenine (A), thymine (T), guanine (G), and cytosine (C), and understanding the way they are put together has been a great challenge to date. Multiple efforts have been made to annotate this wonderfully engineered string of DNA using different methods but they lack a universal character. In this article, we have investigated the structural and energetic profiles of both prokaryotes and eukaryotes by considering two essential genomic sites, viz., the transcription start sites (TSS) and exon-intron boundaries. We have characterized these sites by mapping the structural and energy features of DNA obtained from molecular dynamics simulations, which consider all possible trinucleotide and tetranucleotide steps. For DNA, these physicochemical properties show distinct signatures at the TSS and intron-exon boundaries. Our results firmly convey the idea that DNA uses the same dialect for prokaryotes and eukaryotes and that it is worth going beyond sequence-level analyses to physicochemical space to determine the functional destiny of DNA sequences.
Affiliation(s)
- Dinesh Sharma
- Supercomputing Facility for Bioinformatics & Computational Biology, Kusuma School of Biological Sciences, Indian Institute of Technology, Delhi, India
- Kopal Sharma
- Supercomputing Facility for Bioinformatics & Computational Biology, Kusuma School of Biological Sciences, Indian Institute of Technology, Delhi, India
- Akhilesh Mishra
- Supercomputing Facility for Bioinformatics & Computational Biology, Kusuma School of Biological Sciences, Indian Institute of Technology, Delhi, India
- Priyanka Siwach
- Department of Biotechnology, Chaudhary Devi Lal University, Sirsa, Haryana, India
- Aditya Mittal
- Supercomputing Facility for Bioinformatics & Computational Biology, Kusuma School of Biological Sciences, Indian Institute of Technology, Delhi, India
- B Jayaram
- Supercomputing Facility for Bioinformatics & Computational Biology, Kusuma School of Biological Sciences, Indian Institute of Technology, Delhi, India
- Department of Chemistry, Indian Institute of Technology, Delhi, India
7
Sirupurapu V, Safonova Y, Pevzner P. Gene prediction in the immunoglobulin loci. Genome Res 2022; 32:1152-1169. [PMID: 35545447 DOI: 10.1101/gr.276676.122] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 02/09/2022] [Accepted: 05/06/2022] [Indexed: 11/25/2022]
Abstract
The V(D)J recombination process rearranges the variable (V), diversity (D), and joining (J) genes in the immunoglobulin loci to generate antibody repertoires. Annotation of these loci across various species and predicting the V, D, and J genes (IG genes) is critical for studies of the adaptive immune system. However, since the standard gene finding algorithms are not suitable for predicting IG genes, they have been semi-manually annotated in very few species. We developed the IGDetective algorithm for predicting IG genes and applied it to species with the assembled IG loci. IGDetective generated the first large collection of IG genes across many species and enabled their evolutionary analysis, including the analysis of the "bat IG diversity" hypothesis. This analysis revealed extremely conserved V genes in evolutionarily distant species, indicating that these genes may be subjected to the same selective pressure, e.g., pressure driven by common pathogens. IGDetective also revealed extremely diverged V genes and a new family of evolutionarily conserved V genes in bats with unusual noncanonical cysteines. Moreover, in contrast to all other previously reported antibodies, these cysteines are located within complementarity-determining regions. Since cysteines form disulfide bonds, we hypothesize that these cysteine-rich V genes might generate antibodies with noncanonical conformations and could potentially form a unique part of the immune repertoire in bats. We also analyzed the diversity landscape of the recombination signal sequences and revealed their features that trigger the high/low usage of the IG genes.
8
Stevens L, Moya ND, Tanny RE, Gibson SB, Tracey A, Na H, Chitrakar R, Dekker J, Walhout AJ, Baugh LR, Andersen EC. Chromosome-level reference genomes for two strains of Caenorhabditis briggsae: an improved platform for comparative genomics. Genome Biol Evol 2022; 14:6554914. [PMID: 35348662 PMCID: PMC9011032 DOI: 10.1093/gbe/evac042] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Accepted: 03/21/2022] [Indexed: 11/13/2022] Open
Abstract
The publication of the Caenorhabditis briggsae reference genome in 2003 enabled the first comparative genomics studies between C. elegans and C. briggsae, shedding light on the evolution of genome content and structure in the Caenorhabditis genus. However, despite being widely used, the currently available C. briggsae reference genome is substantially less complete and structurally accurate than the C. elegans reference genome. Here, we used high-coverage Oxford Nanopore long-read and chromosome conformation capture data to generate chromosome-level reference genomes for two C. briggsae strains: QX1410, a new reference strain closely related to the laboratory AF16 strain, and VX34, a highly divergent strain isolated in China. We also sequenced 99 recombinant inbred lines (RILs) generated from reciprocal crosses between QX1410 and VX34 to create a recombination map and identify chromosomal domains. Additionally, we used both short- and long-read RNA sequencing (RNA-seq) data to generate high-quality gene annotations. By comparing these new reference genomes to the current reference, we reveal that hyper-divergent haplotypes cover large portions of the C. briggsae genome, similar to recent reports in C. elegans and C. tropicalis. We also show that the genomes of selfing Caenorhabditis species have undergone more rearrangement than their outcrossing relatives, which has biased previous estimates of rearrangement rate in Caenorhabditis. These new genomes provide a substantially improved platform for comparative genomics in Caenorhabditis and narrow the gap between the quality of genomic resources available for C. elegans and C. briggsae.
Affiliation(s)
- Lewis Stevens
- Department of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA
- Nicolas D. Moya
- Department of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA
- Interdisciplinary Biological Sciences Program, Northwestern University, Evanston, IL 60208, USA
- Robyn E. Tanny
- Department of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA
- Sophia B. Gibson
- Department of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA
- Alan Tracey
- Tree of Life, Wellcome Sanger Institute, Cambridge, UK
- Huimin Na
- Department of Systems Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
- Job Dekker
- Department of Systems Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
- Albertha J.M. Walhout
- Department of Systems Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
- L. Ryan Baugh
- Department of Biology, Duke University, Durham, NC, USA
- Center for Genomic and Computational Biology, Duke University, Durham, NC, USA
- Erik C. Andersen
- Department of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA
9
Stamboulian M, Canderan J, Ye Y. Metaproteomics as a tool for studying the protein landscape of human-gut bacterial species. PLoS Comput Biol 2022; 18:e1009397. [PMID: 35302987 PMCID: PMC8967034 DOI: 10.1371/journal.pcbi.1009397] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 08/16/2021] [Revised: 03/30/2022] [Accepted: 02/15/2022] [Indexed: 12/26/2022] Open
Abstract
Host-microbiome interactions and the microbial community have a broad impact on human health and diseases. Most microbiome-based studies are performed at the genome level based on next-generation sequencing techniques, but metaproteomics is emerging as a powerful technique to study microbiome functional activity by characterizing the complex and dynamic composition of microbial proteins. We conducted a large-scale survey of human gut microbiome metaproteomic data to identify generalist species that are ubiquitously expressed across all samples and specialists that are highly expressed in a small subset of samples associated with a certain phenotype. We were able to utilize the metaproteomic mass spectrometry data to reveal the protein landscapes of these species, which enables the characterization of the expression levels of proteins of different functions and underlying regulatory mechanisms, such as operons. Finally, we were able to recover a large number of open reading frames (ORFs) with spectral support, which were missed by de novo protein-coding gene predictors. We showed that a majority of the rescued ORFs overlapped with de novo predicted protein-coding genes, but on opposite strands or in different frames. Together, these demonstrate applications of metaproteomics for the characterization of important gut bacterial species. Many reference genomes for studying the human gut microbiome are available, but knowledge about how microbial organisms work is limited. Identification of proteins at the individual species or community level provides direct insight into the functionality of microbial organisms. By analyzing more than a thousand metaproteomics datasets, we examined protein landscapes of more than two thousand microbial species that may be important to human health and diseases. This work demonstrated new applications of metaproteomic datasets for studying individual genomes. We made the analysis results available through a website (called GutBac), which we believe will become a resource for studying microbial species important for human health and diseases.
Affiliation(s)
- Moses Stamboulian
- Computer Science Department, Indiana University, Bloomington, Indiana, United States of America
- Jamie Canderan
- Computer Science Department, Indiana University, Bloomington, Indiana, United States of America
- Yuzhen Ye
- Computer Science Department, Indiana University, Bloomington, Indiana, United States of America
10
OHTOMO M, KOBAYASHI T, KATO H. Development of the Dynamic Programming (DP)-based Functional Site Estimation System Using the Motif CodonReduced Representation. JOURNAL OF COMPUTER CHEMISTRY-JAPAN 2022. [DOI: 10.2477/jccj.2022-0005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Affiliation(s)
- Masahiro OHTOMO
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, 611-0011, Japan
- Takashi KOBAYASHI
- Nitto Denko Corporation, 455-6, Hongo, Minogo-cho, Onomichi, Hiroshima, 722-0212, Japan
- Hiroaki KATO
- Department of Distribution and Information Engineering, National Institute of Technology, Hiroshima College, 4272-1, Higashino, Osakikamijima-cho, Toyota-gun, Hiroshima, 725-0231, Japan
11
12
Dimonaco NJ, Aubrey W, Kenobi K, Clare A, Creevey CJ. No one tool to rule them all: prokaryotic gene prediction tool annotations are highly dependent on the organism of study. Bioinformatics 2021; 38:1198-1207. [PMID: 34875010 PMCID: PMC8825762 DOI: 10.1093/bioinformatics/btab827] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 05/26/2021] [Revised: 11/13/2021] [Accepted: 12/02/2021] [Indexed: 01/06/2023] Open
Abstract
MOTIVATION: The biases in CoDing Sequence (CDS) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information as it results in predictions being biased towards existing knowledge. To date, users have lacked a systematic and replicable approach to identify the strengths and weaknesses of any CDS prediction tool and allow them to choose the right tool for their analysis. RESULTS: We present an evaluation framework (ORForise) based on a comprehensive set of 12 primary and 60 secondary metrics that facilitate the assessment of the performance of CDS prediction tools. This makes it possible to identify which tool performs better for specific use-cases. We use this to assess 15 ab initio- and model-based tools representing those most widely used (historically and currently) to generate the knowledge in genomic databases. We find that the performance of any tool is dependent on the genome being analysed, and no individual tool ranked as the most accurate across all genomes or metrics analysed. Even the top-ranked tools produced conflicting gene collections, which could not be resolved by aggregation. The ORForise evaluation framework provides users with a replicable, data-led approach to make informed tool choices for novel genome annotations and for refining historical annotations. AVAILABILITY AND IMPLEMENTATION: Code and datasets for reproduction and customisation are available at https://github.com/NickJD/ORForise. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nicholas J Dimonaco
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth SY23 3PD, UK. To whom correspondence should be addressed.
- Wayne Aubrey
- Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, UK
- Kim Kenobi
- Department of Mathematics, Aberystwyth University, Aberystwyth SY23 3BZ, UK
- Amanda Clare
- Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, UK
13
Proteogenomic Analysis Provides Novel Insight into Genome Annotation and Nitrogen Metabolism in Nostoc sp. PCC 7120. Microbiol Spectr 2021; 9:e0049021. [PMID: 34523988 PMCID: PMC8557916 DOI: 10.1128/spectrum.00490-21] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 02/08/2023] Open
Abstract
Cyanobacteria, capable of oxygenic photosynthesis, play a vital role in nitrogen and carbon cycles. Nostoc sp. PCC 7120 (Nostoc 7120) is a model cyanobacterium commonly used to study cell differentiation and nitrogen metabolism. Although its genome was released in 2002, a high-quality genome annotation remains unavailable for this model cyanobacterium. Therefore, in this study, we performed an in-depth proteogenomic analysis based on high-resolution mass spectrometry (MS) data to refine the genome annotation of Nostoc 7120. We unambiguously identified 5,519 predicted protein-coding genes and revealed 26 novel genes, 75 revised genes, and 27 different kinds of posttranslational modifications in Nostoc 7120. A subset of these novel proteins were further validated at both the mRNA and peptide levels. Functional analysis suggested that many newly annotated proteins may participate in nitrogen or cadmium/mercury metabolism in Nostoc 7120. Moreover, we constructed an updated Nostoc 7120 database based on our proteogenomic results and presented examples of how the updated database could be used to improve the annotation of proteomic data. Our study provides the most comprehensive annotation of the Nostoc 7120 genome thus far and will serve as a valuable resource for the study of nitrogen metabolism in Nostoc 7120. IMPORTANCE Cyanobacteria are a large group of prokaryotes capable of oxygenic photosynthesis and play a vital role in nitrogen and carbon cycles on Earth. Nostoc 7120 is a commonly used model cyanobacterium for studying cell differentiation and nitrogen metabolism. In this study, we presented the first comprehensive draft map of the Nostoc 7120 proteome and a wide range of posttranslational modifications. In addition, we constructed an updated database of Nostoc 7120 based on our proteogenomic results and presented examples of how the updated database could be used for system-level studies of Nostoc 7120. 
Our study provides the most comprehensive annotation of the Nostoc 7120 genome and a valuable resource for the study of nitrogen metabolism in this model cyanobacterium.
14
Mathé C, Dunand C. Automatic Prediction and Annotation: There Are Strong Biases for Multigenic Families. Front Genet 2021; 12:697477. [PMID: 34603370 PMCID: PMC8481831 DOI: 10.3389/fgene.2021.697477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2021] [Accepted: 08/05/2021] [Indexed: 11/16/2022] Open
Affiliation(s)
- Catherine Mathé
- Laboratoire de Recherche en Sciences Végétales, Université de Toulouse, CNRS, UPS, Toulouse INP, Auzeville-Tolosane, France
- Christophe Dunand
- Laboratoire de Recherche en Sciences Végétales, Université de Toulouse, CNRS, UPS, Toulouse INP, Auzeville-Tolosane, France
15
Basharat Z, Jahanzaib M, Rahman N. Therapeutic target identification via differential genome analysis of antibiotic resistant Shigella sonnei and inhibitor evaluation against a selected drug target. INFECTION GENETICS AND EVOLUTION 2021; 94:105004. [PMID: 34280580 DOI: 10.1016/j.meegid.2021.105004] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 03/16/2021] [Revised: 07/11/2021] [Accepted: 07/14/2021] [Indexed: 12/29/2022]
Abstract
Shigella sonnei has been implicated in bloody diarrhea (accompanied by abdominal pain and fever) and is an emerging pathogen of concern, especially in developing countries. The major means of transmission is the fecal-oral route, while sexual transmission has also been reported. In children, the impact might be stunted growth due to life-threatening illness. Resistance has been reported in this species for several types of antibiotics. In this study, we retrieved the antibiotic-resistant labeled whole genome sequences of the species from the PATRIC database and performed a pan-genome analysis to filter out core genes. Antibiotic resistance was studied in the core, accessory and unique genome. Core genes were utilized as seed substance for essentiality analysis and drug candidate assignment. Product of the gene aroG, i.e. chorismate biosynthetic process 3-deoxy-7-phosphoheptulonate synthase enzyme, responsible for aromatic amino acid family biosynthetic process, was taken for further downstream processing. Natural product libraries of flavonoids (n = 178), ZINC database derived inhibitor compounds of the 3-deoxy-7-phosphoheptulonate synthase enzyme (n = 112), and streptomycin compounds (n = 737) were docked to find out potent inhibitors, followed by dynamics simulation of 50 ns each for top compounds. Physicochemical and ADMET profiling of the top compounds was done to analyze their safety for consumption. We propose that the top compounds: Phytoene from Streptomycin library and ZINC000036444158 (synonym: 1,16-bis[(dihydroxyphosphinyl)oxy]hexadecane) from 3-deoxy-7-phosphoheptulonate synthase inhibitor library of ZINC database (and used as a control in this study) should be tested in vitro against Shigella sonnei, to fully determine their efficacy. This could add to the drying pipeline of potent drug molecules against emerging pathogens.
Affiliation(s)
- Zarrin Basharat
- Jamil-ur-Rahman Center for Genome Research, Dr. Panjwani Centre for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, 75270 Karachi, Pakistan
- Muhammad Jahanzaib
- Jamil-ur-Rahman Center for Genome Research, Dr. Panjwani Centre for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, 75270 Karachi, Pakistan
- Noor Rahman
- H.E.J. Research Institute of Chemistry, International Center for Chemical and Biological Sciences, University of Karachi, 75270 Karachi, Pakistan
|
16
|
Andronis CE, Hane JK, Bringans S, Hardy GESJ, Jacques S, Lipscombe R, Tan KC. Gene Validation and Remodelling Using Proteogenomics of Phytophthora cinnamomi, the Causal Agent of Dieback. Front Microbiol 2021; 12:665396. [PMID: 34394023 PMCID: PMC8360494 DOI: 10.3389/fmicb.2021.665396] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/08/2021] [Accepted: 05/18/2021] [Indexed: 11/13/2022] Open
Abstract
Phytophthora cinnamomi is a pathogenic oomycete that causes plant dieback disease across a range of natural ecosystems and in many agriculturally important crops on a global scale. An annotated draft genome sequence is publicly available (JGI Mycocosm) and suggests 26,131 gene models. In this study, soluble mycelial, extracellular (secretome), and zoospore proteins of P. cinnamomi were exploited to refine the genome by correcting gene annotations and discovering novel genes. By implementing the diverse set of sub-proteomes into a generated proteogenomics pipeline, we were able to improve the P. cinnamomi genome annotation. Liquid chromatography mass spectrometry was used to obtain high confidence peptides with spectral matching to both the annotated genome and a generated 6-frame translation. Two thousand seven hundred sixty-four annotations from the draft genome were confirmed by spectral matching. Using a proteogenomic pipeline, mass spectra were used to edit the P. cinnamomi genome and allowed identification of 23 new gene models and 60 edited gene features using high confidence peptides obtained by mass spectrometry, suggesting a rate of incorrect annotations of 3% of the detectable proteome. The novel features were further validated by total peptide support, alongside functional analysis including the use of Gene Ontology and functional domain identification. We demonstrated the use of spectral data in combination with our proteogenomics pipeline can be used to improve the genome annotation of important plant diseases and identify missed genes. This study presents the first use of spectral data to edit and manually annotate an oomycete pathogen.
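The 6-frame translation mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard genetic code (NCBI translation table 1) and is not the authors' pipeline; the function names are hypothetical and chosen only for illustration.

```python
# Minimal six-frame translation sketch (assumes the standard genetic code).
# Names such as six_frame_translation are illustrative, not from the paper.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Map frame labels (+1..+3, -1..-3) to translated peptide strings."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames
```

In a proteogenomics setting, peptides identified by mass spectrometry would then be searched against all six translated frames; that matching step is outside this sketch.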
Affiliation(s)
- Christina E Andronis
- Centre for Crop and Disease Management, Curtin University, Bentley, WA, Australia; Proteomics International, Nedlands, WA, Australia
- James K Hane
- Centre for Crop and Disease Management, Curtin University, Bentley, WA, Australia; Faculty of Science and Engineering, Curtin Institute for Computation, Curtin University, Perth, WA, Australia
- Giles E S J Hardy
- Centre for Phytophthora Science and Management, Murdoch University, Murdoch, WA, Australia
- Silke Jacques
- Centre for Crop and Disease Management, Curtin University, Bentley, WA, Australia
- Kar-Chun Tan
- Centre for Crop and Disease Management, Curtin University, Bentley, WA, Australia
|
17
|
Donhauser J, Qi W, Bergk-Pinto B, Frey B. High temperatures enhance the microbial genetic potential to recycle C and N from necromass in high-mountain soils. Global Change Biology 2021; 27:1365-1386. [PMID: 33336444 DOI: 10.1111/gcb.15492] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 08/26/2020] [Revised: 11/28/2020] [Accepted: 12/08/2020] [Indexed: 06/12/2023]
Abstract
Climate change is strongly affecting high-mountain soils and warming in particular is associated with pronounced changes in microbe-mediated C and N cycling, affecting plant-soil interactions and greenhouse gas balances and therefore feedbacks to global warming. We used shotgun metagenomics to assess changes in microbial community structures, as well as changes in microbial C- and N-cycling potential and stress response genes and we linked these data with changes in soil C and N pools and temperature-dependent measurements of bacterial growth rates. We did so by incubating high-elevation soil from the Swiss Alps at 4°C, 15°C, 25°C, or 35°C for 1 month. We found no shift with increasing temperature in the C-substrate-degrader community towards taxa more capable of degrading recalcitrant organic matter. Conversely, at 35°C, we found an increase in genes associated with the degradation and modification of microbial cell walls, together with high bacterial growth rates. Together, these findings suggest that the rapidly growing high-temperature community is fueled by necromass from heat-sensitive taxa. This interpretation was further supported by a shift in the microbial N-cycling potential towards N mineralization and assimilation under higher temperatures, along with reduced potential for conversions among inorganic N forms. Microbial stress-response genes reacted inconsistently to increasing temperature, suggesting that the high-temperature community was not severely stressed by these conditions. Rather, soil microbes were able to acclimate by changing the thermal properties of membranes and cell walls as indicated by an increase in genes involved in membrane and cell wall modifications as well as a shift in the optimum temperature for bacterial growth towards the treatment temperature. 
Overall, our results suggest that high temperatures, as they may occur with heat waves under global warming, promote a highly active microbial community capable of rapid mineralization of microbial necromass, which may transiently amplify warming effects.
Affiliation(s)
- Jonathan Donhauser
- Rhizosphere Processes Group, Swiss Federal Institute for Forest, Snow and Landscape Research WSL, Birmensdorf, Switzerland
- Weihong Qi
- Functional Genomics Center Zurich, ETH Zurich/University of Zurich, Zurich, Switzerland
- Benoît Bergk-Pinto
- Environmental Microbial Genomics, Laboratoire Ampère, École Centrale de Lyon, Université de Lyon, Ecully, France
- Beat Frey
- Rhizosphere Processes Group, Swiss Federal Institute for Forest, Snow and Landscape Research WSL, Birmensdorf, Switzerland
|
18
|
Rubio A, Pérez-Pulido AJ. Protein-Coding Genes of Helicobacter pylori Predominantly Present Purifying Selection though Many Membrane Proteins Suffer from Selection Pressure: A Proposal to Analyze Bacterial Pangenomes. Genes (Basel) 2021; 12:genes12030377. [PMID: 33800844 PMCID: PMC7998743 DOI: 10.3390/genes12030377] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/02/2021] [Revised: 03/01/2021] [Accepted: 03/02/2021] [Indexed: 12/14/2022] Open
Abstract
The current availability of complete genome sequences has shown that bacterial genomes can bear genes not present in the genomes of all the strains of a given species. The genes shared by all the strains comprise the core of the species, but the pangenome can be much larger and usually includes genes appearing in only one strain. Once the pangenome of a species is estimated, other studies can be undertaken to generate new knowledge, such as the study of the evolutionary selection for protein-coding genes. Most of the genes of a pangenome are expected to be subject to purifying selection that assures the conservation of function, especially those in the core group. However, some genes can be subject to selection pressure, such as genes involved in virulence that need to escape the host immune system, which is more common in the accessory group of the pangenome. We analyzed 180 strains of Helicobacter pylori, a bacterium that colonizes the gastric mucosa of half the world population and presents a low number of genes (around 1500 in a strain and 3000 in the pangenome). After the estimation of the pangenome, the evolutionary selection for each gene was calculated, and we found that 85% of them are subject to purifying selection and the remaining genes present some degree of selection pressure. As expected, the latter group is enriched with genes encoding membrane proteins putatively involved in interaction with host tissues. In addition, this group also presents a high number of uncharacterized genes and genes encoding putative spurious proteins, which suggests that they could be false positives from the gene finders used for identifying them. All these results suggest that this kind of analysis can be useful to validate gene predictions and functionally characterize proteins in complete genomes.
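The core/pangenome distinction described in this abstract can be sketched with plain set arithmetic. This is only an illustration on presence/absence data; the function name and the input format (strain name mapped to a set of gene-family identifiers) are assumptions, and real pangenome tools first cluster genes into families by sequence similarity.

```python
def partition_pangenome(strain_genes):
    """Split gene families into core, accessory and unique sets.

    strain_genes: dict mapping a strain name to the set of gene-family
    identifiers found in that strain (hypothetical input format).
    """
    n_strains = len(strain_genes)
    counts = {}
    for genes in strain_genes.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    # Core: present in every strain. Unique: present in exactly one strain.
    core = {g for g, c in counts.items() if c == n_strains}
    unique = {g for g, c in counts.items() if c == 1 and n_strains > 1}
    accessory = set(counts) - core - unique
    return core, accessory, unique
```

The pangenome is the union of the three returned sets, so its size can exceed the gene count of any single strain, as the abstract notes for H. pylori.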
|
19
|
Silva R, Padovani K, Góes F, Alves R. geneRFinder: gene finding in distinct metagenomic data complexities. BMC Bioinformatics 2021; 22:87. [PMID: 33632132 PMCID: PMC7905635 DOI: 10.1186/s12859-021-03997-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/02/2020] [Accepted: 02/04/2021] [Indexed: 12/01/2022] Open
Abstract
Background: Microbes perform a fundamental economic, social, and environmental role in our society. Metagenomics makes it possible to investigate microbes in their natural environments (the complex communities) and their interactions. The way they act is usually estimated by looking at the functions they play in those environments and their responsibility is measured by their genes. The advances of next-generation sequencing technology have facilitated metagenomics research; however, they also create a heavy computational burden. Large and complex biological datasets are available as never before. There are many gene predictors available that can aid the gene annotation process, though they do not appropriately handle metagenomic data complexities. There is no standard metagenomic benchmark data for gene prediction. Thus, gene predictors may inflate their results by obfuscating low false discovery rates. Results: We introduce geneRFinder, an ML-based gene predictor able to outperform state-of-the-art gene prediction tools across this benchmark by using only one pre-trained Random Forest model. Average prediction rates of geneRFinder differed in percentage terms by 54% and 64%, respectively, against Prodigal and FragGeneScan while handling high complexity metagenomes. The specificity rate of geneRFinder had the largest distance against FragGeneScan, 79 percentage points, and 66 more than Prodigal. According to McNemar's test, all percentage differences between the predictors' performances are statistically significant for all datasets with a 99% confidence interval.
Conclusions: We provide geneRFinder, an approach for gene prediction in distinct metagenomic complexities, available at gitlab.com/r.lorenna/generfinder and https://osf.io/w2yd6/, and we also provide a novel, comprehensive benchmark dataset for gene prediction (based on The Critical Assessment of Metagenome Interpretation (CAMI) challenge and containing labeled data from gene regions), available at https://sourceforge.net/p/generfinder-benchmark.
Affiliation(s)
- Raíssa Silva
- Vale Institute of Technology, Boaventura da Silva, 955, Belém, BR, 66055-090, Brazil; PPGCC, Federal University of Pará, Augusto Corrêa, 01, Belém, BR, 66075-110, Brazil
- Kleber Padovani
- PPGCC, Federal University of Pará, Augusto Corrêa, 01, Belém, BR, 66075-110, Brazil
- Fabiana Góes
- ICMC, University of São Paulo, Trab. São Carlense, 400, São Carlos, BR, 13566-590, Brazil
- Ronnie Alves
- Vale Institute of Technology, Boaventura da Silva, 955, Belém, BR, 66055-090, Brazil; PPGCC, Federal University of Pará, Augusto Corrêa, 01, Belém, BR, 66075-110, Brazil
|
20
|
de Abreu VAC, Perdigão J, Almeida S. Metagenomic Approaches to Analyze Antimicrobial Resistance: An Overview. Front Genet 2021; 11:575592. [PMID: 33537056 PMCID: PMC7848172 DOI: 10.3389/fgene.2020.575592] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 06/23/2020] [Accepted: 12/04/2020] [Indexed: 11/13/2022] Open
Abstract
Antimicrobial resistance is a major global public health problem, which develops when pathogens acquire antimicrobial resistance genes (ARGs), primarily through genetic recombination between commensal and pathogenic microbes. The resistome is the collection of all ARGs. In microorganisms, the primary method of ARG acquisition is horizontal gene transfer (HGT). Thus, understanding and identifying HGTs can provide insight into the mechanisms of antimicrobial resistance transmission and dissemination. The use of high-throughput sequencing technologies has made the analysis of ARG sequences feasible and accessible. In particular, the metagenomic approach has facilitated the identification of community-based antimicrobial resistance. This approach is useful, as it allows access to the genomic data in an environmental sample without the need to isolate and culture microorganisms prior to analysis. Here, we aimed to reflect on the challenges of analyzing metagenomic data in the three main approaches for studying antimicrobial resistance: (i) analysis of microbial diversity, (ii) functional gene analysis, and (iii) searching the most complete and pertinent resistome databases.
Affiliation(s)
- Vinicius A C de Abreu
- Laboratório de Bioinformática e Computação de Alto Desempenho (LaBioCad), Faculdade de Computação (FACOMP), Universidade Federal do Pará, Belém, Brazil
- José Perdigão
- Laboratório de Bioinformática e Computação de Alto Desempenho (LaBioCad), Faculdade de Computação (FACOMP), Universidade Federal do Pará, Belém, Brazil
- Sintia Almeida
- Central de Genômica e Bioinformática (CeGenBio), Núcleo de Pesquisa e Desenvolvimento de Medicamentos (NPDM), Departamento de Fisiologia e Farmacologia, Universidade Federal do Ceará, Fortaleza, Brazil
|
21
|
Soft Computing in Bioinformatics. Adv Bioinformatics 2021. [DOI: 10.1007/978-981-33-6191-1_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022] Open
|
22
|
Duarte CM, Ngugi DK, Alam I, Pearman J, Kamau A, Eguiluz VM, Gojobori T, Acinas SG, Gasol JM, Bajic V, Irigoien X. Sequencing effort dictates gene discovery in marine microbial metagenomes. Environ Microbiol 2020; 22:4589-4603. [PMID: 32743860 PMCID: PMC7756799 DOI: 10.1111/1462-2920.15182] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 03/14/2020] [Revised: 07/02/2020] [Accepted: 07/31/2020] [Indexed: 01/09/2023]
Abstract
Massive metagenomic sequencing combined with gene prediction methods were previously used to compile the gene catalogue of the ocean and host-associated microbes. Global expeditions conducted over the past 15 years have sampled the ocean to build a catalogue of genes from pelagic microbes. Here we undertook a large sequencing effort of a perturbed Red Sea plankton community to uncover that the rate of gene discovery increases continuously with sequencing effort, with no indication that the retrieved 2.83 million non-redundant (complete) genes predicted from the experiment represented a nearly complete inventory of the genes present in the sampled community (i.e., no evidence of saturation). The underlying reason is the Pareto-like distribution of the abundance of genes in the plankton community, resulting in a very long tail of millions of genes present at remarkably low abundances, which can only be retrieved through massive sequencing. Microbial metagenomic projects retrieve a variable number of unique genes per Tera base-pair (Tbp), with a median value of 14.7 million unique genes per Tbp sequenced across projects. The increase in the rate of gene discovery in microbial metagenomes with sequencing effort implies that there is ample room for new gene discovery in further ocean and holobiont sequencing studies.
Affiliation(s)
- Carlos M. Duarte
- King Abdullah University of Science and Technology (KAUST), Red Sea Research Centre (RSRC), Thuwal 23955-6900, Saudi Arabia
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal 23955-6900, Saudi Arabia
- David K. Ngugi
- King Abdullah University of Science and Technology (KAUST), Red Sea Research Centre (RSRC), Thuwal 23955-6900, Saudi Arabia
- Leibniz Institute DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Inhoffenstrasse 7B, D-38124 Braunschweig, Germany
- Intikhab Alam
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal 23955-6900, Saudi Arabia
- John Pearman
- King Abdullah University of Science and Technology (KAUST), Red Sea Research Centre (RSRC), Thuwal 23955-6900, Saudi Arabia
- Allan Kamau
- Leibniz Institute DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Inhoffenstrasse 7B, D-38124 Braunschweig, Germany
- Victor M. Eguiluz
- Instituto de Física Interdisciplinar y Sistemas Complejos IFISC (CSIC-UIB), E07122 Palma de Mallorca, Spain
- Takashi Gojobori
- Leibniz Institute DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Inhoffenstrasse 7B, D-38124 Braunschweig, Germany
- Josep M. Gasol
- Institut de Ciències del Mar, CSIC, Barcelona, Spain
- Centre for Marine Ecosystems Research, Edith Cowan University, Joondalup, Australia
- Vladimir Bajic
- Leibniz Institute DSMZ - German Collection of Microorganisms and Cell Cultures GmbH, Inhoffenstrasse 7B, D-38124 Braunschweig, Germany
- Xabier Irigoien
- King Abdullah University of Science and Technology (KAUST), Red Sea Research Centre (RSRC), Thuwal 23955-6900, Saudi Arabia
- AZTI - Marine Research, Herrera Kaia, Portualdea z/g, Pasaia (Gipuzkoa) 20110, Spain
|
23
|
Ejigu GF, Jung J. Review on the Computational Genome Annotation of Sequences Obtained by Next-Generation Sequencing. Biology 2020; 9:E295. [PMID: 32962098 PMCID: PMC7565776 DOI: 10.3390/biology9090295] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 08/21/2020] [Revised: 09/13/2020] [Accepted: 09/16/2020] [Indexed: 12/16/2022]
Abstract
Next-Generation Sequencing (NGS) has made it easier to obtain genome-wide sequence data and it has shifted the research focus into genome annotation. The challenging tasks involved in annotation rely on the currently available tools and techniques to decode the information contained in nucleotide sequences. This information will improve our understanding of general aspects of life and evolution and improve our ability to diagnose genetic disorders. Here, we present a summary of both structural and functional annotations, as well as the associated comparative annotation tools and pipelines. We highlight visualization tools that immensely aid the annotation process and the contributions of the scientific community to the annotation. Further, we discuss quality-control practices and the need for re-annotation, and highlight the future of annotation.
Affiliation(s)
- Jaehee Jung
- Department of Information and Communication Engineering, Myongji University, Yongin-si 17058, Gyeonggi-do, Korea
|
24
|
Liu X, Guo Z, He T, Ren M. Prediction and analysis of prokaryotic promoters based on sequence features. Biosystems 2020; 197:104218. [PMID: 32755610 DOI: 10.1016/j.biosystems.2020.104218] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/25/2020] [Revised: 07/03/2020] [Accepted: 07/21/2020] [Indexed: 10/23/2022]
Abstract
Promoter recognition is an important part of functional genomic annotation but a difficult problem. Many studies have been carried out to address this issue. However, they still cannot meet application needs. Most of the methods exhibit specificity, and the objects analyzed are relatively simple, especially for prokaryotes. Hence, more research on prokaryotic promoters is lacking. In this study, the similarity between gene expression and the transmission of information inspired us to analyze promoter sequences by calculating the information content of the sequences and the correlation between sequences in the subregion. We also calculated other sequence features as supplements, such as the Hurst exponent, GC content, and sequence bending property. Then, we employed an artificial neural network to build a classifier and applied it to identify promoters in three organisms, Escherichia coli, Bacillus subtilis, and Pseudomonas aeruginosa. The experiments on the benchmark test set indicate that our method has good capability to distinguish promoters from randomly selected nonpromoters. The maximal AUC for the classifier is 0.90, and the minimal AUC score is 0.80. Additionally, cross-species experiments were conducted. The AUC of the cross-experiment on three organisms yielded 0.8, suggesting that our approach has better generalization ability, which is conducive to revealing the more common characteristics of prokaryotic promoters.
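Two of the sequence features named in this abstract, information content and GC content, can be sketched as follows. The abstract does not give the authors' exact formulas, so this example assumes Shannon entropy over k-mer frequencies as the information measure; the function names and the choice of k are illustrative only.

```python
import math
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def shannon_entropy(seq, k=1):
    """Shannon entropy in bits of the k-mer distribution of seq.

    Assumption: entropy of overlapping k-mers stands in for the
    'information content' feature; the paper may define it differently.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    total = len(kmers)
    if total == 0:
        return 0.0
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Feature values like these would typically be concatenated into a vector per candidate region and passed to a classifier such as the artificial neural network the abstract describes.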
Affiliation(s)
- Xiao Liu
- School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
- Zhirui Guo
- School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
- Ting He
- School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
- Meixiang Ren
- School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
|
25
|
Raman Kumar M, Vaegae NK. A new numerical approach for DNA representation using modified Gabor wavelet transform for the identification of protein coding regions. Biocybern Biomed Eng 2020. [DOI: 10.1016/j.bbe.2020.03.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/24/2022]
|
26
|
Dehghannasiri R, Szabo L, Salzman J. Ambiguous splice sites distinguish circRNA and linear splicing in the human genome. Bioinformatics 2020; 35:1263-1268. [PMID: 30192918 DOI: 10.1093/bioinformatics/bty785] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 03/26/2018] [Revised: 08/04/2018] [Accepted: 09/04/2018] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: Identification of splice sites is critical to gene annotation and to determine which sequences control circRNA biogenesis. Full-length RNA transcripts could in principle complete annotations of introns and exons in genomes without external ontologies, i.e., ab initio. However, whether it is possible to reconstruct genomic positions where splicing occurs from full-length transcripts, even if sampled in the absence of noise, depends on the genome sequence composition. If it is not, there exist provable limits on the use of RNA-Seq to define splice locations (linear or circular) in the genome. RESULTS: We provide a formal definition of splice site ambiguity due to the genomic sequence by introducing equivalent junction, which is the set of local genomic positions resulting in the same RNA sequence when joined through RNA splicing. We show that equivalent junctions are prevalent in diverse eukaryotic genomes and occur in 88.64% and 78.64% of annotated human splice sites in linear and circRNA junctions, respectively. The observed fractions of equivalent junctions and the frequency of many individual motifs are statistically significant when compared against the null distribution computed via simulation or closed-form. The frequency of equivalent junctions establishes a fundamental limit on the possibility of ab initio reconstruction of RNA transcripts without appealing to the ontology of "GT-AG" boundaries defining introns. Said differently, completely ab initio is impossible in the vast majority of splice sites in annotated circRNAs and linear transcripts. AVAILABILITY AND IMPLEMENTATION: Two python scripts generating an equivalent junction sequence per junction are available at: https://github.com/salzmanlab/Equivalent-Junctions. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
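The idea of an equivalent junction, as defined in this abstract, can be shown with a small sketch for the linear case. It is not the authors' published script; the function name, the coordinate convention (upstream exon ends at `genome[:donor]`, downstream exon starts at `genome[acceptor:]`), and the `max_shift` parameter are assumptions made for illustration.

```python
def equivalent_junctions(genome, donor, acceptor, max_shift=10):
    """List (donor, acceptor) pairs that yield the same spliced sequence.

    Both coordinates are shifted together by k bases, since any other
    pairing would change the length of the spliced product. Whole-string
    comparison keeps the sketch simple; comparing only the bases that
    flank the two splice sites would be enough in practice.
    """
    reference = genome[:donor] + genome[acceptor:]
    matches = []
    for k in range(-max_shift, max_shift + 1):
        d, a = donor + k, acceptor + k
        if d < 0 or a > len(genome):
            continue  # shifted junction falls outside the sequence
        if genome[:d] + genome[a:] == reference:
            matches.append((d, a))
    return matches
```

For the toy sequence `CCAGTAAAGGTT` with the intron `GTAAAG` at positions 3 to 9, the call `equivalent_junctions("CCAGTAAAGGTT", 3, 9)` returns three pairs, because the two bases after the donor (`GT`) repeat after the acceptor. Transcript sequence alone cannot tell these three junctions apart, which is the ambiguity the abstract describes.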
Affiliation(s)
- Linda Szabo
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Julia Salzman
- Department of Biochemistry, Stanford University, Stanford, CA, USA; Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
|
27
|
Al-Ajlan A, El Allali A. CNN-MGP: Convolutional Neural Networks for Metagenomics Gene Prediction. Interdiscip Sci 2019; 11:628-635. [PMID: 30588558 PMCID: PMC6841655 DOI: 10.1007/s12539-018-0313-4] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 02/23/2018] [Revised: 11/22/2018] [Accepted: 12/07/2018] [Indexed: 12/30/2022]
Abstract
Accurate gene prediction in metagenomics fragments is a computationally challenging task due to the short-read length, incomplete, and fragmented nature of the data. Most gene-prediction programs are based on extracting a large number of features and then applying statistical approaches or supervised classification approaches to predict genes. In our study, we introduce a convolutional neural network for metagenomics gene prediction (CNN-MGP) program that predicts genes in metagenomics fragments directly from raw DNA sequences, without the need for manual feature extraction and feature selection stages. CNN-MGP is able to learn the characteristics of coding and non-coding regions and distinguish coding and non-coding open reading frames (ORFs). We train 10 CNN models on 10 mutually exclusive datasets based on pre-defined GC content ranges. We extract ORFs from each fragment; then, the ORFs are encoded numerically and inputted into an appropriate CNN model based on the fragment-GC content. The output from the CNN is the probability that an ORF will encode a gene. Finally, a greedy algorithm is used to select the final gene list. Overall, CNN-MGP is effective and achieves a 91% accuracy on testing dataset. CNN-MGP shows the ability of deep learning to predict genes in metagenomics fragments, and it achieves an accuracy higher than or comparable to state-of-the-art gene-prediction programs that use pre-defined features.
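The preprocessing steps named in this abstract (ORF extraction, numerical encoding, and routing by fragment GC content) can be sketched as below. This is a simplified illustration, not the CNN-MGP code: it scans only the forward strand for complete ORFs, uses one-hot encoding as an assumed numerical encoding, and assumes ten equal-width GC bins because the abstract does not state the actual ranges.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}


def find_orfs(fragment, min_len=6):
    """Complete forward-strand ORFs as (start, end) pairs, end exclusive.

    Simplification: real metagenomic fragments also contain incomplete
    ORFs and reverse-strand ORFs, which this sketch ignores.
    """
    seq = fragment.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def one_hot(seq):
    """One-hot encode bases; ambiguous bases map to an all-zero vector."""
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in seq.upper()]


def gc_bin(fragment, n_bins=10):
    """Index of the GC-content bin (hypothetical equal-width bins)."""
    seq = fragment.upper()
    if not seq:
        raise ValueError("fragment must not be empty")
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return min(int(gc * n_bins), n_bins - 1)
```

In the workflow the abstract outlines, each encoded ORF would be scored by the model selected via the fragment's GC bin, and a greedy step would then pick the final gene list; both of those stages are left out here.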
Affiliation(s)
- Amani Al-Ajlan
- Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Achraf El Allali
- Computer Science Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
|
28
|
Godini R, Fallahi H. A brief overview of the concepts, methods and computational tools used in phylogenetic tree construction and gene prediction. Meta Gene 2019. [DOI: 10.1016/j.mgene.2019.100586] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 12/21/2022] Open
|
29
|
Romero S, Nastasa A, Chapman A, Kwong WK, Foster LJ. The honey bee gut microbiota: strategies for study and characterization. Insect Molecular Biology 2019; 28:455-472. [PMID: 30652367 DOI: 10.1111/imb.12567] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Indexed: 06/09/2023]
Abstract
Gut microbiota research is an emerging field that improves our understanding of the ecological and functional dynamics of gut environments. The honey bee gut microbiota is a highly rewarding community to study, as honey bees are critical pollinators of many crops for human consumption and produce valuable commodities such as honey and wax. Most significantly, unique characteristics of the Apis mellifera gut habitat make it a valuable model system. This review discusses methods and pipelines used in the study of the gut microbiota of Ap. mellifera and closely related species for four main purposes: identifying microbiota taxonomy, characterizing microbiota genomes (microbiome), characterizing microbiota-microbiota interactions and identifying functions of the microbial community in the gut. The purpose of this contribution is to increase understanding of honey bee gut microbiota, to facilitate bee microbiota and microbiome research in general and to aid design of future experiments in this growing field.
Affiliation(s)
- S Romero
- Michael Smith Laboratories and Department of Biochemistry and Molecular Biology, University of British Columbia, Vancouver, BC, Canada
- A Nastasa
- Michael Smith Laboratories and Department of Biochemistry and Molecular Biology, University of British Columbia, Vancouver, BC, Canada
- A Chapman
- Michael Smith Laboratories and Department of Biochemistry and Molecular Biology, University of British Columbia, Vancouver, BC, Canada
- W K Kwong
- Biodiversity Research Centre, Department of Botany, University of British Columbia, Vancouver, BC, Canada
- L J Foster
- Michael Smith Laboratories and Department of Biochemistry and Molecular Biology, University of British Columbia, Vancouver, BC, Canada
|
30
|
Hua Z, Early MJ. Closing target trimming and CTTdocker programs for discovering hidden superfamily loci in genomes. PLoS One 2019; 14:e0209468. [PMID: 31265455 PMCID: PMC6605638 DOI: 10.1371/journal.pone.0209468] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/05/2018] [Accepted: 06/12/2019] [Indexed: 11/18/2022] Open
Abstract
The contemporary capacity of genome sequence analysis significantly lags behind the rapidly evolving sequencing technologies. Retrieving biologically meaningful information from an ever-increasing amount of genome data would be significantly beneficial for functional genomic studies. For example, the duplication, organization, evolution, and function of superfamily genes are arguably important in many aspects of life. However, the incompleteness of annotations in many sequenced genomes often results in biased conclusions in comparative genomic studies of superfamilies. Here, we present Perl software, called Closing Target Trimming (CTT), for automatically identifying most, if not all, members of a gene family in any sequenced genome on the CentOS 7 platform. To benefit a broader application on other operating systems, we also created a Docker application package, CTTdocker. Our test data on the F-box gene superfamily showed 78.2 and 79% gene finding accuracies in two well annotated plant genomes, Arabidopsis thaliana and rice, respectively. To further demonstrate the effectiveness of this program, we ran it through 18 plant genomes and five non-plant genomes to compare the expansion of the F-box and the BTB superfamilies. The program discovered that on average 12.7 and 9.3% of the total F-box and BTB members, respectively, are new loci in plant genomes, while it only found a small number of new members in vertebrate genomes. Therefore, different evolutionary and regulatory mechanisms of Cullin-RING ubiquitin ligases may be present in plants and animals. We also annotated and compared the Pkinase family members across a wide range of organisms, including 10 fungi, 10 metazoa, 10 vertebrates, and 10 additional plants, which were randomly selected from the Ensembl database. Our CTT annotation recovered on average 14% more loci, including pseudogenes, of the Pkinase superfamily in these 40 genomes, demonstrating its robust replicability and scalability in annotating superfamily members in any genome.
Affiliation(s)
- Zhihua Hua
- Department of Environmental and Plant Biology, Ohio University, Athens, Ohio, United States of America
- Interdisciplinary Program in Molecular and Cellular Biology, Ohio University, Athens, Ohio, United States of America
- * E-mail:
- Matthew J. Early
- Department of Environmental and Plant Biology, Ohio University, Athens, Ohio, United States of America
- Department of Electrical Engineering and Computer Science, Ohio University, Athens, Ohio, United States of America
|
31
|
Madduru D, Ijaq J, Dhar S, Sarkar S, Poondla N, Das PS, Vasquez S, Suravajhala P. Systems Challenges of Hepatic Carcinomas: A Review. J Clin Exp Hepatol 2019; 9:233-244. [PMID: 31024206 PMCID: PMC6477144 DOI: 10.1016/j.jceh.2018.05.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 02/23/2018] [Accepted: 05/10/2018] [Indexed: 12/12/2022] Open
Abstract
Hepatocellular carcinoma (HCC) is prevalent in most developing countries. In the era of systems biology, multi-omics has emerged as an extensive approach for defining the mechanisms underlying disease progression. HCC is a multifactorial disease, and emerging omics approaches have broadened the investigation of liver cirrhosis progression. We present a comprehensive review of the challenges facing multi-omics approaches that aim to identify the immunological, genetic and epidemiological factors associated with HCC.
Affiliation(s)
- Dhatri Madduru
- Department of Biochemistry, Osmania University, Hyderabad 500007, TG, India
- Bioclues.org
- Johny Ijaq
- Department of Genetics and Biotechnology, Osmania University, Hyderabad 500007, TG, India
- Bioclues.org
- Partha S. Das
- Bioclues.org
- Patient MD, Chicago, IL 60640-5710, United States
- Silvia Vasquez
- Bioclues.org
- Instituto Peruano de Energía Nuclear, Avenida Canadá 1470, Lima, Peru
- Prashanth Suravajhala
- Bioclues.org
- Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle 302001, RJ, India
|
32
|
Grollemund V, Pradat PF, Querin G, Delbot F, Le Chat G, Pradat-Peyre JF, Bede P. Machine Learning in Amyotrophic Lateral Sclerosis: Achievements, Pitfalls, and Future Directions. Front Neurosci 2019; 13:135. [PMID: 30872992 PMCID: PMC6403867 DOI: 10.3389/fnins.2019.00135] [Citation(s) in RCA: 89] [Impact Index Per Article: 14.8] [Received: 11/22/2018] [Accepted: 02/06/2019] [Indexed: 12/23/2022] Open
Abstract
Background: Amyotrophic Lateral Sclerosis (ALS) is a relentlessly progressive neurodegenerative condition with limited therapeutic options at present. Survival from symptom onset ranges from 3 to 5 years depending on genetic, demographic, and phenotypic factors. Despite tireless research efforts, the core etiology of the disease remains elusive and drug development efforts are confounded by the lack of accurate monitoring markers. Disease heterogeneity, late-stage recruitment into pharmaceutical trials, and inclusion of phenotypically admixed patient cohorts are some of the key barriers to successful clinical trials. Machine Learning (ML) models and large international data sets offer unprecedented opportunities to appraise candidate diagnostic, monitoring, and prognostic markers. Accurate patient stratification into well-defined prognostic categories is another aspiration of emerging classification and staging systems. Methods: The objective of this paper is the comprehensive, systematic, and critical review of ML initiatives in ALS to date and their potential in research, clinical, and pharmacological applications. The focus of this review is to provide a dual, clinical-mathematical perspective on recent advances and future directions of the field. Another objective of the paper is the frank discussion of the pitfalls and drawbacks of specific models, highlighting the shortcomings of existing studies and to provide methodological recommendations for future study designs. Results: Despite considerable sample size limitations, ML techniques have already been successfully applied to ALS data sets and a number of promising diagnosis models have been proposed. Prognostic models have been tested using core clinical variables, biological, and neuroimaging data. These models also offer patient stratification opportunities for future clinical trials. 
Despite the enormous potential of ML in ALS research, statistical assumptions are often violated, the choice of specific statistical models is seldom justified, and the constraints of ML models are rarely enunciated. Conclusions: From a mathematical perspective, the main barrier to the development of validated diagnostic, prognostic, and monitoring indicators stems from limited sample sizes. The combination of multiple clinical, biofluid, and imaging biomarkers is likely to increase the accuracy of mathematical modeling and contribute to optimized clinical trial designs.
Affiliation(s)
- Vincent Grollemund
- Laboratoire d'Informatique de Paris 6, Sorbonne University, Paris, France
- FRS Consulting, Paris, France
- Pierre-François Pradat
- Laboratoire d'Imagerie Biomédicale, INSERM, CNRS, Sorbonne Université, Paris, France
- APHP, Département de Neurologie, Hôpital Pitié-Salpêtrière, Centre Référent SLA, Paris, France
- Northern Ireland Center for Stratified Medicine, Biomedical Sciences Research Institute Ulster University, C-TRIC, Altnagelvin Hospital, Londonderry, United Kingdom
- Giorgia Querin
- Laboratoire d'Imagerie Biomédicale, INSERM, CNRS, Sorbonne Université, Paris, France
- APHP, Département de Neurologie, Hôpital Pitié-Salpêtrière, Centre Référent SLA, Paris, France
- François Delbot
- Laboratoire d'Informatique de Paris 6, Sorbonne University, Paris, France
- Département de Mathématiques et Informatique, Paris Nanterre University, Nanterre, France
- Jean-François Pradat-Peyre
- Laboratoire d'Informatique de Paris 6, Sorbonne University, Paris, France
- Département de Mathématiques et Informatique, Paris Nanterre University, Nanterre, France
- Modal'X, Paris Nanterre University, Nanterre, France
- Peter Bede
- Laboratoire d'Imagerie Biomédicale, INSERM, CNRS, Sorbonne Université, Paris, France
- APHP, Département de Neurologie, Hôpital Pitié-Salpêtrière, Centre Référent SLA, Paris, France
- Computational Neuroimaging Group, Trinity College, Dublin, Ireland
|
33
|
Abstract
Gene prediction, also known as gene identification, gene finding, gene recognition, or gene discovery, is one of the important problems of molecular biology and is receiving increasing attention due to the advent of large-scale genome sequencing projects. We designed an ab initio model (called ChemGenome) for gene prediction in prokaryotic genomes based on physicochemical characteristics of codons. In this chapter, we present the methodology of the latest version of this model, ChemGenome2.1 (CG2.1). The first module of the protocol builds a three-dimensional vector from three calculated quantities for each codon: the double-helical trinucleotide base pairing energy, the base pair stacking energy, and an index of the propensity of a codon for protein-nucleic acid interactions. As this three-dimensional vector moves along any genome, the net orientation of the resultant vector should differ significantly for genic and non-genic regions to make a distinction feasible. The putative protein-coding genes predicted from the above parameters are passed through a second module of the protocol, which reduces the number of false positives by utilizing a filter based on stereochemical properties of protein sequences. The chemical properties of amino acid side chains taken into consideration are the presence of sp3 hybridized γ carbon atom, hydrogen bond donor ability, short/absence of δ carbon and linearity of the side chains/non-occurrence of bi-dentate forks with terminal hydrogen atoms in the side chain. The final prediction of the potential protein-coding genes is based on the frequency of occurrence of amino acids in the predicted protein sequences and their deviation from the frequency values of Swissprot protein sequences, both at monomer and tripeptide levels. The final screening is based on Z-score.
Though CG2.1 is a gene finding tool for prokaryotes, considering the underlying similarity in the chemical and physical properties of DNA among prokaryotes and eukaryotes, we attempted to evaluate its applicability for gene finding in the lower eukaryotes. The results give hope that the concept of gene finding based on a physicochemical model of codons is a viable idea for eukaryotes as well, though, undoubtedly, improvements are needed.
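The first module described in this abstract, a per-codon three-dimensional vector whose resultant orientation is tracked along the sequence, can be illustrated with a minimal Python sketch. This is only one plausible reading of the abstract, not the CG2.1 implementation: the function name `resultant_orientation` is hypothetical, and the lookup table must be supplied by the caller because the published energy and propensity values are not given here. The values in the demo are arbitrary placeholders.

```python
import math
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]


def resultant_orientation(seq: str, table: Dict[str, Vec3], frame: int = 0) -> Vec3:
    """Sum the per-codon vectors in one reading frame and return the unit resultant.

    `table` maps a codon to (pairing energy, stacking energy, interaction index).
    Codons missing from the table (for example those containing N) are skipped.
    """
    total = [0.0, 0.0, 0.0]
    for i in range(frame, len(seq) - 2, 3):
        vec = table.get(seq[i:i + 3].upper())
        if vec is None:
            continue
        for k in range(3):
            total[k] += vec[k]
    norm = math.sqrt(sum(x * x for x in total))
    if norm == 0.0:
        return (0.0, 0.0, 0.0)
    return (total[0] / norm, total[1] / norm, total[2] / norm)


if __name__ == "__main__":
    # Placeholder values only: NOT the ChemGenome parameters.
    toy_table = {"ATG": (1.0, 0.0, 0.0), "AAA": (0.0, 1.0, 0.0)}
    print(resultant_orientation("ATGAAA", toy_table))
```

Comparing the orientation returned for candidate regions against that of known non-genic regions would be the next step; how that comparison is thresholded is not stated in the abstract and is left open here.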
Affiliation(s)
- Akhilesh Mishra
- Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India
- Kusuma School of Biological Sciences, Indian Institute of Technology Delhi, New Delhi, India
- Priyanka Siwach
- Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India
- Department of Biotechnology, Chaudhary Devi Lal University, Sirsa, Haryana, India
- Poonam Singhal
- Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India
- B Jayaram
- Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India.
- Kusuma School of Biological Sciences, Indian Institute of Technology Delhi, New Delhi, India.
- Department of Chemistry, Indian Institute of Technology Delhi, New Delhi, India.
|
34
|
Al-Ajlan A, El Allali A. Feature selection for gene prediction in metagenomic fragments. BioData Min 2018; 11:9. [PMID: 30026811 PMCID: PMC6047368 DOI: 10.1186/s13040-018-0170-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 01/14/2018] [Accepted: 05/01/2018] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND: Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. Due to the large feature space, current de novo gene prediction algorithms use different combinations of classification algorithms to distinguish between coding and non-coding sequences. RESULTS: In this study, we apply a filter method to select relevant features from a large set of known features instead of combining them using linear classifiers or ignoring their individual coding potential. We use minimum redundancy maximum relevance (mRMR) to select the most relevant features. Support vector machines (SVM) are trained using these features, and the classification score is transformed into the posterior probability of the coding class. A greedy algorithm uses the probability of overlapped candidate genes to select the final genes. Instead of using one model for all sequences, we train an ensemble of SVM models on mutually exclusive datasets based on GC content and use the appropriate model to classify candidate genes based on their read's GC content. CONCLUSION: Our proposed algorithm achieves an improvement over some existing algorithms. mRMR produces promising results in gene prediction. It improves classification performance and feature interpretation. Our research serves as a basis for future studies on feature selection for gene prediction.
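The greedy step mentioned in this abstract, choosing final genes among overlapping candidates by their coding probability, might look roughly like the Python sketch below. The details are assumptions made for illustration: candidates are modelled as `(start, end, probability)` tuples with inclusive coordinates, any shared position counts as an overlap, and the `min_prob` cut-off is hypothetical. The paper's actual overlap rule and threshold are not stated in the abstract.

```python
from typing import List, Tuple

# (start, end, coding probability); coordinates are inclusive (assumed convention)
Candidate = Tuple[int, int, float]


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True when the two intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def select_genes(candidates: List[Candidate], min_prob: float = 0.5) -> List[Candidate]:
    """Greedily keep the most probable candidates that do not overlap a kept one."""
    chosen: List[Candidate] = []
    # Visit candidates from highest to lowest probability.
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        if cand[2] < min_prob:
            break  # every remaining candidate is below the cut-off
        if not any(overlaps(cand, kept) for kept in chosen):
            chosen.append(cand)
    return sorted(chosen)  # report in coordinate order


if __name__ == "__main__":
    demo = [(1, 300, 0.91), (250, 600, 0.75), (650, 900, 0.62), (700, 950, 0.40)]
    print(select_genes(demo))
```

In the demo, the second candidate is dropped because it overlaps the more probable first one, and the fourth falls below the assumed cut-off.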
Affiliation(s)
- Amani Al-Ajlan
- College of Computer and Information Sciences, Computer Science Department, King Saud University, Riyadh, Saudi Arabia
- Achraf El Allali
- College of Computer and Information Sciences, Computer Science Department, King Saud University, Riyadh, Saudi Arabia
|
35
|
Alves P, Liu S, Wang D, Gerstein M. Multiple-Swarm Ensembles: Improving the Predictive Power and Robustness of Predictive Models and Its Use in Computational Biology. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:926-933. [PMID: 28391206 DOI: 10.1109/tcbb.2017.2691329] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 06/07/2023]
Abstract
Machine learning is an integral part of computational biology, and has already shown its use in various applications, such as prognostic tests. In the last few years in the non-biological machine learning community, ensembling techniques have shown their power in data mining competitions such as the Netflix challenge; however, such methods have not found wide use in computational biology. In this work, we endeavor to show how ensembling techniques can be applied to practical problems, including problems in the field of bioinformatics, and how they often outperform other machine learning techniques in both predictive power and robustness. Furthermore, we develop a methodology of ensembling, Multi-Swarm Ensemble (MSWE), by using multiple particle swarm optimizations, and demonstrate its ability to further enhance the performance of ensembles.
|
36
|
Singh A, Mishra A, Khosravi A, Khandelwal G, Jayaram B. Physico-chemical fingerprinting of RNA genes. Nucleic Acids Res 2017; 45:e47. [PMID: 27932456 PMCID: PMC5397174 DOI: 10.1093/nar/gkw1236] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/09/2016] [Accepted: 11/29/2016] [Indexed: 12/13/2022] Open
Abstract
We advance here a novel concept for characterizing different classes of RNA genes on the basis of physico-chemical properties of DNA sequences. As knowledge-based approaches could yield unsatisfactory outcomes due to limitations of training on available experimental data sets, alternative approaches that utilize properties intrinsic to DNA are needed to supplement training-based methods and to eventually provide molecular insights into genome organization. Based on a comprehensive series of molecular dynamics simulations of the Ascona B-DNA consortium, we extracted hydrogen bonding, stacking and solvation energies of all combinations of DNA sequences at the dinucleotide level and calculated these properties for different types of RNA genes. Considering ∼7.3 million mRNA, 255 524 tRNA, 40 649 rRNA (different subunits), 5250 miRNA and 3747 snRNA gene sequences from 9282 complete genome chromosomes of all prokaryotes and eukaryotes available at NCBI, we observed that physico-chemical properties of different functional units on genomic DNA differ in their signatures.
Affiliation(s)
- Ankita Singh
- Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India
- Akhilesh Mishra
- Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India; Kusuma School of Biological Sciences, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India
- Ali Khosravi
- Ale-Taha Institute of Higher Education, Tehran, Iran
- Garima Khandelwal
- Cancer Research UK Manchester Institute, The University of Manchester, Wilmslow Road, Manchester M20 4BX, UK
- B Jayaram
- Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India; Kusuma School of Biological Sciences, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India; Department of Chemistry, Indian Institute of Technology, Hauz Khas, New Delhi-110016, India
|
37
|
Hernandez-Valladares M, Vaudel M, Selheim F, Berven F, Bruserud Ø. Proteogenomics approaches for studying cancer biology and their potential in the identification of acute myeloid leukemia biomarkers. Expert Rev Proteomics 2017; 14:649-663. [DOI: 10.1080/14789450.2017.1352474] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 01/02/2023]
Affiliation(s)
- Maria Hernandez-Valladares
- Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Marc Vaudel
- KG Jebsen Center for Diabetes Research, Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
- Frode Selheim
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Frode Berven
- Proteomics Unit, Department of Biomedicine, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
- Øystein Bruserud
- Department of Clinical Science, Faculty of Medicine and Dentistry, University of Bergen, Bergen, Norway
|
38
|
Marx H, Hahne H, Ulbrich SE, Schnieke A, Rottmann O, Frishman D, Kuster B. Annotation of the Domestic Pig Genome by Quantitative Proteogenomics. J Proteome Res 2017. [PMID: 28625053 DOI: 10.1021/acs.jproteome.7b00184] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Indexed: 01/08/2023]
Abstract
The pig is one of the earliest domesticated animals in the history of human civilization and represents one of the most important livestock animals. The recent sequencing of the Sus scrofa genome was a major step toward the comprehensive understanding of porcine biology, evolution, and its utility as a promising large animal model for biomedical and xenotransplantation research. However, the functional and structural annotation of the Sus scrofa genome is far from complete. Here, we present mass spectrometry-based quantitative proteomics data of nine juvenile organs and six embryonic stages between 18 and 39 days after gestation. We found that the data provide evidence for and improve the annotation of 8176 protein-coding genes, including 588 novel and 321 refined gene models. The analysis of tissue-specific proteins and the temporal expression profiles of embryonic proteins provides an initial functional characterization of expressed protein interaction networks and modules, including as yet uncharacterized proteins. Comparative transcript and protein expression analysis against human organs reveals a moderate conservation of protein translation across species. We anticipate that this resource will facilitate basic and applied research on Sus scrofa as well as its porcine relatives.
Affiliation(s)
- Dmitrij Frishman
- Institute of Bioinformatics and Systems Biology, German Research Center for Environmental Health, Neuherberg, Germany; St Petersburg State Polytechnical University, St Petersburg, Russia
- Bernhard Kuster
- Center for Integrated Protein Science Munich, Munich, Germany
|
39
|
Mossotto E, Ashton JJ, Coelho T, Beattie RM, MacArthur BD, Ennis S. Classification of Paediatric Inflammatory Bowel Disease using Machine Learning. Sci Rep 2017; 7:2427. [PMID: 28546534 PMCID: PMC5445076 DOI: 10.1038/s41598-017-02606-2] [Citation(s) in RCA: 107] [Impact Index Per Article: 13.4] [Received: 01/24/2017] [Accepted: 04/12/2017] [Indexed: 02/07/2023] Open
Abstract
Paediatric inflammatory bowel disease (PIBD), comprising Crohn's disease (CD), ulcerative colitis (UC) and inflammatory bowel disease unclassified (IBDU), is a complex and multifactorial condition with increasing incidence. An accurate diagnosis of PIBD is necessary for prompt and effective treatment. This study utilises machine learning (ML) to classify disease using endoscopic and histological data for 287 children diagnosed with PIBD. Data were used to develop, train, test and validate a ML model to classify disease subtype. Unsupervised models revealed overlap of CD/UC with broad clustering but no clear subtype delineation, whereas hierarchical clustering identified four novel subgroups characterised by differing colonic involvement. Three supervised ML models were developed utilising endoscopic data only, histological data only and combined endoscopic/histological data, yielding classification accuracies of 71.0%, 76.9% and 82.7% respectively. The optimal combined model was tested on a statistically independent cohort of 48 PIBD patients from the same clinic, accurately classifying 83.3% of patients. This study employs mathematical modelling of endoscopic and histological data to aid diagnostic accuracy. While unsupervised modelling categorises patients into four subgroups, supervised approaches confirm the need for both endoscopic and histological evidence for an accurate diagnosis. Overall, this paper provides a blueprint for ML use with clinical data.
Affiliation(s)
- E Mossotto
- Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
- Institute for Life Sciences, University of Southampton, Southampton, UK
- J J Ashton
- Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
- Department of Paediatric Gastroenterology, Southampton Children's Hospital, Southampton, UK
- T Coelho
- Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK
- Department of Paediatric Gastroenterology, Southampton Children's Hospital, Southampton, UK
- R M Beattie
- Department of Paediatric Gastroenterology, Southampton Children's Hospital, Southampton, UK
- B D MacArthur
- Institute for Life Sciences, University of Southampton, Southampton, UK
- S Ennis
- Human Genetics and Genomic Medicine, University of Southampton, Southampton, UK.
|
40
|
Alt-Splice Gene Predictor Using Multitrack-Clique Analysis: Verification of Statistical Support for Modelling in Genomes of Multicellular Eukaryotes. INFORMATICS 2017. [DOI: 10.3390/informatics4010003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/16/2022] Open
|
41
|
Datta KK, Patil AH, Patel K, Dey G, Madugundu AK, Renuse S, Kaviyil JE, Sekhar R, Arunima A, Daswani B, Kaur I, Mohanty J, Sinha R, Jaiswal S, Sivapriya S, Sonnathi Y, Chattoo BB, Gowda H, Ravikumar R, Prasad TSK. Proteogenomics of Candida tropicalis--An Opportunistic Pathogen with Importance for Global Health. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2017; 20:239-47. [PMID: 27093108 DOI: 10.1089/omi.2015.0197] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 11/13/2022]
Abstract
The frequency of Candida infections is currently rising, and thus adversely impacting global health. The situation is exacerbated by azole resistance developed by fungal pathogens. Candida tropicalis is an opportunistic pathogen that causes candidiasis, for example, in immune-compromised individuals, cancer patients, and those who undergo organ transplantation. It is a member of the non-albicans group of Candida that are known to be azole-resistant, and is frequently seen in individuals being treated for cancers, HIV-infection, and those who underwent bone marrow transplantation. Although the genome of C. tropicalis was sequenced in 2009, the genome annotation has not been supported by experimental validation. In the present study, we have carried out proteomics profiling of C. tropicalis using high-resolution Fourier transform mass spectrometry. We identified 2743 proteins, thus mapping nearly 44% of the computationally predicted protein-coding genes with peptide level evidence. In addition to identifying 2591 proteins in the cell lysate of this yeast, we also analyzed the proteome of the conditioned media of C. tropicalis culture and identified several unique secreted proteins among a total of 780 proteins. By subjecting the mass spectrometry data derived from cell lysate and conditioned media to proteogenomic analysis, we identified 86 novel genes, 12 novel exons, and corrected 49 computationally-predicted gene models. To our knowledge, this is the first high-throughput proteomics study of C. tropicalis validating predicted protein coding genes and refining the current genome annotation. The findings may prove useful in future global health efforts to fight against Candida infections.
Affiliation(s)
- Keshava K Datta
- Institute of Bioinformatics, International Technology Park, Bangalore, India; School of Biotechnology, KIIT University, Bhubaneswar, India
- Arun H Patil
- Institute of Bioinformatics, International Technology Park, Bangalore, India; School of Biotechnology, KIIT University, Bhubaneswar, India
- Krishna Patel
- Institute of Bioinformatics, International Technology Park, Bangalore, India; Amrita School of Biotechnology, Amrita Vishwa Vidyapeetham, Kollam, India
- Gourav Dey
- Institute of Bioinformatics, International Technology Park, Bangalore, India; Manipal University, Madhav Nagar, Manipal, India
- Anil K Madugundu
- Institute of Bioinformatics, International Technology Park, Bangalore, India; Centre for Bioinformatics, School of Life Sciences, Pondicherry University, Puducherry, India
- Santosh Renuse
- Institute of Bioinformatics, International Technology Park, Bangalore, India; Amrita School of Biotechnology, Amrita Vishwa Vidyapeetham, Kollam, India
- Jyothi E Kaviyil
- Department of Neuromicrobiology, Neurobiology Research Centre, National Institute of Mental Health and Neurosciences, Bangalore, India
- Raja Sekhar
- Institute of Bioinformatics, International Technology Park, Bangalore, India; Centre for Bioinformatics, School of Life Sciences, Pondicherry University, Puducherry, India
- Bhavna Daswani
- National Institute for Research in Reproductive Health (ICMR), Parel, Mumbai, India
- Inderjeet Kaur
- Malaria Research Group, International Center for Genetic Engineering and Biotechnology (ICGEB), New Delhi, India
- Jyotirmaya Mohanty
- ICAR-Central Institute of Freshwater Aquaculture, Kausalyaganga, Bhubaneswar, India
- S Sivapriya
- Department of Ocular Pathology, Vision Research Foundation, Chennai, India
- Bharat B Chattoo
- Centre for Genome Research, Department of Microbiology and Biotechnology Centre, Faculty of Science, The M. S. University of Baroda, Vadodara, India
- Harsha Gowda
- Institute of Bioinformatics, International Technology Park, Bangalore, India; School of Biotechnology, KIIT University, Bhubaneswar, India; YU-IOB Center for Systems Biology and Molecular Medicine, Yenepoya University, Mangalore, India
- Raju Ravikumar
- Department of Neuromicrobiology, Neurobiology Research Centre, National Institute of Mental Health and Neurosciences, Bangalore, India
- T S Keshava Prasad
- Institute of Bioinformatics, International Technology Park, Bangalore, India; YU-IOB Center for Systems Biology and Molecular Medicine, Yenepoya University, Mangalore, India; NIMHANS-IOB Proteomics and Bioinformatics Laboratory, Neurobiology Research Centre, National Institute of Mental Health and Neurosciences, Bangalore, India
|
42
|
Díez P, Fuentes M. Proteogenomics for the Comprehensive Analysis of Human Cellular and Serum Antibody Repertoires. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2016; 926:153-162. [PMID: 27686811 DOI: 10.1007/978-3-319-42316-6_10] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/14/2022]
Abstract
The vast repertoire of immunoglobulins produced by the immune system is a consequence of the huge number of antigens to which we are exposed every day. The diversity of these immunoglobulins is due to different mechanisms (including VDJ recombination, somatic hypermutation, and antigen selection). Understanding how the immune system is capable of generating this diversity and what the molecular bases of the composition of immunoglobulins are remain key challenges in the immunological field. During the last decades, several techniques have emerged as promising strategies to achieve these goals, but it is their combination which appears to be the fruitful solution for increasing the knowledge about human cellular and serum antibody repertoires. In this chapter, we address the diverse strategies focused on the analysis of immunoglobulin repertoires as well as the characterization of the genomic and peptide sequences. Moreover, the advantages of combining various -omics approaches are discussed through a review of different published studies, showing the benefits in clinical areas.
Affiliation(s)
- Paula Díez
- Department of Medicine and General Cytometry Service-Nucleus, Cancer Research Centre (IBMCC/CSIC/USAL/IBSAL), Avda. Universidad de Coimbra, S/N 37007, Salamanca, Spain; Proteomics Unit, Cancer Research Centre (IBMCC/CSIC/USAL/IBSAL), Avda. Universidad de Coimbra, S/N 37007, Salamanca, Spain
- Manuel Fuentes
- Department of Medicine and General Cytometry Service-Nucleus, Cancer Research Centre (IBMCC/CSIC/USAL/IBSAL), Avda. Universidad de Coimbra, S/N 37007, Salamanca, Spain; Proteomics Unit, Cancer Research Centre (IBMCC/CSIC/USAL/IBSAL), Avda. Universidad de Coimbra, S/N 37007, Salamanca, Spain
|
43
|
Abstract
We examine exon junctions near apparent amino acid insertions and deletions in alignments of orthologous protein-coding genes. In 1,917 ortholog families across nine oomycete genomes, 10–20% of introns are near an alignment gap, indicating at first sight that splice-site displacements are frequent. We designed a robust algorithmic procedure for the delineation of intron-containing homologous regions, and combined it with a parsimony-based reconstruction of intron loss, gain, and splice-site shift events on a phylogeny. The reconstruction implies that 12% of introns underwent an acceptor-site shift, and 10% underwent a donor-site shift. In order to offset gene annotation problems, we amended the procedure with the reannotation of intron boundaries using alignment evidence. The corresponding reconstruction involves far fewer intron gain and splice-site shift events. The frequency of acceptor- and donor-side shifts drops to 4% and 3%, respectively, which is not much different from what one would expect from random codon insertions and deletions. In other words, gaps near exon junctions are mostly artifacts of gene annotation rather than evidence of sliding intron boundaries. Our study underscores the importance of using well-supported gene structure annotations in comparative studies. When transcription evidence is not available, we propose a robust ancestral reconstruction procedure that corrects misannotated intron boundaries using sequence alignments. The results corroborate the view that boundary shifts and complete intron sliding are only accidental in eukaryotic genome evolution and have a negligible impact on protein diversity.
Affiliation(s)
- Steven Sêton Bocco
- Department of Biochemistry and Molecular Medicine, University of Montréal, Montréal, Canada
| | - Miklós Csűrös
- Department of Computer Science and Operations Research, University of Montréal, Montréal, Canada; Institute of Genetics, Biological Research Centre, Hungarian Academy of Sciences, Szeged, Hungary
| |
|
44
|
Xu D, Pavlidis P, Thamadilok S, Redwood E, Fox S, Blekhman R, Ruhl S, Gokcumen O. Recent evolution of the salivary mucin MUC7. Sci Rep 2016; 6:31791. [PMID: 27558399 PMCID: PMC4997351 DOI: 10.1038/srep31791] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 02/02/2016] [Accepted: 07/26/2016] [Indexed: 11/23/2022] Open
Abstract
Genomic structural variants constitute the majority of variable base pairs in primate genomes and affect gene function in multiple ways. While whole gene duplications and deletions are relatively well-studied, the biology of subexonic (i.e., within coding exon sequences) copy number variation remains elusive. The salivary MUC7 gene provides an opportunity for studying such variation, as it harbors copy number variable subexonic repeat sequences that encode densely O-glycosylated domains (PTS-repeats) with microbe-binding properties. To understand the evolution of this gene, we analyzed mammalian and primate genomes within a comparative framework. Our analyses revealed that (i) MUC7 emerged in the placental mammal ancestor and rapidly gained multiple sites for O-glycosylation; (ii) MUC7 has retained its extracellular activity in saliva in placental mammals; (iii) the anti-fungal domain of the protein was remodified under positive selection in the primate lineage; and (iv) MUC7 PTS-repeats have evolved recurrently and under adaptive constraints. Our results establish MUC7 as a major player in salivary adaptation, likely as a response to diverse pathogenic exposure in primates. On a broader scale, our study highlights variable subexonic repeats as a primary source for modular evolutionary innovation that leads to rapid functional adaptation.
Affiliation(s)
- Duo Xu
- Department of Biological Sciences, State University of New York at Buffalo, New York 14260, USA
| | - Pavlos Pavlidis
- Institute of Computer Science (ICS), Foundation of Research and Technology-Hellas, Heraklion, Crete, Greece
| | - Supaporn Thamadilok
- Department of Oral Biology, School of Dental Medicine, State University of New York at Buffalo, New York 14214, USA
| | - Emilie Redwood
- Department of Biological Sciences, State University of New York at Buffalo, New York 14260, USA
| | - Sara Fox
- Department of Biological Sciences, State University of New York at Buffalo, New York 14260, USA
| | - Ran Blekhman
- Department of Genetics, Cell Biology, and Development, University of Minnesota, Twin Cities, Minnesota 55455, USA
| | - Stefan Ruhl
- Department of Oral Biology, School of Dental Medicine, State University of New York at Buffalo, New York 14214, USA
| | - Omer Gokcumen
- Department of Biological Sciences, State University of New York at Buffalo, New York 14260, USA
| |
|
45
|
Klasberg S, Bitard-Feildel T, Mallet L. Computational Identification of Novel Genes: Current and Future Perspectives. Bioinform Biol Insights 2016; 10:121-31. [PMID: 27493475 PMCID: PMC4970615 DOI: 10.4137/bbi.s39950] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 04/15/2016] [Revised: 05/31/2016] [Accepted: 06/05/2016] [Indexed: 12/31/2022] Open
Abstract
While it has long been thought that all genomic novelties are derived from existing material, many genes lacking homology to known genes were found in recent genome projects. Some of these novel genes were proposed to have evolved de novo, i.e., out of noncoding sequences, whereas some have been shown to follow a duplication and divergence process. Their discovery called for an extension of the historical hypotheses about gene origination. Besides the theoretical breakthrough, increasing evidence has accumulated that novel genes play important roles in evolutionary processes, including adaptation and speciation events. Different techniques are available to identify genes and classify them as novel. Their classification as novel is usually based on their similarity to known genes, or lack thereof, detected by comparative genomics or by searches against databases. Computational approaches are further prime methods that can be based on existing models or can leverage biological evidence from experiments. Identification of novel genes, however, remains a challenging task. With constant software and technology updates, no gold standard, and no available benchmark, evaluation and characterization of genomic novelty is a vibrant field. In this review, the classical and state-of-the-art tools for gene prediction are introduced. The current methods for novel gene detection are presented; the methodological strategies and their limits are discussed along with prospective approaches for further studies.
Affiliation(s)
- Steffen Klasberg
- Institute for Evolution and Biodiversity, Westfalian Wilhelms University Muenster, Huefferstrasse 1, Muenster, Germany
| | - Tristan Bitard-Feildel
- Institute for Evolution and Biodiversity, Westfalian Wilhelms University Muenster, Huefferstrasse 1, Muenster, Germany
| | - Ludovic Mallet
- Institute for Evolution and Biodiversity, Westfalian Wilhelms University Muenster, Huefferstrasse 1, Muenster, Germany
| |
|
46
|
Huang Y, Chen SY, Deng F. Well-characterized sequence features of eukaryote genomes and implications for ab initio gene prediction. Comput Struct Biotechnol J 2016; 14:298-303. [PMID: 27536341 PMCID: PMC4975701 DOI: 10.1016/j.csbj.2016.07.002] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 04/23/2016] [Revised: 07/06/2016] [Accepted: 07/12/2016] [Indexed: 12/31/2022] Open
Abstract
In silico analysis of DNA sequences is an important area of computational biology in the post-genomic era. Over the past two decades, computational approaches for ab initio prediction of gene structure from genome sequence alone have greatly facilitated our understanding of a variety of biological questions. Although the computational prediction of protein-coding genes is already well established, robustly finding non-coding RNA genes, such as miRNA and lncRNA, remains a challenge. The two main aspects of ab initio gene prediction are the computed values that describe sequence features and the algorithm used to train the discriminant function; different combinations of these are employed in various bioinformatic tools. Herein, we briefly review these well-characterized sequence features in eukaryote genomes and their applications to ab initio gene prediction. The main purpose of this article is to provide an overview for beginners who aim to develop related bioinformatic tools.
Affiliation(s)
- Ying Huang
- College of Veterinary Medicine, Sichuan Agricultural University, Chengdu 611130, China
| | - Shi-Yi Chen
- Farm Animal Genetic Resources Exploration and Innovation Key Laboratory of Sichuan Province, Sichuan Agricultural University, Chengdu 611130, China
- Corresponding author at: Farm Animal Genetic Resources Exploration and Innovation Key Laboratory of Sichuan Province, Sichuan Agricultural University, 211# Huimin Road, Wenjiang 611130, Sichuan, China.
| | - Feilong Deng
- Farm Animal Genetic Resources Exploration and Innovation Key Laboratory of Sichuan Province, Sichuan Agricultural University, Chengdu 611130, China
| |
|
47
|
Granados-Riveron JT, Aquino-Jarquin G. The complexity of the translation ability of circRNAs. BIOCHIMICA ET BIOPHYSICA ACTA-GENE REGULATORY MECHANISMS 2016; 1859:1245-51. [PMID: 27449861 DOI: 10.1016/j.bbagrm.2016.07.009] [Citation(s) in RCA: 154] [Impact Index Per Article: 17.1] [Received: 03/24/2016] [Revised: 06/21/2016] [Accepted: 07/15/2016] [Indexed: 12/12/2022]
Abstract
Circular RNAs (circRNAs) are a new class of long non-coding RNAs that play a potential role in gene expression regulation, acting as efficient microRNA sponges. The latest surprise concerning circRNAs is that they can serve as transcriptional activators in human cells, indicating that circRNAs are involved in important regulatory tasks. Recently, new insight has been gained about the coding potential of circular viroid RNAs, as well as the presence of Internal Ribosomal Entry Sites (IRES) allowing the formation of peptides or proteins from circular RNA. Here, we discuss the current state of our knowledge regarding evidence supporting the hypothesis that circRNAs serve as protein-coding sequences in vitro and in vivo. We also remark on the difficulties of their identification and highlight some tools currently available for exploring the coding potential of circRNA.
Affiliation(s)
- Javier T Granados-Riveron
- Laboratorio de Investigación en Genómica, Genética y Bioinformática, Torre de Hemato-Oncología, 4to Piso, Sección 2, Hospital Infantil de México, Federico Gómez, Mexico
| | - Guillermo Aquino-Jarquin
- Laboratorio de Investigación en Genómica, Genética y Bioinformática, Torre de Hemato-Oncología, 4to Piso, Sección 2, Hospital Infantil de México, Federico Gómez, Mexico.
| |
|
48
|
Singh S, Kaur S, Goel N. A Review of Computational Intelligence Methods for Eukaryotic Promoter Prediction. NUCLEOSIDES NUCLEOTIDES & NUCLEIC ACIDS 2016; 34:449-62. [PMID: 26158565 DOI: 10.1080/15257770.2015.1013126] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 01/24/2023]
Abstract
In past decades, the prediction of genes in DNA sequences has attracted the attention of many researchers, but because of the complex structure of DNA it is extremely difficult to locate genes correctly. DNA contains a large number of regulatory regions that help in the transcription of a gene. The promoter is one such region, and finding its location is a challenging problem. Various computational methods for promoter prediction have been developed over the past few years. This paper reviews these promoter prediction methods. Several difficulties and pitfalls encountered by these methods are also detailed, along with future research directions.
Affiliation(s)
- Shailendra Singh
- Department of Computer Science and Engineering, PEC University of Technology, Chandigarh, India
| | | | | |
|
49
|
Neuhaus K, Landstorfer R, Fellner L, Simon S, Schafferhans A, Goldberg T, Marx H, Ozoline ON, Rost B, Kuster B, Keim DA, Scherer S. Translatomics combined with transcriptomics and proteomics reveals novel functional, recently evolved orphan genes in Escherichia coli O157:H7 (EHEC). BMC Genomics 2016; 17:133. [PMID: 26911138 PMCID: PMC4765031 DOI: 10.1186/s12864-016-2456-1] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Received: 05/27/2015] [Accepted: 02/09/2016] [Indexed: 12/30/2022] Open
Abstract
Background: Genomes of E. coli, including that of the human pathogen Escherichia coli O157:H7 (EHEC) EDL933, still harbor undetected protein-coding genes which, apparently, have escaped annotation due to their small size and non-essential function. To find such genes, global gene expression of EHEC EDL933 was examined, using strand-specific RNAseq (transcriptome), ribosomal footprinting (translatome) and mass spectrometry (proteome). Results: Using the above methods, 72 short, non-annotated protein-coding genes were detected. All of these showed signals in the ribosomal footprinting assay indicating mRNA translation. Seven were verified by mass spectrometry. Fifty-seven genes are annotated in other Enterobacteriaceae, mainly as hypothetical genes; the remaining 15 genes constitute novel discoveries. In addition, protein structure and function were predicted computationally and compared between EHEC-encoded proteins and 100-times randomly shuffled proteins. Based on this comparison, 61 of the 72 novel proteins exhibit predicted structural and functional features similar to those of annotated proteins. Many of the novel genes show differential transcription when grown under eleven diverse growth conditions, suggesting environmental regulation. Three genes were found to confer a phenotype in previous studies, e.g., decreased cattle colonization. Conclusions: These findings demonstrate that ribosomal footprinting can be used to detect novel protein-coding genes, contributing to the growing body of evidence that hypothetical genes are not annotation artifacts and opening an additional way to study their functionality. All 72 genes are taxonomically restricted and, therefore, appear to have evolved relatively recently de novo.
Affiliation(s)
- Klaus Neuhaus
- Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany.
| | - Richard Landstorfer
- Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany.
| | - Lea Fellner
- Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany.
| | - Svenja Simon
- Lehrstuhl für Datenanalyse und Visualisierung, Fachbereich Informatik und Informationswissenschaft, Universität Konstanz, Box 78, 78457, Konstanz, Germany.
| | - Andrea Schafferhans
- Department of Informatics - Bioinformatics & TUM-IAS, Technische Universität München, Boltzmannstraße 3, 85748, Garching, Germany.
| | - Tatyana Goldberg
- Department of Informatics - Bioinformatics & TUM-IAS, Technische Universität München, Boltzmannstraße 3, 85748, Garching, Germany.
| | - Harald Marx
- Chair of Proteomics and Bioanalytics, Wissenschaftszentrum Weihenstephan, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354, Freising, Germany.
| | - Olga N Ozoline
- Institute of Cell Biophysics, Russian Academy of Sciences, Moscow Region, 142290, Pushchino, Russia.
| | - Burkhard Rost
- Department of Informatics - Bioinformatics & TUM-IAS, Technische Universität München, Boltzmannstraße 3, 85748, Garching, Germany.
| | - Bernhard Kuster
- Chair of Proteomics and Bioanalytics, Wissenschaftszentrum Weihenstephan, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354, Freising, Germany; Bavarian Center for Biomolecular Mass Spectrometry (BayBioMS), Technische Universität München, Gregor-Mendel-Str. 4, 85354, Freising, Germany.
| | - Daniel A Keim
- Lehrstuhl für Datenanalyse und Visualisierung, Fachbereich Informatik und Informationswissenschaft, Universität Konstanz, Box 78, 78457, Konstanz, Germany.
| | - Siegfried Scherer
- Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany.
| |
|
50
|
Mouilleron H, Delcourt V, Roucou X. Death of a dogma: eukaryotic mRNAs can code for more than one protein. Nucleic Acids Res 2016; 44:14-23. [PMID: 26578573 PMCID: PMC4705651 DOI: 10.1093/nar/gkv1218] [Citation(s) in RCA: 72] [Impact Index Per Article: 8.0] [Received: 07/08/2015] [Revised: 10/26/2015] [Accepted: 10/28/2015] [Indexed: 12/13/2022] Open
Abstract
mRNAs carry the genetic information that is translated by ribosomes. The traditional view of a mature eukaryotic mRNA is a molecule with three main regions, the 5' UTR, the protein coding open reading frame (ORF) or coding sequence (CDS), and the 3' UTR. This concept assumes that ribosomes translate one ORF only, generally the longest one, and produce one protein. As a result, in the early days of genomics and bioinformatics, one CDS was associated with each protein-coding gene. This fundamental concept of a single CDS is being challenged by increasing experimental evidence indicating that annotated proteins are not the only proteins translated from mRNAs. In particular, mass spectrometry (MS)-based proteomics and ribosome profiling have detected productive translation of alternative open reading frames. In several cases, the alternative and annotated proteins interact. Thus, the expression of two or more proteins translated from the same mRNA may offer a mechanism to ensure the co-expression of proteins which have functional interactions. Translational mechanisms already described in eukaryotic cells indicate that the cellular machinery is able to translate different CDSs from a single viral or cellular mRNA. In addition to summarizing data showing that the protein coding potential of eukaryotic mRNAs has been underestimated, this review aims to challenge the single translated CDS dogma.
Affiliation(s)
- Hélène Mouilleron
- Department of biochemistry, Université de Sherbrooke, Sherbrooke, Quebec J1E 4K8, Canada; PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Quebec, Canada
| | - Vivian Delcourt
- Department of biochemistry, Université de Sherbrooke, Sherbrooke, Quebec J1E 4K8, Canada; PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Quebec, Canada; Inserm U-1192, Laboratoire de Protéomique, Réponse Inflammatoire, Spectrométrie de Masse (PRISM), Université de Lille 1, Cité Scientifique, 59655 Villeneuve D'Ascq, France
| | - Xavier Roucou
- Department of biochemistry, Université de Sherbrooke, Sherbrooke, Quebec J1E 4K8, Canada; PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Quebec, Canada
| |
|