1. Taleb NN, Zalloua P, Elbassioni K, Hatzikirou H, Henschel A, Platt DE. Informational rescaling of PCA maps with application to genetic distance. Comput Struct Biotechnol J 2024; 27:48-56. [PMID: 39802212 PMCID: PMC11719279 DOI: 10.1016/j.csbj.2024.11.042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2024] [Revised: 11/19/2024] [Accepted: 11/26/2024] [Indexed: 01/16/2025]
Abstract
Principal Component Analysis (PCA) is a powerful multivariate tool allowing the projection of data in low-dimensional representations. Nevertheless, datapoint distances on these low-dimensional projections are challenging to interpret. Here, we propose a computationally simple heuristic to transform a map based on standard PCA (when the variables are asymptotically Gaussian) into an entropy-based map where distances are based on mutual information (MI). Moreover, we show that in certain instances our proposed scaled PCA can improve cluster identification. Rescaling principal component-based distances using MI results in a representation of relative statistical associations when, as in genetics, it is applied on bit measurements between individuals' genomic mutual information. This entropy-rescaled PCA, while preserving order relationships (along a dimension), quantifies relative distances into information units, such as "bits". We illustrate the effect of this rescaling using genomics data derived from world populations and describe how the interpretation of results is impacted.
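As background for the Gaussian assumption mentioned in the abstract: for a bivariate Gaussian pair with correlation coefficient rho, mutual information has the closed form I = -1/2 log2(1 - rho^2) bits. The sketch below only evaluates that textbook identity; the function name `gaussian_mi_bits` is hypothetical, and this is not the authors' rescaling procedure, which the abstract does not spell out.

```python
import math


def gaussian_mi_bits(rho: float) -> float:
    """Mutual information (bits) of a bivariate Gaussian with correlation rho.

    Textbook identity I = -0.5 * log2(1 - rho^2); illustrative only,
    not the rescaling heuristic proposed in the paper.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return -0.5 * math.log2(1.0 - rho * rho)


if __name__ == "__main__":
    # Correlation grows linearly here, while information in bits does not.
    for rho in (0.1, 0.5, 0.9, 0.99):
        print(f"rho={rho:.2f}  MI={gaussian_mi_bits(rho):.4f} bits")
```

The nonlinearity is the point of the illustration: equal steps in correlation do not correspond to equal steps in bits.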
Affiliation(s)
- Nassim Nicholas Taleb
  - Risk Engineering, School of Engineering, New York, USA
  - Maroun Semaan Faculty of Engineering and Architecture, American University of Beirut, Beirut, Lebanon
- Pierre Zalloua
  - College of Medicine and Health Sciences, Dept of Public Health and Epidemiology, Khalifa University, Abu Dhabi, United Arab Emirates
  - Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Khaled Elbassioni
  - College of Computing and Mathematical Sciences, Dept. of Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates
  - Center for Cyber-Physical Systems, Khalifa University, Abu Dhabi, United Arab Emirates
- Haralampos Hatzikirou
  - Center for Interdisciplinary Digital Sciences (CIDS), Department Information Services and High Performance Computing (ZIH), TUD Dresden University of Technology, Dresden, Germany
  - College of Computing and Mathematical Sciences, Dept of Mathematics, Abu Dhabi, United Arab Emirates
- Andreas Henschel
  - College of Computing and Mathematical Sciences, Dept. of Computer Science, Khalifa University, Abu Dhabi, United Arab Emirates
  - Center for Cyber-Physical Systems, Khalifa University, Abu Dhabi, United Arab Emirates

2. Bonidia RP, Avila Santos AP, de Almeida BLS, Stadler PF, Nunes da Rocha U, Sanches DS, de Carvalho ACPLF. Information Theory for Biological Sequence Classification: A Novel Feature Extraction Technique Based on Tsallis Entropy. Entropy (Basel, Switzerland) 2022; 24:1398. [PMID: 37420418 DOI: 10.3390/e24101398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2022] [Revised: 09/16/2022] [Accepted: 09/24/2022] [Indexed: 07/09/2023]
Abstract
In recent years, there has been an exponential growth in sequencing projects due to accelerated technological advances, leading to a significant increase in the amount of data and resulting in new challenges for biological sequence analysis. Consequently, the use of techniques capable of analyzing large amounts of data has been explored, such as machine learning (ML) algorithms. ML algorithms are being used to analyze and classify biological sequences, despite the intrinsic difficulty in extracting and finding representative biological sequence methods suitable for them. Thereby, extracting numerical features to represent sequences makes it statistically feasible to use universal concepts from Information Theory, such as Tsallis and Shannon entropy. In this study, we propose a novel Tsallis entropy-based feature extractor to provide useful information to classify biological sequences. To assess its relevance, we prepared five case studies: (1) an analysis of the entropic index q; (2) performance testing of the best entropic indices on new datasets; (3) a comparison made with Shannon entropy and (4) generalized entropies; (5) an investigation of the Tsallis entropy in the context of dimensionality reduction. As a result, our proposal proved to be effective, being superior to Shannon entropy and robust in terms of generalization, and also potentially representative for collecting information in fewer dimensions compared with methods such as Singular Value Decomposition and Uniform Manifold Approximation and Projection.
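To make the entropic index q concrete, here is a minimal sketch of Tsallis entropy, S_q = (1 - sum_i p_i^q) / (q - 1), computed over k-mer frequencies of a sequence, with Shannon entropy (in nats) as the q -> 1 limit. The helper names, the choice of overlapping k-mers, and the k values are assumptions for illustration; the abstract does not specify the authors' exact feature construction.

```python
import math
from collections import Counter


def kmer_probabilities(seq, k):
    """Relative frequencies of the overlapping k-mers observed in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def shannon_entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def tsallis_entropy(probs, q):
    """Tsallis entropy S_q; falls back to Shannon entropy when q == 1."""
    if abs(q - 1.0) < 1e-12:
        return shannon_entropy(probs)
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)


def tsallis_features(seq, q, ks=(1, 2, 3)):
    """One entropy value per k-mer size (hypothetical feature layout)."""
    return [tsallis_entropy(kmer_probabilities(seq, k), q) for k in ks]


if __name__ == "__main__":
    print(tsallis_features("ACGTACGTTTGACA", q=2.0))
```

Sweeping q over a grid and comparing downstream classifier accuracy would mirror, in spirit, the first case study listed in the abstract.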
Affiliation(s)
- Robson P Bonidia
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
- Anderson P Avila Santos
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research-UFZ GmbH, 04318 Leipzig, Germany
- Breno L S de Almeida
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
- Peter F Stadler
  - Department of Computer Science and Interdisciplinary Center of Bioinformatics, University of Leipzig, 04107 Leipzig, Germany
- Ulisses Nunes da Rocha
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research-UFZ GmbH, 04318 Leipzig, Germany
- Danilo S Sanches
  - Department of Computer Science, Federal University of Technology-Paraná-UTFPR, Cornélio Procópio 86300-000, Brazil
- André C P L F de Carvalho
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil

3. Bonidia RP, Domingues DS, Sanches DS, de Carvalho ACPLF. MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors. Brief Bioinform 2022; 23:bbab434. [PMID: 34750626 PMCID: PMC8769707 DOI: 10.1093/bib/bbab434] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 06/29/2021] [Revised: 09/18/2021] [Accepted: 09/20/2021] [Indexed: 12/24/2022]
Abstract
One of the main challenges in applying machine learning algorithms to biological sequence data is how to represent a sequence as a numeric input vector. Feature extraction techniques capable of extracting numerical information from biological sequences have been reported in the literature. However, many of these techniques, such as mathematical descriptors, are not available in existing packages. This paper presents a new package, MathFeature, which implements mathematical descriptors able to extract relevant numerical information from biological sequences, i.e. DNA, RNA and proteins (prediction of structural features along the primary sequence of amino acids). MathFeature makes available 20 numerical feature extraction descriptors based on approaches found in the literature, e.g. multiple numeric mappings, genomic signal processing, chaos game theory, entropy and complex networks. MathFeature also allows the extraction of alternative features, complementing the existing packages. To ensure that our descriptors are robust and to assess their relevance, experimental results are presented in nine case studies. According to these results, the features extracted by MathFeature showed high performance (accuracy 0.6350-0.9897), both when applying only mathematical descriptors and when combining them with well-known descriptors from the literature. Finally, with MathFeature, we outperformed several previous studies on eight benchmark datasets, exemplifying the robustness and viability of the proposed package. MathFeature advances the area by providing descriptors not available in other packages, as well as by allowing non-experts to use feature extraction techniques.
Affiliation(s)
- Robson P Bonidia
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
- Douglas S Domingues
  - Group of Genomics and Transcriptomes in Plants, Institute of Biosciences, São Paulo State University (UNESP), Rio Claro 13506-900, Brazil
- Danilo S Sanches
  - Department of Computer Science, Federal University of Technology - Paraná, UTFPR, Cornélio Procópio 86300-000, Brazil
- André C P L F de Carvalho
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil

4. Stevens DM, Tang A, Coaker G. A Genetic Toolkit for Investigating Clavibacter Species: Markerless Deletion, Permissive Site Identification, and an Integrative Plasmid. Molecular Plant-Microbe Interactions: MPMI 2021; 34:1336-1345. [PMID: 34890250 DOI: 10.1094/mpmi-07-21-0171-ta] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 06/13/2023]
Abstract
The development of knockout mutants and expression variants is critical for understanding genotype-phenotype relationships. However, advances in these techniques in gram-positive actinobacteria have stagnated over the last decade. Actinobacteria in the Clavibacter genus are composed of diverse crop pathogens that cause a variety of wilt and cankering diseases. Here, we present a suite of tools for genetic manipulation in the tomato pathogen Clavibacter michiganensis including a markerless deletion system, an integrative plasmid, and an R package for identification of permissive sites for plasmid integration. The vector pSelAct-KO is a recombination-based, markerless knockout system that uses dual selection to engineer seamless deletions of a region of interest, providing opportunities for repeated higher-order genetic knockouts. The efficacy of pSelAct-KO was demonstrated in C. michiganensis and was confirmed using whole-genome sequencing. We developed permissR, an R package to identify permissive sites for chromosomal integration, which can be used in conjunction with pSelAct-Express, a nonreplicating integrative plasmid that enables recombination into a permissive genomic location. Expression of enhanced green fluorescent protein by pSelAct-Express was verified in two candidate permissive regions predicted by permissR in C. michiganensis. These molecular tools are essential advances for investigating gram-positive actinobacteria, particularly for important pathogens in the Clavibacter genus. Copyright © 2021 The Author(s). This is an open access article distributed under the CC BY-NC-ND 4.0 International license.
Affiliation(s)
- Danielle M Stevens
  - Integrative Genetics and Genomics Graduate Group, University of California, Davis, Davis, CA 95616, U.S.A
  - Department of Plant Pathology, University of California, Davis, Davis, CA 95616, U.S.A
- Andrea Tang
  - Department of Plant Pathology, University of California, Davis, Davis, CA 95616, U.S.A
- Gitta Coaker
  - Department of Plant Pathology, University of California, Davis, Davis, CA 95616, U.S.A

5. Information Entropy in Chemistry: An Overview. Entropy 2021; 23:e23101240. [PMID: 34681964 PMCID: PMC8534366 DOI: 10.3390/e23101240] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 08/12/2021] [Revised: 09/19/2021] [Accepted: 09/20/2021] [Indexed: 12/20/2022]
Abstract
Basic applications of the information entropy concept to chemical objects are reviewed. These applications deal with quantifying chemical and electronic structures of molecules, signal processing, structural studies on crystals, and molecular ensembles. Recent advances in the mentioned areas make information entropy a central concept in interdisciplinary studies on digitalizing chemical reactions, chemico-information synthesis, crystal engineering, as well as digitally rethinking basic notions of structural chemistry in terms of informatics.

6. Lambrou GI, Zaravinos A, Ioannidou P, Koutsouris D. Information, Thermodynamics and Life: A Narrative Review. Applied Sciences 2021; 11:3897. [DOI: 10.3390/app11093897] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 05/14/2025]
Abstract
Information is probably one of the most difficult physical quantities to comprehend. This applies not only to the very definition of information, but also to the physical entity of information, meaning how it can be quantified and measured. In recent years, information theory and its function in systems have been an intense field of study, due to the large increase of available information technology, where the notion of the bit dominated the information discipline. Information theory also expanded from the “simple” “bit” to the quantal “qubit”, which added more variables for consideration. One of the main applications of information theory could be considered the field of “autonomy”, which is the main characteristic of living organisms in nature since they all have self-sustainability, motion and self-protection. These traits, along with the ability to be aware of existence, make it difficult and complex to simulate in artificial constructs. There are many approaches to the concept of simulating autonomous behavior, yet there is no conclusive approach to a definite solution to this problem. Recent experimental results have shown that the interaction between machines and neural cells is possible and that it constitutes a significant tool for the study of complex systems. The present work tries to review the question of the interactions between information and life. It attempts to build a connection between information and thermodynamics in terms of energy consumption and work production, as well as present some possible applications of these physical quantities.
Affiliation(s)
- George I. Lambrou
  - Choremeio Research Laboratory, First Department of Pediatrics, National and Kapodistrian University of Athens, Thivon & Levadeias 8, Goudi, 11527 Athens, Greece
  - Biomedical Engineering Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Heroon Polytechneiou 9, Zografou, 15780 Athens, Greece
- Apostolos Zaravinos
  - Department of Basic Medical Sciences, College of Medicine, Member of QU Health, Qatar University, Doha P.O. Box 2713, Qatar
- Penelope Ioannidou
  - Biomedical Engineering Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Heroon Polytechneiou 9, Zografou, 15780 Athens, Greece
- Dimitrios Koutsouris
  - Biomedical Engineering Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Heroon Polytechneiou 9, Zografou, 15780 Athens, Greece

7. Ameri AJ, Lewis ZA. Shannon entropy as a metric for conditional gene expression in Neurospora crassa. G3-Genes Genomes Genetics 2021; 11:6159613. [PMID: 33751112 PMCID: PMC8049430 DOI: 10.1093/g3journal/jkab055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2020] [Accepted: 02/09/2021] [Indexed: 12/04/2022]
Abstract
Neurospora crassa has been an important model organism for molecular biology and genetics for over 60 years. Neurospora crassa has a complex life cycle, with over 28 distinct cell types and is capable of transcriptional responses to many environmental conditions including nutrient availability, temperature, and light. To quantify variation in N. crassa gene expression, we analyzed public expression data from 97 conditions and calculated the Shannon Entropy value for Neurospora’s approximately 11,000 genes. Entropy values can be used to estimate the variability in expression for a single gene over a range of conditions and be used to classify individual genes as constitutive or condition-specific. Shannon entropy has previously been used to measure the degree of tissue specificity of multicellular plant or animal genes. We use this metric here to measure variable gene expression in a microbe and provide this information as a resource for the N. crassa research community. Finally, we demonstrate the utility of this approach by using entropy values to identify genes with constitutive expression across a wide range of conditions and to identify genes that are activated exclusively during sexual development.
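The entropy metric described above can be sketched as follows: a gene's expression values across conditions are normalized to a probability distribution, and its Shannon entropy in bits is computed. A value near log2(number of conditions) suggests constitutive expression; a value near 0 suggests condition-specific expression. The function name, the simple sum normalization, and the handling of unexpressed genes are assumptions for illustration, not details taken from the paper.

```python
import math


def expression_entropy(values):
    """Shannon entropy (bits) of one gene's expression across conditions.

    values: non-negative expression levels, one per condition.
    Assumption: a gene with zero total expression is reported as 0.0.
    """
    total = sum(values)
    if total <= 0:
        return 0.0
    probs = [v / total for v in values if v > 0]
    return -sum(p * math.log2(p) for p in probs)


if __name__ == "__main__":
    constitutive = [5.0, 5.0, 5.0, 5.0]        # same level everywhere
    condition_specific = [10.0, 0.0, 0.0, 0.0]  # on in one condition only
    print(expression_entropy(constitutive))
    print(expression_entropy(condition_specific))
```

With 97 conditions, as in the study, the maximum attainable value would be log2(97), roughly 6.6 bits.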
Affiliation(s)
- Abigail J Ameri
  - Department of Microbiology, University of Georgia, Athens, GA 30602, USA
- Zachary A Lewis
  - Department of Microbiology, University of Georgia, Athens, GA 30602, USA

8. Sirén K, Millard A, Petersen B, Gilbert M, Clokie MRJ, Sicheritz-Pontén T. Rapid discovery of novel prophages using biological feature engineering and machine learning. NAR Genom Bioinform 2021; 3:lqaa109. [PMID: 33575651 PMCID: PMC7787355 DOI: 10.1093/nargab/lqaa109] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 08/04/2020] [Revised: 12/07/2020] [Accepted: 12/11/2020] [Indexed: 01/10/2023]
Abstract
Prophages are phages that are integrated into bacterial genomes and which are key to understanding many aspects of bacterial biology. Their extreme diversity means they are challenging to detect using sequence similarity, yet this remains the paradigm and thus many phages remain unidentified. We present a novel, fast and generalizing machine learning method based on feature space to facilitate novel prophage discovery. To validate the approach, we reanalyzed publicly available marine viromes and single-cell genomes using our feature-based approaches and found consistently more phages than were detected using current state-of-the-art tools while being notably faster. This demonstrates that our approach significantly enhances bacteriophage discovery and thus provides a new starting point for exploring new biologies.
Affiliation(s)
- Kimmo Sirén
  - Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, 1353 Denmark
- Andrew Millard
  - Department of Genetics and Genome Biology, University of Leicester, LE1 7RH Leicester, UK
- Bent Petersen
  - Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, 1353 Denmark
  - Centre of Excellence for Omics-Driven Computational Biodiscovery, AIMST University, 08100 Kedah, Malaysia
- M Thomas P Gilbert
  - Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, 1353 Denmark
  - Center for Evolutionary Hologenomics, The GLOBE Institute, University of Copenhagen, 1353 Copenhagen, Denmark
  - University Museum, NTNU, 7012 Trondheim, Norway
- Martha R J Clokie
  - Department of Genetics and Genome Biology, University of Leicester, LE1 7RH Leicester, UK
- Thomas Sicheritz-Pontén
  - Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, 1353 Denmark
  - Centre of Excellence for Omics-Driven Computational Biodiscovery, AIMST University, 08100 Kedah, Malaysia

9. Sengupta DC, Hill MD, Benton KR, Banerjee HN. Similarity Studies of Corona Viruses through Chaos Game Representation. Computational Molecular Bioscience 2020; 10:61-72. [PMID: 32953249 PMCID: PMC7497811 DOI: 10.4236/cmb.2020.103004] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/20/2022]
Abstract
The novel coronavirus (SARS-CoV-2), generally referred to as the Covid-19 virus, has spread to 213 countries with nearly 7 million confirmed cases and nearly 400,000 deaths. Such major outbreaks demand classification of the viral genomic sequence and identification of its origin, for planning, containment, and treatment. Motivated by this need, we report two alignment-free methods that combine with chaos game representation (CGR) to perform clustering analysis and create a phylogenetic tree based on it. To each DNA sequence we associate a matrix, then define the distance between two DNA sequences to be the distance between their associated matrices. These methods are used for phylogenetic analysis of coronavirus sequences. Our approach provides a powerful tool for analyzing and annotating genomes and their phylogenetic relationships. We also compare our tool to the ClustalX algorithm, which is one of the most popular alignment methods. Our alignment-free methods are shown to be capable of finding the closest genetic relatives of coronaviruses.
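The idea of associating a matrix with each DNA sequence and comparing matrices can be sketched with a frequency chaos game representation (FCGR): each k-mer is mapped to a cell of a 2^k x 2^k grid and counted. The corner assignment (A, C, G, T), the k-mer size, and the Euclidean distance between normalized grids are assumptions chosen for illustration; the abstract does not state which matrix or distance the authors use.

```python
import math

# Assumed corner layout (x, y); other conventions exist.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k):
    """Count k-mers of seq on a 2^k x 2^k chaos game grid."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip k-mers with ambiguous bases such as N
        x = y = 0
        for j, base in enumerate(kmer):
            bx, by = CORNERS[base]
            # Later bases select the larger quadrant (higher-order bit).
            x += bx << j
            y += by << j
        grid[y][x] += 1
    return grid


def normalize(grid):
    """Scale counts to relative frequencies so sequence length cancels out."""
    total = sum(sum(row) for row in grid)
    if total == 0:
        return [[0.0] * len(row) for row in grid]
    return [[c / total for c in row] for row in grid]


def matrix_distance(a, b):
    """Euclidean (Frobenius) distance between two equally sized grids."""
    return math.sqrt(sum((x - y) ** 2
                         for ra, rb in zip(a, b)
                         for x, y in zip(ra, rb)))


if __name__ == "__main__":
    g1 = normalize(fcgr("ACGTACGTACGGTTAC", 2))
    g2 = normalize(fcgr("AAAATTTTAAAATTTT", 2))
    print(matrix_distance(g1, g2))
```

A pairwise distance matrix built this way could then be passed to any standard clustering or tree-building routine.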
Affiliation(s)
- Dipendra C Sengupta
  - Department of Mathematics, Computer Science & Engineering Technology, Elizabeth City State University, Elizabeth City, North Carolina, USA
- Matthew D Hill
  - Department of Mathematics, Computer Science & Engineering Technology, Elizabeth City State University, Elizabeth City, North Carolina, USA
- Kevin R Benton
  - Department of Mathematics, Computer Science & Engineering Technology, Elizabeth City State University, Elizabeth City, North Carolina, USA
- Hirendra N Banerjee
  - Department of Natural Sciences, Elizabeth City State University, Elizabeth City, North Carolina, USA

10. McNair K, Aziz RK, Pusch GD, Overbeek R, Dutilh BE, Edwards R. Phage Genome Annotation Using the RAST Pipeline. Methods Mol Biol 2018; 1681:231-238. [PMID: 29134599 DOI: 10.1007/978-1-4939-7343-9_17] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Indexed: 12/18/2022]
Abstract
Phages are complex biomolecular machineries that have to survive in a bacterial world. Phage genomes show many adaptations to their lifestyle such as shorter genes, reduced capacity for redundant DNA sequences, and the inclusion of tRNAs in their genomes. In addition, phages are not free-living; they require a host for replication and survival. These unique adaptations provide challenges for the bioinformatics analysis of phage genomes. In particular, ORF calling, genome annotation, noncoding RNA (ncRNA) identification, and the identification of transposons and insertions are all complicated in phage genome analysis. We provide a road map through the phage genome annotation pipeline, and discuss the challenges and solutions for phage genome annotation as we have implemented them in the rapid annotation using subsystems (RAST) pipeline.
Affiliation(s)
- Katelyn McNair
  - Computational Sciences Research Center, San Diego State University, 5500 Campanile Drive, San Diego, CA, 92182, USA
- Ramy Karam Aziz
  - Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, 11562, Egypt
  - Argonne National Laboratory, 9700 S. Cass Ave, Argonne, IL, 60439, USA
- Gordon D Pusch
  - Argonne National Laboratory, 9700 S. Cass Ave, Argonne, IL, 60439, USA
- Ross Overbeek
  - Argonne National Laboratory, 9700 S. Cass Ave, Argonne, IL, 60439, USA
- Bas E Dutilh
  - Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3584, Utrecht, The Netherlands
  - Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Centre, Geert Grooteplein 28, 6525, Nijmegen, The Netherlands
- Robert Edwards
  - Computational Sciences Research Center, San Diego State University, 5500 Campanile Drive, San Diego, CA, 92182, USA
  - Departments of Biology and Computer Science, San Diego State University, 5500 Campanile Drive, San Diego, CA, 92182, USA

11. Akhter S, Aziz RK, Kashef MT, Ibrahim ES, Bailey B, Edwards RA. Kullback Leibler divergence in complete bacterial and phage genomes. PeerJ 2017; 5:e4026. [PMID: 29204318 PMCID: PMC5712468 DOI: 10.7717/peerj.4026] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/30/2017] [Accepted: 10/22/2017] [Indexed: 12/11/2022]
Abstract
The amino acid content of the proteins encoded by a genome may predict the coding potential of that genome and may reflect lifestyle restrictions of the organism. Here, we calculated the Kullback–Leibler divergence from the mean amino acid content as a metric to compare the amino acid composition for a large set of bacterial and phage genome sequences. Using these data, we demonstrate that (i) there is a significant difference between amino acid utilization in different phylogenetic groups of bacteria and phages; (ii) many of the bacteria with the most skewed amino acid utilization profiles, or the bacteria that host phages with the most skewed profiles, are endosymbionts or parasites; (iii) the skews in the distribution are not restricted to certain metabolic processes but are common across all bacterial genomic subsystems; (iv) amino acid utilization profiles strongly correlate with GC content in bacterial genomes but very weakly correlate with the G+C percent in phage genomes. These findings might be exploited to distinguish coding from non-coding sequences in large data sets, such as metagenomic sequence libraries, to help in prioritizing subsequent analyses.
Affiliation(s)
- Sajia Akhter
  - Computational Science Research Center, San Diego State University, San Diego, CA, USA
- Ramy K Aziz
  - Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt
  - Department of Computer Science, San Diego State University, San Diego, CA, United States of America
- Mona T Kashef
  - Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt
- Eslam S Ibrahim
  - Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt
- Barbara Bailey
  - Department of Mathematics & Statistics, San Diego State University, San Diego, CA, USA
- Robert A Edwards
  - Computational Science Research Center, San Diego State University, San Diego, CA, USA
  - Department of Computer Science, San Diego State University, San Diego, CA, United States of America
  - Department of Mathematics & Statistics, San Diego State University, San Diego, CA, USA
  - Department of Biology, San Diego State University, San Diego, CA, USA

12. Tripathi R, Patel S, Kumari V, Chakraborty P, Varadwaj PK. DeepLNC, a long non-coding RNA prediction tool using deep neural network. Network Modeling Analysis in Health Informatics and Bioinformatics 2016. [DOI: 10.1007/s13721-016-0129-2] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Indexed: 11/28/2022]

13. Sangwan N, Xia F, Gilbert JA. Recovering complete and draft population genomes from metagenome datasets. Microbiome 2016; 4:8. [PMID: 26951112 PMCID: PMC4782286 DOI: 10.1186/s40168-016-0154-5] [Citation(s) in RCA: 159] [Impact Index Per Article: 17.7] [Received: 06/29/2015] [Accepted: 02/05/2016] [Indexed: 05/03/2023]
Abstract
Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost on application to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins (i.e., sequences from multiple species). Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on genome-wide evolution.
Affiliation(s)
- Naseer Sangwan
  - Biosciences Division (BIO), Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL, 60439, USA
  - Department of Surgery, University of Chicago, 5841 South Maryland Avenue, MC 5029, Chicago, IL, 60637, USA
- Fangfang Xia
  - Computing, Environment and Life Sciences, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL, 60439, USA
- Jack A Gilbert
  - Biosciences Division (BIO), Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL, 60439, USA
  - Department of Ecology and Evolution, University of Chicago, 1101 E 57th Street, Chicago, IL, 60637, USA
  - Department of Surgery, University of Chicago, 5841 South Maryland Avenue, MC 5029, Chicago, IL, 60637, USA
  - Marine Biological Laboratory, 7 MBL Street, Woods Hole, MA, 02543, USA

14. Nigatu D, Henkel W, Sobetzko P, Muskhelishvili G. Relationship between digital information and thermodynamic stability in bacterial genomes. EURASIP Journal on Bioinformatics & Systems Biology 2016; 2016:4. [PMID: 26877724 PMCID: PMC4740571 DOI: 10.1186/s13637-016-0037-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 09/14/2015] [Accepted: 01/19/2016] [Indexed: 02/06/2023]
Abstract
Ever since the introduction of the Watson-Crick model, numerous efforts have been made to fully characterize the digital information content of the DNA. However, it became increasingly evident that variations of DNA configuration also provide an “analog” type of information related to the physicochemical properties of the DNA, such as thermodynamic stability and supercoiling. Hence, the parallel investigation of the digital information contained in the base sequence with associated analog parameters is very important for understanding the coding capacity of the DNA. In this paper, we represent analog information by its thermodynamic stability and compare it with digital information using Shannon and Gibbs entropy measures on the complete genome sequences of several bacteria, including Escherichia coli (E. coli), Bacillus subtilis (B. subtilis), Streptomyces coelicolor (S. coelicolor), and Salmonella typhimurium (S. typhimurium). Furthermore, the link to the broader classes of functional gene groups (anabolic and catabolic) is examined. The results demonstrate the couplings between thermodynamic stability and digital sequence organization in the bacterial genomes. In addition, our data suggest a determinative role of the genome-wide distribution of DNA thermodynamic stability in the spatial organization of functional gene groups.
Collapse
Affiliation(s)
- Dawit Nigatu
  - Transmission Systems Group, School of Engineering and Science, Jacobs University Bremen, Campus Ring 1, Bremen, 28759 Germany
- Werner Henkel
  - Transmission Systems Group, School of Engineering and Science, Jacobs University Bremen, Campus Ring 1, Bremen, 28759 Germany
- Patrick Sobetzko
  - Philipps-Universität Marburg, LOEWE-Zentrum für Synthetische Mikrobiologie, Hans-Meerwein-Straße, Mehrzweckgebäude, Marburg, 35043 Germany
- Georgi Muskhelishvili
  - Microbiologie, Adaptation, Pathogénie, UMR5240 CNRS-UCBL-INSA-BayerCropScience, Lyon, France; Jacobs University Bremen, Campus Ring 1, Bremen, 28759 Germany
|
15
|
Borozan I, Watt S, Ferretti V. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification. Bioinformatics 2015; 31:1396-404. [PMID: 25573913 PMCID: PMC4410667 DOI: 10.1093/bioinformatics/btv006] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 06/05/2014] [Accepted: 01/05/2015] [Indexed: 01/02/2023]
Abstract
MOTIVATION: Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim of improving the accuracy with which DNA and protein sequences are characterized. RESULTS: Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure than with another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. AVAILABILITY AND IMPLEMENTATION: All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. CONTACT: ivan.borozan@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
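The idea of a combined score with per-sequence weights can be sketched as a normalized weighted sum over similarity measures. The abstract does not say how the weights are derived from the training set, so the sketch below simply takes them as given; `combined_score`, `classify`, the measure names, and the toy numbers are hypothetical and are not the authors' code.

```python
def combined_score(scores: dict, weights: dict) -> float:
    """Weighted sum of per-measure similarity scores.

    `scores` and `weights` map a measure name to a number. Weights are
    normalized to sum to 1. How weights are learned is NOT shown here;
    they are an input assumed to come from some training step.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(weights[m] / total * scores[m] for m in weights)


def classify(candidate_scores: dict, weights: dict) -> str:
    """Return the candidate label with the highest combined score.

    `candidate_scores` maps label -> {measure name -> similarity score}
    for one query sequence; `weights` is that query's own weight set.
    """
    return max(
        candidate_scores,
        key=lambda label: combined_score(candidate_scores[label], weights),
    )
```

With toy scores such as `{"A": {"align": 0.9, "kmer": 0.2}, "B": {"align": 0.4, "kmer": 0.8}}`, a query weighted toward the alignment-based measure is assigned to A, while one weighted toward the alignment-free measure is assigned to B, which is the behavior the per-sequence weighting is meant to allow.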
Affiliation(s)
- Ivan Borozan
  - Department of Informatics and Bio-computing, Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street, Suite 800, Toronto, Ontario, Canada
- Stuart Watt
  - Department of Informatics and Bio-computing, Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street, Suite 800, Toronto, Ontario, Canada
- Vincent Ferretti
  - Department of Informatics and Bio-computing, Ontario Institute for Cancer Research, MaRS Centre, South Tower, 101 College Street, Suite 800, Toronto, Ontario, Canada
|
16
|
Hepatitis C Virus (HCV) NS3 sequence diversity and antiviral resistance-associated variant frequency in HCV/HIV coinfection. Antimicrob Agents Chemother 2014; 58:6079-92. [PMID: 25092699 DOI: 10.1128/aac.03466-14] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Indexed: 01/26/2023] Open
Abstract
HIV coinfection accelerates disease progression in chronic hepatitis C and reduces sustained antiviral responses (SVR) to interferon-based therapy. New direct-acting antivirals (DAAs) promise higher SVR rates, but the selection of preexisting resistance-associated variants (RAVs) may lead to virologic breakthrough or relapse. Thus, pretreatment frequencies of RAVs are likely determinants of treatment outcome but typically are below levels at which the viral sequence can be accurately resolved. Moreover, it is not known how HIV coinfection influences RAV frequency. We adopted an accurate high-throughput sequencing strategy to compare nucleotide diversity in HCV NS3 protease-coding sequences in 20 monoinfected and 20 coinfected subjects with well-controlled HIV infection. Differences in mean pairwise nucleotide diversity (π), Tajima's D statistic, and Shannon entropy index suggested that the genetic diversity of HCV is reduced in coinfection. Among coinfected subjects, diversity correlated positively with increases in CD4+ T cells on antiretroviral therapy, suggesting T cell responses are important determinants of diversity. At a median sequencing depth of 0.084%, preexisting RAVs were readily identified. Q80K, which negatively impacts clinical responses to simeprevir, was encoded by more than 99% of viral RNAs in 17 of the 40 subjects. RAVs other than Q80K were identified in 39 of 40 subjects, mostly at frequencies near 0.1%. RAV frequency did not differ significantly between monoinfected and coinfected subjects. We conclude that HCV genetic diversity is reduced in patients with well-controlled HIV infection, likely reflecting impaired T cell immunity. However, RAV frequency is not increased and should not adversely influence the outcome of DAA therapy.
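Mean pairwise nucleotide diversity (π), one of the statistics named in this abstract, is the average proportion of sites that differ between two sequences drawn from the sample. The following is a minimal sketch under simplifying assumptions: sequences are already aligned, have equal length, and each counts once (no weighting by variant frequency or error correction, as a deep-sequencing analysis would need). The function name `pairwise_diversity` is chosen for the example.

```python
from itertools import combinations


def pairwise_diversity(seqs: list) -> float:
    """Mean pairwise nucleotide diversity (pi) of aligned sequences.

    Assumes equal-length, pre-aligned sequences; gaps and ambiguous
    bases are compared as ordinary characters in this sketch.
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")

    total = 0.0
    n_pairs = 0
    for a, b in combinations(seqs, 2):
        # Proportion of sites at which this pair differs
        total += sum(x != y for x, y in zip(a, b)) / length
        n_pairs += 1
    return total / n_pairs
```

For the three toy reads `["ACGT", "ACGA", "ACGT"]`, two of the three pairs differ at one of four sites, giving π = (0.25 + 0 + 0.25) / 3 = 1/6.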
|
17
|
Ryabov EV, Wood GR, Fannon JM, Moore JD, Bull JC, Chandler D, Mead A, Burroughs N, Evans DJ. A virulent strain of deformed wing virus (DWV) of honeybees (Apis mellifera) prevails after Varroa destructor-mediated, or in vitro, transmission. PLoS Pathog 2014; 10:e1004230. [PMID: 24968198 PMCID: PMC4072795 DOI: 10.1371/journal.ppat.1004230] [Citation(s) in RCA: 237] [Impact Index Per Article: 21.5] [Received: 03/28/2014] [Accepted: 04/30/2014] [Indexed: 02/06/2023] Open
Abstract
The globally distributed ectoparasite Varroa destructor is a vector for viral pathogens of the Western honeybee (Apis mellifera), in particular the Iflavirus Deformed Wing Virus (DWV). In the absence of Varroa, low levels of DWV occur, generally causing asymptomatic infections. Conversely, Varroa-infested colonies show markedly elevated virus levels and increased overwintering colony losses, with impairment of pupal development and symptomatic workers. To determine whether changes in the virus population were due to Varroa amplifying and introducing virulent virus strains and/or suppressing the host immune responses, we exposed Varroa-naïve larvae to oral and Varroa-transmitted DWV. We monitored virus levels and diversity in developing pupae and associated Varroa, the resulting RNAi response and transcriptome changes in the host. Exposed pupae were stratified by Varroa association (presence/absence) and virus levels (low/high) into three groups. Varroa-free pupae all exhibited low levels of a highly diverse DWV population, with those exposed per os (group NV) exhibiting changes in the population composition. Varroa-associated pupae exhibited either low levels of a diverse DWV population (group VL) or high levels of a near-clonal virulent variant of DWV (group VH). These groups and unexposed controls (C) could also be discriminated by principal component analysis of the transcriptome changes observed, which included several genes involved in development and the immune response. All Varroa tested contained a diverse replicating DWV population, implying that the virulent variant present in group VH, and predominating in RNA-seq analysis of temporally and geographically separate Varroa-infested colonies, was selected upon transmission from Varroa, a conclusion supported by direct injection of pupae in vitro with mixed virus populations.
Identification of a virulent variant of DWV, the role of Varroa in its transmission and the resulting host transcriptome changes further our understanding of this important viral pathogen of honeybees. Honeybees are the most important managed pollinating insect, contributing billions of dollars to annual global agricultural production. Over the last century a parasitic mite, Varroa, has spread worldwide, with significant impacts on honeybee colony health as a consequence of its transmission of a cocktail of viruses while feeding on honeybee ‘blood’. The most important virus for colony health is deformed wing virus (DWV), high levels of which cause developmental deformities and premature ageing, resulting in high overwintering colony losses. In experiments on individual Varroa-exposed pupae we demonstrate that a single type of virulent DWV is amplified 1,000–10,000 times in the recipient pupae, despite the mite containing a high diversity of replicating DWV strains. We could recapitulate this by direct injection of pupae with mixed virus populations, showing that the virulent strain is advantaged by the route of transmission. In parallel, we detected changes in the immune response and developmental gene expression of the honeybee and propose that these contribute to the characteristic pathogenesis of DWV. Identification of a virulent strain of DWV has implications for therapeutic or prophylactic interventions to improve honeybee colony health, as well as contributing to our understanding of the biology of this important honeybee viral pathogen.
Affiliation(s)
- Eugene V. Ryabov
  - School of Life Sciences, University of Warwick, Coventry, United Kingdom
- Graham R. Wood
  - Warwick Systems Biology Centre, University of Warwick, Coventry, United Kingdom
- Jessica M. Fannon
  - School of Life Sciences, University of Warwick, Coventry, United Kingdom
- Jonathan D. Moore
  - Warwick Systems Biology Centre, University of Warwick, Coventry, United Kingdom
- James C. Bull
  - School of Life Sciences, University of Warwick, Coventry, United Kingdom
- Dave Chandler
  - Life Sciences & Warwick Crop Centre, University of Warwick, Wellesbourne, Warwickshire, United Kingdom
- Andrew Mead
  - School of Life Sciences, University of Warwick, Coventry, United Kingdom
- Nigel Burroughs
  - Warwick Systems Biology Centre, University of Warwick, Coventry, United Kingdom
- David J. Evans
  - School of Life Sciences, University of Warwick, Coventry, United Kingdom
|
18
|
Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014; 15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 71] [Impact Index Per Article: 6.5] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023] Open
Abstract
Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
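Mutual information, one of the IT concepts this review names, can be illustrated on a single sequence by measuring how much the symbol at position i tells us about the symbol at position i + lag. This is a generic plug-in (frequency-count) estimate written for illustration, not a method taken from the review; the function name `lagged_mutual_information` is an assumption, and plug-in estimates are known to be biased upward on short sequences.

```python
from collections import Counter
from math import log2


def lagged_mutual_information(seq: str, lag: int) -> float:
    """Plug-in estimate (bits) of I(X_i; X_{i+lag}) within one sequence."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    pairs = [(seq[i], seq[i + lag]) for i in range(len(seq) - lag)]
    if not pairs:
        raise ValueError("sequence is too short for this lag")

    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)    # marginal of the first symbol
    right = Counter(b for _, b in pairs)   # marginal of the lagged symbol

    # I = sum p(a,b) * log2( p(a,b) / (p(a) * p(b)) )
    return sum(
        (c / n) * log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )
```

A homopolymer such as `"AAAA"` gives zero at any valid lag, whereas a strictly alternating sequence such as `"ACACACAC"` gives a value close to 1 bit at lag 1, because each symbol fully determines its neighbor.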
Affiliation(s)
- Susana Vinga
  - IDMEC, Instituto Superior Técnico - Universidade de Lisboa (IST-UL), Av. Rovisco Pais, 1049-001 Lisboa, Portugal. Tel.: +351-218419504; Fax: +351-218498097
|
19
|
Carbone A. Information measure for long-range correlated sequences: the case of the 24 human chromosomes. Sci Rep 2014; 3:2721. [PMID: 24056670 PMCID: PMC3779848 DOI: 10.1038/srep02721] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 03/25/2013] [Accepted: 09/04/2013] [Indexed: 01/14/2023] Open
Abstract
A new approach to estimating the Shannon entropy of a long-range correlated sequence is proposed. The entropy is written as the sum of two terms corresponding respectively to power-law (ordered) and exponentially (disordered) distributed blocks (clusters). The approach is illustrated on the 24 human chromosome sequences by taking the nucleotide composition as the relevant information to be encoded/decoded. Interestingly, the nucleotide composition of the ordered clusters is found, on average, to be comparable to that of the whole analyzed sequence, while that of the disordered clusters fluctuates. From the information theory standpoint, this means that the power-law correlated clusters carry the same information as the whole analyzed sequence. Furthermore, the fluctuations of the nucleotide composition of the disordered clusters are linked to relevant biological properties, such as segmental duplications and gene density.
Affiliation(s)
- A Carbone
  - [1] Politecnico di Torino, Italy; [2] ISC-CNR, Unità Università 'La Sapienza' di Roma, Italy; [3] ETH Zurich, Switzerland
|