1
Liu J, Zhou D. Minimum Functional Length Analysis of K-Mer Based on BPNN. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:2920-2925. [PMID: 34310316 DOI: 10.1109/tcbb.2021.3098512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
The BP neural network (BPNN), a multilayer feed-forward network, can realize deep cognition of target data and high accuracy of output results. However, there has been no related research on k-mers based on BPNN. In the present study, BPNN was used to train and test binary classification data for each classification mode. All k-mers were divided into two categories according to either the X + Y content or a completely random mode. Results showed that 1) for the X + Y content classification mode, the accuracy of k-mer classification was 100 percent, whether k ≤ 6 or k ≥ 7; 2) for the completely random classification mode, the accuracy was 100 percent for k-mers with k ≤ 6, but for k-mers with k ≥ 7 the accuracy was less than 100 percent and gradually decreased (approaching 50 percent) as k increased. The k-mers with k ≥ 7 should be the basic functional fragments of nucleic acid and perform basic nucleic acid functions in the DNA sequence. The k-mers with k ≤ 6 should be the basic component fragments of nucleic acid and no longer perform basic nucleic acid functions.
2
Bonomo M, Giancarlo R, Greco D, Rombo SE. Topological ranks reveal functional knowledge encoded in biological networks: a comparative analysis. Brief Bioinform 2022; 23:6563936. [PMID: 35381599 DOI: 10.1093/bib/bbac101] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/25/2021] [Revised: 01/31/2022] [Accepted: 02/28/2022] [Indexed: 12/21/2022] Open
Abstract
MOTIVATION Biological networks topology yields important insights into biological function, occurrence of diseases and drug design. In the last few years, different types of topological measures have been introduced and applied to infer the biological relevance of network components/interactions, according to their position within the network structure. Although comparisons of such measures have been previously proposed, to what extent the topology per se may lead to the extraction of novel biological knowledge has never been critically examined nor formalized in the literature. RESULTS We present a comparative analysis of nine outstanding topological measures, based on compact views obtained from the rank they induce on a given input biological network. The goal is to understand their ability in correctly positioning nodes/edges in the rank, according to the functional knowledge implicitly encoded in biological networks. To this aim, both internal and external (gold standard) validation criteria are taken into account, and six networks involving three different organisms (yeast, worm and human) are included in the comparison. The results show that a distinct handful of best-performing measures can be identified for each of the considered organisms, independently from the reference gold standard. AVAILABILITY Input files and code for the computation of the considered topological measures and K-haus distance are available at https://gitlab.com/MaryBonomo/ranking. CONTACT simona.rombo@unipa.it. SUPPLEMENTARY INFORMATION Supplementary data are available at Briefings in Bioinformatics online.
Affiliation(s)
- Mariella Bonomo
  - Department of Engineering, University of Palermo, Palermo, 90121, Italy
- Raffaele Giancarlo
  - Department of Mathematics and Computer Science, University of Palermo, Palermo, 90121, Italy
- Daniele Greco
  - Department of Mathematics and Computer Science, University of Palermo, Palermo, 90121, Italy
- Simona E Rombo
  - Department of Mathematics and Computer Science, University of Palermo, Palermo, 90121, Italy
3
Cattaneo G, Ferraro Petrillo U, Giancarlo R, Palini F, Romualdi C. The power of word-frequency-based alignment-free functions: a comprehensive large-scale experimental analysis. Bioinformatics 2022; 38:925-932. [PMID: 34718420 DOI: 10.1093/bioinformatics/btab747] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 07/02/2021] [Revised: 10/07/2021] [Accepted: 10/26/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Alignment-free (AF) distance/similarity functions are a key tool for sequence analysis. Experimental studies on real datasets abound and, to some extent, there are also studies regarding their control of false positive rate (Type I error). However, assessment of their power, i.e. their ability to identify true similarity, has been limited to some members of the D2 family. The corresponding experimental studies have concentrated on short sequences, a scenario no longer adequate for current applications, where sequence lengths may vary considerably. Such a State of the Art is methodologically problematic, since information regarding a key feature such as power is either missing or limited. RESULTS By concentrating on a representative set of word-frequency-based AF functions, we perform the first coherent and uniform evaluation of the power, involving also Type I error for completeness. Two alternative models of important genomic features (CIS Regulatory Modules and Horizontal Gene Transfer), a wide range of sequence lengths from a few thousand to millions, and different values of k have been used. As a result, we provide a characterization of those AF functions that is novel and informative. Indeed, we identify weak and strong points of each function considered, which may be used as a guide to choose one for analysis tasks. Remarkably, of the 15 functions that we have considered, only four stand out, with small differences between small and short sequence length scenarios. Finally, to encourage the use of our methodology for validation of future AF functions, the Big Data platform supporting it is public. AVAILABILITY AND IMPLEMENTATION The software is available at: https://github.com/pipp8/power_statistics. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
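The D2 family mentioned in this abstract is built on word (k-mer) counts. As a minimal sketch, assuming the plain D2 statistic (the sum, over all k-mers, of the product of a word's counts in the two sequences), the following shows how such a word-frequency-based function can be computed. The helper names `kmer_counts` and `d2` are illustrative only and are not taken from the authors' software.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """Plain D2 statistic: sum over words of count_a(word) * count_b(word)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Words missing from seq_b contribute zero (Counter returns 0 for them).
    return sum(n * counts_b[word] for word, n in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "ACGTTTTT", 3))
```

Normalized variants of the D2 family rescale these counts by expected frequencies; that step is omitted here because the abstract does not specify it.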
Affiliation(s)
- Giuseppe Cattaneo
  - Dipartimento di Informatica, Università di Salerno, Fisciano, SA 84084, Italy
- Raffaele Giancarlo
  - Dipartimento di Matematica ed Informatica, Università di Palermo, 90133 Palermo, Italy
- Francesco Palini
  - Dipartimento di Scienze Statistiche, Università di Roma-La Sapienza, 00185 Rome, Italy
- Chiara Romualdi
  - Dipartimento di Biologia, Università di Padova, 35131 Padova, Italy
4
Bahai A, Asgari E, Mofrad MRK, Kloetgen A, McHardy AC. EpitopeVec: Linear Epitope Prediction Using Deep Protein Sequence Embeddings. Bioinformatics 2021; 37:4517-4525. [PMID: 34180989 PMCID: PMC8652027 DOI: 10.1093/bioinformatics/btab467] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 12/12/2020] [Revised: 05/28/2021] [Accepted: 06/25/2021] [Indexed: 11/19/2022] Open
Abstract
Motivation B-cell epitopes (BCEs) play a pivotal role in the development of peptide vaccines, immuno-diagnostic reagents and antibody production, and thus in infectious disease prevention and diagnostics in general. Experimental methods used to determine BCEs are costly and time-consuming. Therefore, it is essential to develop computational methods for the rapid identification of BCEs. Although several computational methods have been developed for this task, generalizability is still a major concern, where cross-testing of the classifiers trained and tested on different datasets has revealed accuracies of 51–53%. Results We describe a new method called EpitopeVec, which uses a combination of residue properties, modified antigenicity scales, and protein language model-based representations (protein vectors) as features of peptides for linear BCE predictions. Extensive benchmarking of EpitopeVec and other state-of-the-art methods for linear BCE prediction on several large and small datasets, as well as cross-testing, demonstrated an improvement in the performance of EpitopeVec over other methods in terms of accuracy and area under the curve. As the predictive performance depended on the species origin of the respective antigens (viral, bacterial and eukaryotic), we also trained our method on a large viral dataset to create a dedicated linear viral BCE predictor with improved cross-testing performance. Availability and implementation The software is available at https://github.com/hzi-bifo/epitope-prediction. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Akash Bahai
  - Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124 Braunschweig, Germany
  - Braunschweig Integrated Center of Systems Biology (BRICS), Technische Universität Braunschweig, Rebenring 56, 38106 Braunschweig
- Ehsaneddin Asgari
  - Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124 Braunschweig, Germany
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, 94720, USA
- Mohammad R K Mofrad
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, CA, 94720, USA
  - Molecular Biophysics and Integrated Bioimaging, Lawrence Berkeley National Lab, Berkeley, CA 94720, USA
- Andreas Kloetgen
  - Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124 Braunschweig, Germany
- Alice C McHardy
  - Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124 Braunschweig, Germany
  - Braunschweig Integrated Center of Systems Biology (BRICS), Technische Universität Braunschweig, Rebenring 56, 38106 Braunschweig
5
Petrillo UF, Palini F, Cattaneo G, Giancarlo R. Alignment-free Genomic Analysis via a Big Data Spark Platform. Bioinformatics 2021; 37:1658-1665. [PMID: 33471066 DOI: 10.1093/bioinformatics/btab014] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/22/2020] [Revised: 12/28/2020] [Accepted: 01/06/2021] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Alignment-free distance and similarity functions (AF functions, for short) are a well established alternative to pairwise and multiple sequence alignments for many genomic, metagenomic and epigenomic tasks. Due to data-intensive applications, the computation of AF functions is a Big Data problem, with the recent literature indicating that the development of fast and scalable algorithms computing AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. RESULTS We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It supports natively eighteen of the best performing AF functions coming out of a recent hallmark benchmarking study. FADE development and potential impact comprises novel aspects of interest. Namely, (a) a considerable effort of distributed algorithms, the most tangible result being a much faster execution time of reference methods like MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. About this, we provide a novel and much needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend the ones of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE. AVAILABILITY The software and the datasets are available at https://github.com/fpalini/fade. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Francesco Palini
  - Dipartimento di Scienze Statistiche, Università di Roma - La Sapienza, Rome, 00185, Italy
- Giuseppe Cattaneo
  - Dipartimento di Informatica, Università di Salerno, Fisciano (SA), 84084, Italy
- Raffaele Giancarlo
  - Dipartimento di Matematica ed Informatica, Università di Palermo, Palermo, 90133, Italy
6
Yang Z, Li H, Jia Y, Zheng Y, Meng H, Bao T, Li X, Luo L. Intrinsic laws of k-mer spectra of genome sequences and evolution mechanism of genomes. BMC Evol Biol 2020; 20:157. [PMID: 33228538 PMCID: PMC7684957 DOI: 10.1186/s12862-020-01723-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 06/04/2020] [Accepted: 11/10/2020] [Indexed: 11/17/2022] Open
Abstract
Background K-mer spectra of DNA sequences contain important information about sequence composition and sequence evolution. We want to reveal the evolution rules of genome sequences by studying the k-mer spectra of genome sequences. Results The intrinsic laws of k-mer spectra of 920 genome sequences from primate to prokaryote were analyzed. We found that there are two types of evolution selection modes in genome sequences, named as CG Independent Selection and TA Independent Selection. There is a mutual inhibition relationship between CG and TA independent selections. We found that the intensity of CG and TA independent selections correlates closely with genome evolution and G + C content of genome sequences. The living habits of species are related closely to the independent selection modes adopted by species genomes. Consequently, we proposed an evolution mechanism of genomes in which the genome evolution is determined by the intensities of the CG and TA independent selections and the mutual inhibition relationship. Besides, by the evolution mechanism of genomes, we speculated the evolution modes of prokaryotes in mild and extreme environments in the anaerobic age and the evolving process of prokaryotes from anaerobic to aerobic environment on earth as well as the originations of different eukaryotes. Conclusion We found that there are two independent selection modes in genome sequences. The evolution of genome sequence is determined by the two independent selection modes and the mutual inhibition relationship between them.
Affiliation(s)
- Zhenhua Yang
  - Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
  - School of Economics and Management, Inner Mongolia University of Science & Technology, Baotou, 014010, China
- Hong Li
  - Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
- Yun Jia
  - College of Science, Inner Mongolia University of Technology, Hohhot, 010051, China
- Yan Zheng
  - Baotou Medical College, Inner Mongolia University of Science & Technology, Baotou, 014040, China
- Hu Meng
  - School of Life Science & Technology, Inner Mongolia University of Science & Technology, Baotou, 014010, China
- Tonglaga Bao
  - Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
- Xiaolong Li
  - Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
- Liaofu Luo
  - Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
7
Giancarlo R, Rombo SE, Utro F. In vitro versus in vivo compositional landscapes of histone sequence preferences in eucaryotic genomes. Bioinformatics 2019; 34:3454-3460. [PMID: 30204840 DOI: 10.1093/bioinformatics/bty799] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 06/13/2018] [Accepted: 09/08/2018] [Indexed: 12/16/2022] Open
Abstract
Motivation Although the nucleosome occupancy along a genome can be in part predicted by in vitro experiments, it has been recently observed that the chromatin organization presents important differences in vitro with respect to in vivo. Such differences mainly regard the hierarchical and regular structures of the nucleosome fiber, whose existence has long been assumed, and in part also observed in vitro, but that does not apparently occur in vivo. It is also well known that the DNA sequence has a role in determining the nucleosome occupancy. Therefore, an important issue is to understand if, and to what extent, the structural differences in the chromatin organization between in vitro and in vivo have a counterpart in terms of the underlying genomic sequences. Results We present the first quantitative comparison between the in vitro and in vivo nucleosome maps of two model organisms (S. cerevisiae and C. elegans). The comparison is based on the construction of weighted k-mer dictionaries. Our findings show that there is a good level of sequence conservation between in vitro and in vivo in both the two organisms, in contrast to the abovementioned important differences in chromatin structural organization. Moreover, our results provide evidence that the two organisms predispose themselves differently, in terms of sequence composition and both in vitro and in vivo, for the nucleosome occupancy. This leads to the conclusion that, although the notion of a genome encoding for its own nucleosome occupancy is general, the intrinsic histone k-mer sequence preferences tend to be species-specific. Availability and implementation The files containing the dictionaries and the main results of the analysis are available at http://math.unipa.it/rombo/material. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Raffaele Giancarlo
  - Dipartimento di Matematica ed Informatica, Università degli Studi di Palermo, Palermo, Italy
- Simona E Rombo
  - Dipartimento di Matematica ed Informatica, Università degli Studi di Palermo, Palermo, Italy
- Filippo Utro
  - Computational Biology Center, IBM T. J. Watson Research, Yorktown Heights, NY, USA
8
Ferraro Petrillo U, Roscigno G, Cattaneo G, Giancarlo R. Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms. Bioinformatics 2019; 34:1826-1833. [PMID: 29342232 DOI: 10.1093/bioinformatics/bty018] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 08/02/2017] [Accepted: 01/09/2018] [Indexed: 02/03/2023] Open
Abstract
Motivation Information theoretic and compositional/linguistic analysis of genomes have a central role in bioinformatics, even more so since the associated methodologies are becoming very valuable also for epigenomic and meta-genomic studies. The kernel of those methods is based on the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}^k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data available now in applications demands to resort to parallel and distributed computing. Indeed, those types of algorithms have been developed to collect k-mer statistics in the realm of genome assembly. However, they are so specialized to this domain that they do not extend easily to the computation of informational and linguistic indices, concurrently on sets of genomes. Results Following the well-established approach in many disciplines, and with a growing success also in bioinformatics, to resort to MapReduce and Hadoop to deal with 'Big Data' problems, we present KCH, the first set of MapReduce algorithms able to perform concurrently informational and linguistic analysis of large collections of genomic sequences on a Hadoop cluster. The benchmarking of KCH that we provide indicates that it is quite effective and versatile. It is also competitive with respect to the parallel and distributed algorithms highly specialized to k-mer statistics collection for genome assembly problems. In conclusion, KCH is a much needed addition to the growing number of algorithms and tools that use MapReduce for bioinformatics core applications. Availability and implementation The software, including instructions for running it over Amazon AWS, as well as the datasets are available at http://www.di-srv.unisa.it/KCH. Contact umberto.ferraro@uniroma1.it. Supplementary information Supplementary data are available at Bioinformatics online.
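The k-mer statistics described in this abstract (how many times each k-mer over {A,C,G,T} occurs) can be illustrated with a small single-machine sketch written in a map/reduce style. This is not KCH and uses neither Hadoop nor its API; the function names `map_kmers`, `reduce_counts` and `kmer_statistics` are hypothetical, and skipping windows that contain characters outside A, C, G, T is an assumption made here for simplicity.

```python
from collections import Counter
from functools import reduce

DNA_ALPHABET = set("ACGT")


def map_kmers(seq: str, k: int) -> Counter:
    """Map step: partial k-mer counts for one sequence."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    # Assumption: windows with symbols outside {A,C,G,T} (e.g. N) are skipped.
    return Counter(w for w in windows if set(w) <= DNA_ALPHABET)


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial count tables."""
    return left + right


def kmer_statistics(seqs, k: int) -> Counter:
    """Combine per-sequence counts into one table for the whole collection."""
    return reduce(reduce_counts, (map_kmers(s, k) for s in seqs), Counter())


if __name__ == "__main__":
    print(kmer_statistics(["ACGTAC", "ACNAC"], 2).most_common())
```

In a distributed setting the map step would run on separate chunks of the input and the reduce step would merge their tables; here both run in one process purely to show the data flow.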
Affiliation(s)
- Gianluca Roscigno
  - Dipartimento di Informatica, Università di Salerno, Fisciano, SA 84084, Italy
- Giuseppe Cattaneo
  - Dipartimento di Informatica, Università di Salerno, Fisciano, SA 84084, Italy
- Raffaele Giancarlo
  - Dipartimento di Matematica ed Informatica, Università di Palermo, Palermo 90133, Italy
9
Ferraro Petrillo U, Sorella M, Cattaneo G, Giancarlo R, Rombo SE. Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics. BMC Bioinformatics 2019; 20:138. [PMID: 30999863 PMCID: PMC6471689 DOI: 10.1186/s12859-019-2694-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 11/24/2022] Open
Abstract
Background Distributed approaches based on the MapReduce programming paradigm have started to be proposed in the Bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial in order to achieve good performance, especially on very large amounts of data. We choose k-mers counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collection of biological sequences, with arbitrary values of k. Results One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for a full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among the ones based on Big Data technologies, while exhibiting a very good scalability. Conclusions We provide evidence that the usage of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account for the algorithm design and implementation.
Affiliation(s)
- Mara Sorella
  - Dipartimento di Ingegneria Informatica, Automatica e Gestionale, Università di Roma - La Sapienza, Rome, 00185, Italy
- Giuseppe Cattaneo
  - Dipartimento di Informatica, Università di Salerno, Fisciano (SA), 84084, Italy
- Raffaele Giancarlo
  - Dipartimento di Matematica ed Informatica, Università di Palermo, Palermo, 90133, Italy
- Simona E Rombo
  - Dipartimento di Matematica ed Informatica, Università di Palermo, Palermo, 90133, Italy
10
Fassetti F, Giallombardo C, Leone O, Palopoli L, Rombo SE, Saiardi A. FEDRO: a software tool for the automatic discovery of candidate ORFs in plants with c→u RNA editing. BMC Bioinformatics 2019; 20:124. [PMID: 30999847 PMCID: PMC6471690 DOI: 10.1186/s12859-019-2696-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND RNA editing is an important mechanism for gene expression in plants organelles. It alters the direct transfer of genetic information from DNA to proteins, due to the introduction of differences between RNAs and the corresponding coding DNA sequences. Software tools successful for the search of genes in other organisms not always are able to correctly perform this task in plants organellar genomes. Moreover, the available software tools predicting RNA editing events utilise algorithms that do not account for events which may generate a novel start codon. RESULTS We present FEDRO, a Java software tool implementing a novel strategy to generate candidate Open Reading Frames (ORFs) resulting from Cytidine to Uridine (c→u) editing substitutions which occur in the mitochondrial genome (mtDNA) of a given input plant. The goal is to predict putative proteins of plants mitochondria that have not been yet annotated. In order to validate the generated ORFs, a screening is performed by checking for sequence similarity or presence in active transcripts of the same or similar organisms. We illustrate the functionalities of our framework on a model organism. CONCLUSIONS The proposed tool may be used also on other organisms and genomes. FEDRO is publicly available at http://math.unipa.it/rombo/FEDRO .
Affiliation(s)
- Fabio Fassetti
  - DIMES, Università della Calabria, Via Pietro Bucci 41 C, Cosenza, Italy
- Claudia Giallombardo
  - Department of Mathematics and Computer Science, Università degli Studi di Palermo, Via Archirafi 34, Palermo, Italy
- Ofelia Leone
  - DIMES, Università della Calabria, Via Pietro Bucci 41 C, Cosenza, Italy
- Luigi Palopoli
  - DIMES, Università della Calabria, Via Pietro Bucci 41 C, Cosenza, Italy
- Simona E Rombo
  - Department of Mathematics and Computer Science, Università degli Studi di Palermo, Via Archirafi 34, Palermo, Italy
- Adolfo Saiardi
  - LMCB, MRC, Cell Biology Unit and Department of Developmental Biology, University College, London, UK
11
Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Sci Rep 2019; 9:3577. [PMID: 30837494 PMCID: PMC6401088 DOI: 10.1038/s41598-019-38746-w] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 07/27/2018] [Accepted: 12/19/2018] [Indexed: 12/28/2022] Open
Abstract
In this paper, we present peptide-pair encoding (PPE), a general-purpose probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The idea of PPE segmentation is inspired by the byte-pair encoding (BPE) text compression algorithm, which has recently gained popularity in subword neural machine translation. We modify this algorithm by adding a sampling framework allowing for multiple ways of segmenting a sequence. PPE segmentation steps can be learned over a large set of protein sequences (Swiss-Prot) or even a domain-specific dataset and then applied to a set of unseen sequences. This representation can be widely used as the input to any downstream machine learning tasks in protein bioinformatics. In particular, here, we introduce this representation through protein motif discovery and protein sequence embedding. (i) DiMotif: we present DiMotif as an alignment-free discriminative motif discovery method and evaluate the method for finding protein motifs in three different settings: (1) comparison of DiMotif with two existing approaches on 20 distinct motif discovery problems which are experimentally verified, (2) classification-based approach for the motifs extracted for integrins, integrin-binding proteins, and biofilm formation, and (3) in sequence pattern searching for nuclear localization signal. The DiMotif, in general, obtained high recall scores, while having a comparable F1 score with other methods in the discovery of experimentally verified motifs. Having high recall suggests that the DiMotif can be used for short-list creation for further experimental investigations on motifs. In the classification-based evaluation, the extracted motifs could reliably detect the integrins, integrin-binding, and biofilm formation-related proteins on a reserved set of sequences with high F1 scores. (ii) ProtVecX: we extend k-mer based protein vector (ProtVec) embedding to variable-length protein embedding using PPE sub-sequences. We show that the new method of embedding can marginally outperform ProtVec in enzyme prediction as well as toxin prediction tasks. In addition, we conclude that the embeddings are beneficial in protein classification tasks when they are combined with raw amino acids k-mer features.
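The byte-pair encoding idea that inspires PPE can be sketched in a few lines. The example below is a plain, deterministic BPE-style merge learner over amino-acid strings; it deliberately omits the sampling framework that the abstract says PPE adds, so it should be read as background on BPE rather than as the PPE method. The function names `learn_merges`, `apply_merge` and `segment`, the toy sequences, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Fuse every non-overlapping occurrence of an adjacent token pair."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(seqs, num_merges):
    """Repeatedly merge the most frequent adjacent pair across the corpus."""
    corpus = [list(s) for s in seqs]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Assumption: ties are broken by the pair itself, for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def segment(seq, merges):
    """Segment an unseen sequence by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    learned = learn_merges(["MKVLMKV", "MKVA"], num_merges=2)
    print(learned, segment("MKVLA", learned))
```

Replaying the merges in the order they were learned mirrors how a segmentation trained on one collection can be applied to unseen sequences, as the abstract describes for Swiss-Prot.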
12
Zheng Y, Li H, Wang Y, Meng H, Zhang Q, Zhao X. Evolutionary mechanism and biological functions of 8-mers containing CG dinucleotide in yeast. Chromosome Res 2017; 25:173-189. [PMID: 28181048 DOI: 10.1007/s10577-017-9554-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/31/2016] [Revised: 12/27/2016] [Accepted: 01/27/2017] [Indexed: 01/01/2023]
Abstract
The rules of k-mer non-random usage and the biological functions are worthy of special attention. Firstly, the article studied human 8-mer spectra and found that only the spectra of cytosine-guanine (CG) dinucleotide classification formed independent unimodal distributions when the 8-mers were classified into three subsets under 16 dinucleotide classifications. Secondly, the distribution rules were reproduced by other seven species including yeast, which showed that the evolution phenomenon had species universality. It followed that we proposed two theoretical conjectures: (1) CG1 motifs (8-mers including 1 CG) are the nucleosome-binding motifs. (2) CG2 motifs (8-mers including two or more than two CG) are the modular units of CpG islands. Our conjectures were confirmed in yeast by the following results: a maximum of average area under the receiver operating characteristic (AUC) resulted from CG1 information during nucleosome core sequences, and linker sequences were distinguished by three CG subsets; there was a one-to-one relationship between abundant CG1 signal regions and histone positions; the sequence changing of squeezed nucleosomes was relevant with the strength of CG1 signals; and the AUC value of 0.986 was based on CG2 information when CpG islands and non-CpG islands were distinguished by the three CG subsets.
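The three-subset CG classification used in this abstract can be made concrete with a short sketch. It assumes the subsets are defined purely by the number of CG dinucleotides in a k-mer: CG1 for exactly one and CG2 for two or more, as stated above, plus a subset for none, which is labelled `CG0` here as an assumed name. The helpers `cg_class` and `partition_kmers` are hypothetical.

```python
from itertools import product


def cg_class(kmer: str) -> str:
    """Assign a k-mer to a subset by its number of CG dinucleotides."""
    # "CG" cannot overlap itself, so str.count gives the exact number.
    n = kmer.count("CG")
    if n == 0:
        return "CG0"  # assumed label for k-mers without any CG
    return "CG1" if n == 1 else "CG2"


def partition_kmers(k: int = 8) -> dict:
    """Split all 4**k k-mers over {A,C,G,T} into the three CG subsets."""
    subsets = {"CG0": [], "CG1": [], "CG2": []}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        subsets[cg_class(kmer)].append(kmer)
    return subsets


if __name__ == "__main__":
    for name, members in partition_kmers(8).items():
        print(name, len(members))
```

Counting how often the members of each subset occur in a genome would then give the per-subset spectra that the abstract compares.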
Affiliation(s)
- Yan Zheng
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
- Hong Li
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China; No. 235, West University Street, Hohhot, Inner Mongolia, China
- Yue Wang
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
- Hu Meng
- Laboratory of Theoretical Biophysics, School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, China
- Qiang Zhang
- College of Science, Inner Mongolia Agricultural University, Hohhot, 010018, China
- Xiaoqing Zhao
- Biotechnology research centre, Inner Mongolia Academy of Agricultural and Animal Husbandry Science, Hohhot, 010021, China
|
13
|
Cattaneo G, Giancarlo R, Piotto S, Ferraro Petrillo U, Roscigno G, Di Biasi L. MapReduce in Computational Biology - A Synopsis. ADVANCES IN ARTIFICIAL LIFE, EVOLUTIONARY COMPUTATION, AND SYSTEMS CHEMISTRY 2017. [DOI: 10.1007/978-3-319-57711-1_5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/23/2023]
|
14
|
Awazu A. Prediction of nucleosome positioning by the incorporation of frequencies and distributions of three different nucleotide segment lengths into a general pseudo k-tuple nucleotide composition. Bioinformatics 2016; 33:42-48. [PMID: 27563027 PMCID: PMC5860184 DOI: 10.1093/bioinformatics/btw562] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 06/16/2016] [Revised: 08/02/2016] [Accepted: 08/19/2016] [Indexed: 11/13/2022] Open
Abstract
Motivation: Nucleosome positioning plays important roles in many eukaryotic intranuclear processes, such as transcriptional regulation and chromatin structure formation. The investigations of nucleosome positioning rules provide a deeper understanding of these intracellular processes.
Results: Nucleosome positioning prediction was performed using a model consisting of three types of variables characterizing a DNA sequence: the number of five-nucleotide sequences, the number of three-nucleotide combinations in one period of a helix, and mono- and di-nucleotide distributions in DNA fragments. Using recently proposed stringent benchmark datasets with low biases for Saccharomyces cerevisiae, Homo sapiens, Caenorhabditis elegans and Drosophila melanogaster, the present model was shown to have a better prediction performance than the recently proposed predictors. This model was able to display the common and organism-dependent factors that affect nucleosome forming and inhibiting sequences as well. Therefore, the predictors developed here can accurately predict nucleosome positioning and help determine the key factors influencing this process.
Supplementary information: Supplementary data are available at Bioinformatics online.
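The first of the three variable types mentioned in this abstract, the number of five-nucleotide sequences in a DNA fragment, can be sketched with a sliding-window count. This is an illustrative Python sketch only, not the published predictor; the function name `kmer_counts` and the choice of overlapping windows are assumptions.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 5) -> Counter:
    """Count every overlapping k-mer in a DNA sequence (k=5 by default).

    Overlapping windows are an assumption; the abstract does not say
    how the five-nucleotide sequences are tallied.
    """
    seq = seq.upper()
    # A sequence shorter than k yields an empty range and an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

For example, `kmer_counts("ACGTACGTA")` scans five windows and counts `ACGTA` twice. Such counts could then be used as features for a classifier that separates nucleosome-forming from nucleosome-inhibiting sequences.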
Affiliation(s)
- Akinori Awazu
- Department of Mathematical and Life Sciences; Research Center for Mathematics on Chromatin Live Dynamics, Hiroshima University, Kagami-yama 1-3-1, Higashi-Hiroshima, 739-8526, Japan
|
15
|
Utro F, Di Benedetto V, Corona DF, Giancarlo R. The intrinsic combinatorial organization and information theoretic content of a sequence are correlated to the DNA encoded nucleosome organization of eukaryotic genomes. Bioinformatics 2015; 32:835-42. [DOI: 10.1093/bioinformatics/btv679] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 07/01/2015] [Accepted: 11/09/2015] [Indexed: 11/14/2022] Open
Abstract
Motivation: Thanks to research spanning nearly 30 years, two major models have emerged that account for nucleosome organization in chromatin: statistical and sequence specific. The former is based on elegant, easy to compute, closed-form mathematical formulas that make no assumptions about the physical and chemical properties of the underlying DNA sequence. Moreover, they need no training on the data for their computation. The latter is based on some sequence regularities but, as opposed to the statistical model, it lacks the same type of closed-form formulas that, in this case, should be based on the DNA sequence only.
Results: We contribute to closing this important methodological gap between the two models by providing three very simple formulas for the sequence specific one. They are all based on well-known formulas in Computer Science and Bioinformatics, and they give different quantifications of how complex a sequence is. In view of how remarkably well they perform, it is very surprising that measures of sequence complexity have not even been considered as candidates to close the mentioned gap. We provide experimental evidence that the intrinsic level of combinatorial organization and information-theoretic content of subsequences within a genome are strongly correlated to the level of DNA encoded nucleosome organization discovered by Kaplan et al. Our results establish an important connection between the intrinsic complexity of subsequences in a genome and the intrinsic, i.e. DNA encoded, nucleosome organization of eukaryotic genomes. It is a first step towards a mathematical characterization of this latter ‘encoding’.
Supplementary information: Supplementary data are available at Bioinformatics online.
Contact: futro@us.ibm.com.
Affiliation(s)
- Filippo Utro
- Computational Genomics Group, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
- Davide F.V. Corona
- Dipartimento STEBICEF, Dulbecco Telethon Institute c/o Università di Palermo, Palermo, Italy
|