1
|
Silva JM, Qi W, Pinho AJ, Pratas D. AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data. Gigascience 2022; 12:giad101. [PMID: 38091509 PMCID: PMC10716826 DOI: 10.1093/gigascience/giad101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2023] [Revised: 09/29/2023] [Accepted: 11/07/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND Low-complexity data analysis is the area that addresses the search and quantification of regions in sequences of elements that contain low-complexity or repetitive elements. For example, these can be tandem repeats, inverted repeats, homopolymer tails, GC-biased regions, similar genes, and hairpins, among many others. Identifying these regions is crucial because of their association with regulatory and structural characteristics. Moreover, their identification provides positional and quantity information where standard assembly methodologies face significant difficulties because of substantial higher depth coverage (mountains), ambiguous read mapping, or where sequencing or reconstruction defects may occur. However, the capability to distinguish low-complexity regions (LCRs) in genomic and proteomic sequences is a challenge that depends on the model's ability to find them automatically. Low-complexity patterns can be implicit through specific or combined sources, such as algorithmic or probabilistic, and recurring to different spatial distances-namely, local, medium, or distant associations. FINDINGS This article addresses the challenge of automatically modeling and distinguishing LCRs, providing a new method and tool (AlcoR) for efficient and accurate segmentation and visualization of these regions in genomic and proteomic sequences. The method enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns. The method is reference and alignment free, providing additional methodologies for testing, including a highly flexible simulation method for generating biological sequences (DNA or protein) with different complexity levels, sequence masking, and a visualization tool for automatic computation of the LCR maps into an ideogram style. We provide illustrative demonstrations using synthetic, nearly synthetic, and natural sequences showing the high efficiency and accuracy of AlcoR. As large-scale results, we use AlcoR to unprecedentedly provide a whole-chromosome low-complexity map of a recent complete human genome and the haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar. CONCLUSIONS The AlcoR method provides the ability of fast sequence characterization through data complexity analysis, ideally for scenarios entangling the presence of new or unknown sequences. AlcoR is implemented in C language using multithreading to increase the computational speed, is flexible for multiple applications, and does not contain external dependencies. The tool accepts any sequence in FASTA format. The source code is freely provided at https://github.com/cobilab/alcor.
Collapse
Affiliation(s)
- Jorge M Silva
- IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
| | - Weihong Qi
- Functional Genomics Center Zurich, ETH Zurich and University of Zurich, Winterthurerstrasse, 190, 8057, Zurich, Switzerland
- SIB, Swiss Institute of Bioinformatics, 1202, Geneva, Switzerland
| | - Armando J Pinho
- IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
| | - Diogo Pratas
- IEETA, Institute of Electronics and Informatics Engineering of Aveiro, and LASI, Intelligent Systems Associate Laboratory, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193, Aveiro, Portugal
- Department of Virology, University of Helsinki, Haartmaninkatu, 3, 00014 Helsinki, Finland
| |
Collapse
|
2
|
Dingle K, Camargo CQ, Louis AA. Input-output maps are strongly biased towards simple outputs. Nat Commun 2018; 9:761. [PMID: 29472533 PMCID: PMC5823903 DOI: 10.1038/s41467-018-03101-6] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2017] [Accepted: 01/19/2018] [Indexed: 11/09/2022] Open
Abstract
Many systems in nature can be described using discrete input–output maps. Without knowing details about a map, there may seem to be no a priori reason to expect that a randomly chosen input would be more likely to generate one output over another. Here, by extending fundamental results from algorithmic information theory, we show instead that for many real-world maps, the a priori probability P(x) that randomly sampled inputs generate a particular output x decays exponentially with the approximate Kolmogorov complexity \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\tilde K(x)$$\end{document}K~(x) of that output. These input–output maps are biased towards simplicity. We derive an upper bound P(x) ≲ \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$2^{ - a\tilde K(x) - b}$$\end{document}2-aK~(x)-b, which is tight for most inputs. The constants a and b, as well as many properties of P(x), can be predicted with minimal knowledge of the map. We explore this strong bias towards simple outputs in systems ranging from the folding of RNA secondary structures to systems of coupled ordinary differential equations to a stochastic financial trading model. Algorithmic information theory measures the complexity of strings. Here the authors provide a practical bound on the probability that a randomly generated computer program produces a given output of a given complexity and apply this upper bound to RNA folding and financial trading algorithms.
Collapse
Affiliation(s)
- Kamaludin Dingle
- Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Oxford, OX1 3NP, UK. .,Systems Biology DTC, University of Oxford, Oxford, OX1 3QU, UK. .,International Centre for Applied Mathematics and Computational Bioengineering, Department of Mathematics and Natural Sciences, Gulf University for Science and Technology, P.O. Box 7207, Hawally 32093, Mubarak Al-Abdullah, Kuwait.
| | - Chico Q Camargo
- Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Oxford, OX1 3NP, UK.,Systems Biology DTC, University of Oxford, Oxford, OX1 3QU, UK
| | - Ard A Louis
- Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Oxford, OX1 3NP, UK.
| |
Collapse
|
3
|
Pellegrini M, Renda ME, Vecchio A. TRStalker: an efficient heuristic for finding fuzzy tandem repeats. ACTA ACUST UNITED AC 2010; 26:i358-66. [PMID: 20529928 PMCID: PMC2881393 DOI: 10.1093/bioinformatics/btq209] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/16/2023]
Abstract
Motivation: Genomes in higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that are originated via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms, and also in several diseases (e.g. in the group of trinucleotide repeat disorders). While for TRs with a low or medium level of divergence the current methods are rather effective, the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. The detection of fuzzy TRs is propaedeutic to enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important as tools to shed light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events. Results: We have developed an algorithm (christened TRStalker) with the aim of detecting efficiently TRs that are hard to detect because of their inherent fuzziness, due to high levels of base substitutions, insertions and deletions. To attain this goal, we developed heuristics to solve a Steiner version of the problem for which the fuzziness is measured with respect to a motif string not necessarily present in the input string. This problem is akin to the ‘generalized median string’ that is known to be an NP-hard problem. Experiments with both synthetic and biological sequences demonstrate that our method performs better than current state of the art for fuzzy TRs and that the fuzzy TRs of the type we detect are indeed present in important biological sequences. Availability: TRStalker will be integrated in the web-based TRs Discovery Service (TReaDS) at bioalgo.iit.cnr.it. Contact:marco.pellegrini@iit.cnr.it Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Marco Pellegrini
- CNR, Istituto di Informatica e Telematica, Via Moruzzi 1, 56124 Pisa, Italy.
| | | | | |
Collapse
|
4
|
Giancarlo R, Scaturro D, Utro F. Textual data compression in computational biology: a synopsis. Bioinformatics 2009; 25:1575-86. [DOI: 10.1093/bioinformatics/btp117] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
|
5
|
Kolpakov R, Bana G, Kucherov G. mreps: Efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res 2003; 31:3672-8. [PMID: 12824391 PMCID: PMC169196 DOI: 10.1093/nar/gkg617] [Citation(s) in RCA: 221] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
The presence of repeated sequences is a fundamental feature of genomes. Tandemly repeated DNA appears in both eukaryotic and prokaryotic genomes, it is associated with various regulatory mechanisms and plays an important role in genomic fingerprinting. In this paper, we describe mreps, a powerful software tool for a fast identification of tandemly repeated structures in DNA sequences. mreps is able to identify all types of tandem repeats within a single run on a whole genomic sequence. It has a resolution parameter that allows the program to identify 'fuzzy' repeats. We introduce main algorithmic solutions behind mreps, describe its usage, give some execution time benchmarks and present several case studies to illustrate its capabilities. The mreps web interface is accessible through http://www.loria.fr/mreps/.
Collapse
Affiliation(s)
- Roman Kolpakov
- French-Russian Institute for Informatics and Applied Mathematics, Moscow University, 119899 Moscow, Russia
| | | | | |
Collapse
|
6
|
Rocha EPC. An appraisal of the potential for illegitimate recombination in bacterial genomes and its consequences: from duplications to genome reduction. Genome Res 2003; 13:1123-32. [PMID: 12743022 PMCID: PMC403640 DOI: 10.1101/gr.966203] [Citation(s) in RCA: 59] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
An exhaustive search for shortly spaced repeats in 74 bacterial chromosomes reveals that they are much more numerous than is usually acknowledged. These repeats were divided into five classes: close repeats (CRs), tandem repeats (TRs), simple sequence repeats (SSRs), spaced interspersed direct repeats, and "others." CRs are widespread and constitute the most abundant class, particularly in coding sequences. The other classes are less frequent, but each individual element shows a higher potential for recombination, when the number of repeats and their distances are taken into account. SSRs and TRs are more frequent in pathogens, as expected given their role in contingency loci, but are also widespread in the other bacteria. The analysis of CRs shows that they have an important role in the evolution of genomes, namely by generating duplications and deletions. Several cases compatible with a significant role of small CRs in the formation of large repeats were detected. Also, gene deletion in Buchnera correlates with repeat density, suggesting that CRs may lead to sequence deletion in general and genome reductive evolution of obligatory intracellular bacteria in particular. The assembly of these results indicates that shortly spaced repeats are key players in the dynamics of genome evolution.
Collapse
Affiliation(s)
- Eduardo P C Rocha
- Unité Génétique des Génomes Bactériens, Institut Pasteur, 75724 Paris Cedex 15, France.
| |
Collapse
|
7
|
Xin C, Sam K, Ming L. A compression algorithm for DNA sequences. IEEE ENGINEERING IN MEDICINE AND BIOLOGY MAGAZINE : THE QUARTERLY MAGAZINE OF THE ENGINEERING IN MEDICINE & BIOLOGY SOCIETY 2001; 20:61-6. [PMID: 11494771 DOI: 10.1109/51.940049] [Citation(s) in RCA: 60] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Affiliation(s)
- C Xin
- School of Mathematical Sciences, Peking University.
| | | | | |
Collapse
|
8
|
Danchin A, Guerdoux-Jamet P, Moszer I, Nitschké P. Mapping the bacterial cell architecture into the chromosome. Philos Trans R Soc Lond B Biol Sci 2000; 355:179-90. [PMID: 10724454 PMCID: PMC1692725 DOI: 10.1098/rstb.2000.0557] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
A genome is not a simple collection of genes. We propose here that it can be viewed as being organized as a 'celluloculus' similar to the homunculus of preformists, but pertaining to the category of programmes (or algorithms) rather than to that of architectures or structures: a significant correlation exists between the distribution of genes along the chromosome and the physical architecture of the cell. We review here data supporting this observation, stressing physical constraints operating on the cell's architecture and dynamics, and their consequences in terms of gene and genome structure. If such a correlation exists, it derives from some selection pressure: simple and general physical principles acting at the level of the cell structure are discussed. As a first case in point we see the piling up of planar modules as a stable, entropy-driven, architectural principle that could be at the root of the coupling between the architecture of the cell and the location of genes at specific places in the chromosome. We propose that the specific organization of certain genes whose products have a general tendency to form easily planar modules is a general motor for architectural organization in the bacterial cell. A second mechanism, operating at the transcription level, is described that could account for the efficient building up of complex structures. As an organizing principle we suggest that exploration by biological polymers of the vast space of possible conformation states is constrained by anchoring points. In particular, we suggest that transcription does not always allow the 5'-end of the transcript to go free and explore the many conformations available, but that, in many cases, it remains linked to the transcribing RNA polymerase complex in such a way that loops of RNA, rather than threads with a free end, explore the surrounding medium. In bacteria, extension of the loops throughout the cytoplasm would therefore be mediated by the de novo synthesis of ribosomes in growing cells. Termination of transcription and mRNA turnover would accordingly be expected to be controlled by sequence features at both the 3'- and 5'-ends of the molecule. These concepts are discussed taking into account in vitro analysis of genome sequences and experimental data about cell compartmentalization, mRNA folding and turnover, as well as known structural features of protein and membrane complexes.
Collapse
Affiliation(s)
- A Danchin
- Regulation de l'Expression Génétique, Institut Pasteur, Paris, France.
| | | | | | | |
Collapse
|
9
|
Abstract
We have developed PRIDE, a primer design program that automatically designs primers in single contigs or whole sequencing projects to extend the already known sequence and to double strand single-stranded regions. The program is fully integrated into the Staden package (GAP4) and accessible with a graphical user interface. PRIDE uses a fuzzy logic-based system to calculate primer qualities. The computational performance of PRIDE is enhanced by using suffix trees to store the huge amount of data being produced. A test set of 110 sequencing primers and 11 PCR primer pairs has been designed on genomic templates, cDNAs and sequences containing repetitive elements to analyze PRIDE's success rate. The high performance of PRIDE, combined with its minimal requirement of user interaction and its fast algorithm, make this program useful for the large scale design of primers, especially in large sequencing projects.
Collapse
Affiliation(s)
- S Haas
- Department of Molecular Genome Analysis and Department of Theoretical Bioinformatics,Deutsches Krebsforschungszentrum, Im Neuenheimer Feld 280, D-69120 Heidelberg, Germany.
| | | | | | | |
Collapse
|