Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Rivals E, Delgrange O, Delahaye JP, Dauchet M, Delorme MO, Hénaut A, Ollivier E. Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences. Comput Appl Biosci 1997;13:131-6. [PMID: 9146959 DOI: 10.1093/bioinformatics/13.2.131] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/04/2023]

For:	Rivals E, Delgrange O, Delahaye JP, Dauchet M, Delorme MO, Hénaut A, Ollivier E. Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences. Comput Appl Biosci 1997;13:131-6. [PMID: 9146959 DOI: 10.1093/bioinformatics/13.2.131] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/04/2023]

Number

Cited by Other Article(s)

Silva JM, Qi W, Pinho AJ, Pratas D. AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data. Gigascience 2022;12:giad101. [PMID: 38091509 PMCID: PMC10716826 DOI: 10.1093/gigascience/giad101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2023] [Revised: 09/29/2023] [Accepted: 11/07/2023] [Indexed: 12/18/2023] Open

Abstract

BACKGROUND

Low-complexity data analysis is the area that addresses the search and quantification of regions in sequences of elements that contain low-complexity or repetitive elements. For example, these can be tandem repeats, inverted repeats, homopolymer tails, GC-biased regions, similar genes, and hairpins, among many others. Identifying these regions is crucial because of their association with regulatory and structural characteristics. Moreover, their identification provides positional and quantity information where standard assembly methodologies face significant difficulties because of substantial higher depth coverage (mountains), ambiguous read mapping, or where sequencing or reconstruction defects may occur. However, the capability to distinguish low-complexity regions (LCRs) in genomic and proteomic sequences is a challenge that depends on the model's ability to find them automatically. Low-complexity patterns can be implicit through specific or combined sources, such as algorithmic or probabilistic, and recurring to different spatial distances-namely, local, medium, or distant associations.

FINDINGS

This article addresses the challenge of automatically modeling and distinguishing LCRs, providing a new method and tool (AlcoR) for efficient and accurate segmentation and visualization of these regions in genomic and proteomic sequences. The method enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns. The method is reference and alignment free, providing additional methodologies for testing, including a highly flexible simulation method for generating biological sequences (DNA or protein) with different complexity levels, sequence masking, and a visualization tool for automatic computation of the LCR maps into an ideogram style. We provide illustrative demonstrations using synthetic, nearly synthetic, and natural sequences showing the high efficiency and accuracy of AlcoR. As large-scale results, we use AlcoR to unprecedentedly provide a whole-chromosome low-complexity map of a recent complete human genome and the haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar.

CONCLUSIONS

The AlcoR method provides the ability of fast sequence characterization through data complexity analysis, ideally for scenarios entangling the presence of new or unknown sequences. AlcoR is implemented in C language using multithreading to increase the computational speed, is flexible for multiple applications, and does not contain external dependencies. The tool accepts any sequence in FASTA format. The source code is freely provided at https://github.com/cobilab/alcor.

Collapse

Dingle K, Camargo CQ, Louis AA. Input-output maps are strongly biased towards simple outputs. Nat Commun 2018;9:761. [PMID: 29472533 PMCID: PMC5823903 DOI: 10.1038/s41467-018-03101-6] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2017] [Accepted: 01/19/2018] [Indexed: 11/09/2022] Open

Abstract

Many systems in nature can be described using discrete input–output maps. Without knowing details about a map, there may seem to be no a priori reason to expect that a randomly chosen input would be more likely to generate one output over another. Here, by extending fundamental results from algorithmic information theory, we show instead that for many real-world maps, the a priori probability P(x) that randomly sampled inputs generate a particular output x decays exponentially with the approximate Kolmogorov complexity \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde K(x)$$\end{document}K~(x) of that output. These input–output maps are biased towards simplicity. We derive an upper bound P(x) ≲ \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$2^{ - a\tilde K(x) - b}$$\end{document}2-aK~(x)-b, which is tight for most inputs. The constants a and b, as well as many properties of P(x), can be predicted with minimal knowledge of the map. We explore this strong bias towards simple outputs in systems ranging from the folding of RNA secondary structures to systems of coupled ordinary differential equations to a stochastic financial trading model.

Algorithmic information theory measures the complexity of strings. Here the authors provide a practical bound on the probability that a randomly generated computer program produces a given output of a given complexity and apply this upper bound to RNA folding and financial trading algorithms.

Collapse

Pellegrini M, Renda ME, Vecchio A. TRStalker: an efficient heuristic for finding fuzzy tandem repeats. ACTA ACUST UNITED AC 2010;26:i358-66. [PMID: 20529928 PMCID: PMC2881393 DOI: 10.1093/bioinformatics/btq209] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/16/2023]

Abstract

Motivation: Genomes in higher eukaryotic organisms contain a substantial amount of repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that are originated via phenomena such as replication slippage and are characterized by close spatial contiguity. They play an important role in several molecular regulatory mechanisms, and also in several diseases (e.g. in the group of trinucleotide repeat disorders). While for TRs with a low or medium level of divergence the current methods are rather effective, the problem of detecting TRs with higher divergence (fuzzy TRs) is still open. The detection of fuzzy TRs is propaedeutic to enriching our view of their role in regulatory mechanisms and diseases. Fuzzy TRs are also important as tools to shed light on the evolutionary history of the genome, where higher divergence correlates with more remote duplication events.

Results: We have developed an algorithm (christened TRStalker) with the aim of detecting efficiently TRs that are hard to detect because of their inherent fuzziness, due to high levels of base substitutions, insertions and deletions. To attain this goal, we developed heuristics to solve a Steiner version of the problem for which the fuzziness is measured with respect to a motif string not necessarily present in the input string. This problem is akin to the ‘generalized median string’ that is known to be an NP-hard problem. Experiments with both synthetic and biological sequences demonstrate that our method performs better than current state of the art for fuzzy TRs and that the fuzzy TRs of the type we detect are indeed present in important biological sequences.

Availability: TRStalker will be integrated in the web-based TRs Discovery Service (TReaDS) at bioalgo.iit.cnr.it.

Contact:marco.pellegrini@iit.cnr.it

Supplementary information:Supplementary data are available at Bioinformatics online.

Collapse

Giancarlo R, Scaturro D, Utro F. Textual data compression in computational biology: a synopsis. Bioinformatics 2009;25:1575-86. [DOI: 10.1093/bioinformatics/btp117] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open

Kolpakov R, Bana G, Kucherov G. mreps: Efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res 2003;31:3672-8. [PMID: 12824391 PMCID: PMC169196 DOI: 10.1093/nar/gkg617] [Citation(s) in RCA: 221] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Rocha EPC. An appraisal of the potential for illegitimate recombination in bacterial genomes and its consequences: from duplications to genome reduction. Genome Res 2003;13:1123-32. [PMID: 12743022 PMCID: PMC403640 DOI: 10.1101/gr.966203] [Citation(s) in RCA: 59] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]

Xin C, Sam K, Ming L. A compression algorithm for DNA sequences. IEEE ENGINEERING IN MEDICINE AND BIOLOGY MAGAZINE : THE QUARTERLY MAGAZINE OF THE ENGINEERING IN MEDICINE & BIOLOGY SOCIETY 2001;20:61-6. [PMID: 11494771 DOI: 10.1109/51.940049] [Citation(s) in RCA: 60] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]

Danchin A, Guerdoux-Jamet P, Moszer I, Nitschké P. Mapping the bacterial cell architecture into the chromosome. Philos Trans R Soc Lond B Biol Sci 2000;355:179-90. [PMID: 10724454 PMCID: PMC1692725 DOI: 10.1098/rstb.2000.0557] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Abstract

A genome is not a simple collection of genes. We propose here that it can be viewed as being organized as a 'celluloculus' similar to the homunculus of preformists, but pertaining to the category of programmes (or algorithms) rather than to that of architectures or structures: a significant correlation exists between the distribution of genes along the chromosome and the physical architecture of the cell. We review here data supporting this observation, stressing physical constraints operating on the cell's architecture and dynamics, and their consequences in terms of gene and genome structure. If such a correlation exists, it derives from some selection pressure: simple and general physical principles acting at the level of the cell structure are discussed. As a first case in point we see the piling up of planar modules as a stable, entropy-driven, architectural principle that could be at the root of the coupling between the architecture of the cell and the location of genes at specific places in the chromosome. We propose that the specific organization of certain genes whose products have a general tendency to form easily planar modules is a general motor for architectural organization in the bacterial cell. A second mechanism, operating at the transcription level, is described that could account for the efficient building up of complex structures. As an organizing principle we suggest that exploration by biological polymers of the vast space of possible conformation states is constrained by anchoring points. In particular, we suggest that transcription does not always allow the 5'-end of the transcript to go free and explore the many conformations available, but that, in many cases, it remains linked to the transcribing RNA polymerase complex in such a way that loops of RNA, rather than threads with a free end, explore the surrounding medium. In bacteria, extension of the loops throughout the cytoplasm would therefore be mediated by the de novo synthesis of ribosomes in growing cells. Termination of transcription and mRNA turnover would accordingly be expected to be controlled by sequence features at both the 3'- and 5'-ends of the molecule. These concepts are discussed taking into account in vitro analysis of genome sequences and experimental data about cell compartmentalization, mRNA folding and turnover, as well as known structural features of protein and membrane complexes.

Collapse

Haas S, Vingron M, Poustka A, Wiemann S. Primer design for large scale sequencing. Nucleic Acids Res 1998;26:3006-12. [PMID: 9611248 PMCID: PMC147651 DOI: 10.1093/nar/26.12.3006] [Citation(s) in RCA: 61] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023] Open