1
|
Brudno M, Medvedev P, Stoye J, De La Vega FM. A report on the 2009 SIG on short read sequencing and algorithms (Short-SIG). Bioinformatics 2009; 25:2863-4. [DOI: 10.1093/bioinformatics/btp525] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022] Open
|
2
|
Abstract
This paper revisits the problem of sorting by reversals with tools developed in the context of detecting common intervals. Mixing the two approaches yields new definitions and algorithms for reversal distance computations that apply directly to the original permutation. Traditional constructions such as recasting the signed permutation as a positive permutation, or traversing the overlap graph to analyze its connected components, are replaced by elementary definitions in terms of intervals of the permutation. This yields simple linear time algorithms that identify the essential features in a single pass over the permutation and use only simple data structures like arrays and stacks.
Affiliation(s)
- A Bergeron
- LaCIM, Université du Québec à Montréal, Canada
|
3
|
Pollard DA, Bergman CM, Stoye J, Celniker SE, Eisen MB. Correction: Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics 2004; 5:73. [PMID: 15186509 PMCID: PMC446183 DOI: 10.1186/1471-2105-5-73] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 05/11/2004] [Accepted: 06/08/2004] [Indexed: 12/02/2022] Open
Affiliation(s)
- DA Pollard
- Biophysics Graduate Group, University of California, Berkeley, CA 94720, USA
- CM Bergman
- Department of Genome Science, Life Science Division, Lawrence Orlando Berkeley National Laboratory, Berkeley, CA 94720, USA
- Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Current address: Department of Genetics, University of Cambridge, Cambridge, CB2 3EH, UK
- J Stoye
- Technische Fakultät, Universität Bielefeld, 33594 Bielefeld, Germany
- SE Celniker
- Department of Genome Science, Life Science Division, Lawrence Orlando Berkeley National Laboratory, Berkeley, CA 94720, USA
- Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- MB Eisen
- Department of Genome Science, Life Science Division, Lawrence Orlando Berkeley National Laboratory, Berkeley, CA 94720, USA
- Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
|
4
|
Abstract
Integrating different alignment strategies, a layout editor and tools deriving phylogenetic trees in a 'multiple alignment environment' helps to investigate and enhance results of multiple sequence alignment by hand. QAlign combines algorithms for fast progressive and accurate simultaneous multiple alignment with a versatile editor and a dynamic phylogenetic analysis in a convenient graphical user interface.
Affiliation(s)
- M Sammeth
- Department of Computer Science II, University of Würzburg, 97074 Würzburg, Germany
|
5
|
Kurtz S, Choudhuri JV, Ohlebusch E, Schleiermacher C, Stoye J, Giegerich R. REPuter: the manifold applications of repeat analysis on a genomic scale. Nucleic Acids Res 2001; 29:4633-42. [PMID: 11713313 PMCID: PMC92531 DOI: 10.1093/nar/29.22.4633] [Citation(s) in RCA: 1252] [Impact Index Per Article: 54.4] [Received: 07/25/2001] [Revised: 09/19/2001] [Accepted: 09/19/2001] [Indexed: 11/13/2022] Open
Abstract
The repetitive structure of genomic DNA holds many secrets to be discovered. A systematic study of repetitive DNA on a genomic or inter-genomic scale requires extensive algorithmic support. The REPuter program described herein was designed to serve as a fundamental tool in such studies. Efficient and complete detection of various types of repeats is provided together with an evaluation of significance and interactive visualization. This article circumscribes the wide scope of repeat analysis using applications in five different areas of sequence analysis: checking fragment assemblies, searching for low copy repeats, finding unique sequences, comparing gene structures and mapping of cDNA/EST sequences.
Affiliation(s)
- S Kurtz
- Faculty of Technology, University of Bielefeld, PO Box 10 01 31, D-33501 Bielefeld, Germany.
|
6
|
Abstract
In physical mapping, one orders a set of genetic landmarks or a library of cloned fragments of DNA according to their position in the genome. Our approach to physical mapping divides the problem into smaller and easier subproblems by partitioning the probe set into independent parts (probe contigs). For this purpose we introduce a new distance function between probes, the averaged rank distance (ARD) derived from bootstrap resampling of the raw data. The ARD measures the pairwise distances of probes within a contig and smoothes the distances of probes across different contigs. It shows distinct jumps at contig borders. This makes it appropriate for contig selection by clustering. We have designed a physical mapping algorithm that makes use of these observations and seems to be particularly well suited to the delineation of reliable contigs. We evaluated our method on data sets from two physical mapping projects. On data from the recently sequenced bacterium Xylella fastidiosa, the probe contig set produced by the new method was evaluated using the probe order derived from the sequence information. Our approach yielded a basically correct contig set. On this data we also compared our method to an approach which uses the number of supporting clones to determine contigs. Our map is much more accurate. In comparison to a physical map of Pasteurella haemolytica that was computed using simulated annealing, the newly computed map is considerably cleaner. The results of our method have already proven helpful for the design of experiments aimed at further improving the quality of a map.
Affiliation(s)
- S Heber
- German Cancer Research Center (DKFZ), Theoretical Bioinformatics (H0300), Heidelberg, Germany.
|
7
|
Kurtz S, Ohlebusch E, Schleiermacher C, Stoye J, Giegerich R. Computation and visualization of degenerate repeats in complete genomes. Proc Int Conf Intell Syst Mol Biol 2001; 8:228-38. [PMID: 10977084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/17/2023]
Abstract
The repetitive structure of genomic DNA holds many secrets to be discovered. A systematic study of repetitive DNA on a genomic or inter-genomic scale requires extensive algorithmic support. The REPuter family of programs described herein was designed to serve as a fundamental tool in such studies. Efficient and complete detection of various types of repeats is provided together with an evaluation of significance, interactive visualization, and simple interfacing to other analysis programs.
Affiliation(s)
- S Kurtz
- Faculty of Technology, University of Bielefeld, Germany.
|
8
|
Abstract
MOTIVATION: Multiple sequence alignment is an important tool in computational biology. In order to solve the task of computing multiple alignments in affordable time, the most commonly used multiple alignment methods have to use heuristics. Nevertheless, the computation of optimal multiple alignments is important in its own right, and it provides a means of evaluating heuristic approaches or serves as a subprocedure of heuristic alignment methods. RESULTS: We present an algorithm that uses the divide-and-conquer alignment approach together with recent results on search space reduction to speed up the computation of multiple sequence alignments. The method is adaptive in that, depending on the time one wants to spend on the alignment, a better, up to optimal, alignment can be obtained. To speed up the computation in the optimal alignment step, we apply the A* algorithm, which leads to a procedure provably more efficient than previous exact algorithms. We also describe our implementation of the algorithm and present results showing the effectiveness and limitations of the procedure.
Affiliation(s)
- K Reinert
- Celera Genomics, Informatics Research, 45 West Gude Drive, Rockville, MD 20850, USA
|
9
|
Spang R, Rehmsmeier M, Stoye J. Sequence database search using jumping alignments. Proc Int Conf Intell Syst Mol Biol 2000; 8:367-75. [PMID: 10977097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/15/2023]
Abstract
We describe a new algorithm for amino acid sequence classification and the detection of remote homologues. The rationale is to exploit both vertical and horizontal information of a multiple alignment in a well balanced manner. This is in contrast to established methods like profiles and hidden Markov models which focus on vertical information as they model the columns of the alignment independently. In our setting, we want to select from a given database of "candidate sequences" those proteins that belong to a given superfamily. In order to do so, each candidate sequence is separately tested against a multiple alignment of the known members of the superfamily by means of a new jumping alignment algorithm. This algorithm is an extension of the Smith-Waterman algorithm and computes a local alignment of a single sequence and a multiple alignment. In contrast to traditional methods, however, this alignment is not based on a summary of the individual columns of the multiple alignment. Rather, the candidate sequence at each position is aligned to one sequence of the multiple alignment, called the "reference sequence". In addition, the reference sequence may change within the alignment, while each such jump is penalized. To evaluate the discriminative quality of the jumping alignment algorithm, we compared it to hidden Markov models on a subset of the SCOP database of protein domains. The discriminative quality was assessed by counting the number of false positives that ranked higher than the first true positive (FP-count). For moderate FP-counts above five, the number of successful searches with our method was considerably higher than with hidden Markov models.
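The FP-count used in this evaluation can be illustrated with a minimal Python sketch. The function name `fp_count` and the input format (a list of booleans ordered from best to worst score, with `True` marking a true member of the superfamily) are assumptions made here for illustration and are not taken from the paper.

```python
from typing import Sequence


def fp_count(ranked_is_member: Sequence[bool]) -> int:
    """Count false positives ranked above the first true positive.

    ranked_is_member lists the database hits from best to worst score;
    True marks a true superfamily member (hypothetical input format).
    """
    count = 0
    for is_member in ranked_is_member:
        if is_member:
            # First true positive reached: stop counting.
            return count
        count += 1
    # No true positive in the ranking: every hit is a false positive.
    return count


# Two non-members outrank the first true member.
print(fp_count([False, False, True, False]))  # prints 2
```

Under this reading, a search would count as successful at a given threshold when its FP-count does not exceed that threshold; the exact success criterion is not spelled out in the abstract.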
Affiliation(s)
- R Spang
- German Cancer Research Center (DKFZ), Theoretical Bioinformatics, Heidelberg, Germany
|
10
|
Krause A, Stoye J, Vingron M. The SYSTERS protein sequence cluster set. Nucleic Acids Res 2000; 28:270-2. [PMID: 10592244 PMCID: PMC102384 DOI: 10.1093/nar/28.1.270] [Citation(s) in RCA: 62] [Impact Index Per Article: 2.6] [Received: 08/09/1999] [Revised: 09/17/1999] [Accepted: 10/04/1999] [Indexed: 11/13/2022] Open
Abstract
The SYSTERS (short for SYSTEmatic Re-Searching) protein sequence cluster set consists of the classification of all sequences from SWISS-PROT and PIR into disjoint protein family clusters and hierarchically into superfamily and subfamily clusters. The cluster set can be searched with a sequence using the SSMAL search tool or a traditional database search tool like BLAST or FASTA. Additionally, a multiple alignment is generated for each cluster and annotated with domain information from the Pfam database of protein domain families. A taxonomic overview of the organisms covered by a cluster is given based on the NCBI taxonomy. The cluster set is available for querying and browsing at http://www.dkfz-heidelberg.de/tbi/services/cluster/systersform
Affiliation(s)
- A Krause
- Deutsches Krebsforschungszentrum, Theoretische Bioinformatik, Im Neuenheimer Feld 280, D-69120 Heidelberg, Germany.
|
11
|
Affiliation(s)
- J Stoye
- Division of Virology, National Institute for Medical Research, London, UK
|
12
|
Abstract
MOTIVATION: We present a new probabilistic model of the evolution of RNA-, DNA-, or protein-like sequences and a software tool, Rose, that implements this model. Guided by an evolutionary tree, a family of related sequences is created from a common ancestor sequence by insertion, deletion and substitution of characters. During this artificial evolutionary process, the 'true' history is logged and the 'correct' multiple sequence alignment is created simultaneously. The model also allows for varying rates of mutation within the sequences, making it possible to establish so-called sequence motifs. RESULTS: The data created by Rose are suitable for the evaluation of methods in multiple sequence alignment computation and the prediction of phylogenetic relationships. It can also be useful when teaching courses in or developing models of sequence evolution and in the study of evolutionary processes. AVAILABILITY: Rose is available on the Bielefeld Bioinformatics WebServer under the following URL: http://bibiserv.TechFak.Uni-Bielefeld.DE/rose/ The source code is available upon request. CONTACT: folker@TechFak.Uni-Bielefeld.DE
Affiliation(s)
- J Stoye
- Research Center for Interdisciplinary Studies on Structure Formation (FSPM), University of Bielefeld, Postfach, Germany
|
13
|
Abstract
An improved algorithm for the simultaneous alignment of multiple protein and nucleic acid sequences, the Divide-and-Conquer Alignment procedure (DCA), is presented. The basic method described in Tönges et al. (1996) (Tönges, U., Perrey, S.W., Stoye, J., Dress, A.W.M., 1996. A general method for fast multiple sequence alignment. Gene, 172, GC33-GC41) is generalized to align any number of sequences and to work with arbitrary (e.g. affine linear) gap penalty functions. Also, the practical efficiency of the method is improved so that families of more than 10 sequences can now be aligned simultaneously within a few seconds or minutes. After a brief description of the general method, we assess the time and memory requirements of our implementation of DCA. We present several examples showing that the program is able to deal with real-world alignment problems.
Affiliation(s)
- J Stoye
- Research Center for Interdisciplinary Studies on Structure Formation (FSPM), University of Bielefeld, Postfach 10 01 31, D-33501 Bielefeld, Germany.
|
14
|
Stoye J, Moulton V, Dress AW. DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment. Comput Appl Biosci 1997; 13:625-6. [PMID: 9475994 DOI: 10.1093/bioinformatics/13.6.625] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.1] [Indexed: 02/06/2023]
Abstract
MOTIVATION: DCA is a new computer program for multiple sequence alignment which utilizes a 'divide-and-conquer' type of heuristic approach. AVAILABILITY: The algorithm is freely available from http://bibiserv.TechFak.Uni-Bielefeld.DE/dca/.
Affiliation(s)
- J Stoye
- Research Center for Interdisciplinary Studies on Structure Formation (FSPM), University of Bielefeld, Germany.
|
15
|
Stoye J, Evers D, Meyer F. Generating benchmarks for multiple sequence alignments and phylogenetic reconstructions. Proc Int Conf Intell Syst Mol Biol 1997; 5:303-6. [PMID: 9322053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Abstract
We present a new probabilistic model of evolution of RNA-, DNA-, or protein-like sequences and a tool rose that implements this model. By insertion, deletion and substitution of characters, a family of sequences is created from a common ancestor. During this artificial evolutionary process, the "true" history is logged and the "correct" multiple sequence alignment is created simultaneously. We also allow for varying rates of mutation within the sequences making it possible to establish so-called sequence motifs. The results are suitable for the evaluation of methods in multiple sequence alignment computation and the prediction of phylogenetic relationships.
Affiliation(s)
- J Stoye
- Research Center for Interdisciplinary Studies on Structure Formation (FSPM), University of Bielefeld, Germany.
|
16
|
Abstract
We have developed a fast heuristic algorithm for multiple sequence alignment which provides near-to-optimal results for sufficiently homologous sequences. The algorithm makes use of the standard dynamic programming procedure by applying it to all pairs of sequences. The resulting score matrices for pair-wise alignment give rise to secondary matrices containing the additional charges imposed by forcing the alignment path to run through a particular vertex. Such a constraint corresponds to slicing the sequences at the positions defining that vertex, and aligning the remaining pairs of prefix and suffix sequences separately. From these secondary matrices, one can compute, for any given family of sequences, suitable positions for cutting all of these sequences simultaneously, thus reducing the problem of aligning a family of n sequences of average length l in a Divide and Conquer fashion to aligning two families of n sequences of approximately half that length. In this paper, we explain the method for the case of 3 sequences in detail, and we demonstrate its potential and its limits by discussing its behaviour for several test families. A generalization for aligning more than 3 sequences is outlined, and some actual alignments constructed by our algorithm for various user-defined parameters are presented.
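The secondary matrices of additional charges can be sketched for a single pair of sequences as follows. This is a minimal illustration that assumes unit-cost edit distance as the pairwise scoring scheme, since the abstract does not state the scoring actually used; the function names `prefix_costs` and `additional_costs` are hypothetical.

```python
def prefix_costs(a: str, b: str) -> list[list[int]]:
    """D[i][j] = unit-cost edit distance between a[:i] and b[:j]."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        D[0][j] = j  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitute = D[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            D[i][j] = min(substitute, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D


def additional_costs(a: str, b: str) -> list[list[int]]:
    """C[i][j] = extra cost of forcing the alignment path through (i, j).

    Cutting at (i, j) aligns a[:i] with b[:j] and a[i:] with b[j:]
    separately; C is the sum of both costs minus the optimal cost.
    """
    m, n = len(a), len(b)
    forward = prefix_costs(a, b)
    # Suffix costs come from the reversed sequences (assumed symmetric costs).
    backward = prefix_costs(a[::-1], b[::-1])
    optimum = forward[m][n]
    return [
        [forward[i][j] + backward[m - i][n - j] - optimum for j in range(n + 1)]
        for i in range(m + 1)
    ]


C = additional_costs("ACGT", "AGT")
print(C[1][1])  # prints 0: cutting after the shared 'A' costs nothing extra
```

A zero entry marks a cut position that lies on some optimal pairwise alignment path. Choosing simultaneous cut positions for a whole family would then combine such matrices over all pairs of sequences, a step this sketch does not attempt.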
Affiliation(s)
- U Tönges
- Research Center for Interdisciplinary Studies on Structure Formation (RCSF), University of Bielefeld, Germany.
|
17
|
Affiliation(s)
- C M Abbott
- Department of Genetics and Biochemistry, UCL, London, UK
|
18
|
DeLamarter JF, Stoye J, Schumann G, Moroni C. Evidence supporting a physiological role for endogenous C-type virus in the immune system. Haematol Blood Transfus 1979; 23:413-5. [PMID: 232464 DOI: 10.1007/978-3-642-67057-2_54] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/13/2022]
|