1
Zhan Q, Fu Y, Jiang Q, Liu B, Peng J, Wang Y. SpliVert: A Protein Multiple Sequence Alignment Refinement Method Based on Splitting-Splicing Vertically. Protein Pept Lett 2020; 27:295-302. [PMID: 31385760 DOI: 10.2174/0929866526666190806143959] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/06/2019] [Revised: 04/26/2019] [Accepted: 06/14/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Multiple Sequence Alignment (MSA) is a fundamental task in bioinformatics and is required for many biological analysis tasks. The more accurate the alignments are, the more credible the downstream analyses. Most protein MSA algorithms refine an alignment by dividing it into two groups horizontally and then realigning the two groups. However, this strategy does not consider that different regions of the sequences have different conservation; this property may lead to incorrect residue-residue or residue-gap pairs, which cannot be corrected by this strategy. OBJECTIVE In this article, our motivation is to develop a novel refinement method based on splitting-splicing vertically. METHODS Here, we present a novel refinement method based on splitting-splicing vertically, called SpliVert. For an alignment, we split it vertically into 3 parts, remove the gap characters in the middle part, realign the middle part alone, and splice the realigned middle part with the other two initial pieces to obtain a refined alignment. In the realignment procedure of our method, the aligner focuses only on a certain part, ignoring the disturbance of the other parts, which could help fix the incorrect pairs. RESULTS We tested our refinement strategy for 2 leading MSA tools on 3 standard benchmarks, according to the commonly used average SP (and TC) score. The results show that, given appropriate proportions to split the initial alignment, the average scores are comparable or slightly increased after using our method. We also compared the alignments refined by our method with alignments directly refined by the original alignment tools. The results suggest that using our SpliVert method to refine alignments can also outperform direct use of the original alignment tools. CONCLUSION The results reveal that splitting vertically and realigning part of the alignment is a good strategy for the refinement of protein multiple sequence alignments.
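The split, realign, and splice steps described in the METHODS part can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `start`/`end` column boundaries, and the placeholder `pad_realign` (which only right-pads with gaps and is not a real aligner) are all assumptions introduced here. In practice the `realign` argument would wrap an actual MSA tool that returns the sequences in the same order.

```python
from typing import Callable, List


def pad_realign(seqs: List[str]) -> List[str]:
    """Hypothetical stand-in for an aligner: right-pad ungapped sequences with '-'.

    A real refinement run would call an MSA tool here instead.
    """
    width = max((len(s) for s in seqs), default=0)
    return [s.ljust(width, "-") for s in seqs]


def split_splice_vertically(
    alignment: List[str],
    start: int,
    end: int,
    realign: Callable[[List[str]], List[str]] = pad_realign,
) -> List[str]:
    """Split columns into [0:start], [start:end], [end:], realign the middle, splice."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    if not 0 <= start <= end <= width:
        raise ValueError("start and end must satisfy 0 <= start <= end <= width")

    left = [row[:start] for row in alignment]
    right = [row[end:] for row in alignment]
    # Remove gap characters from the middle part before realigning it alone.
    middle = [row[start:end].replace("-", "") for row in alignment]
    realigned = realign(middle)  # assumed to keep the input order of sequences
    # Splice the realigned middle with the two untouched flanking pieces.
    return [l + m + r for l, m, r in zip(left, realigned, right)]


if __name__ == "__main__":
    toy = ["AC-G-TA", "ACGG-TA", "A--GCTA"]
    for row in split_splice_vertically(toy, 1, 5):
        print(row)
```

Passing the aligner as a callable keeps the splitting and splicing logic independent of whichever alignment tool is being refined.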
Affiliation(s)
- Qing Zhan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Yilei Fu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Qinghua Jiang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Bo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
2
Gao L, Bao W, Zhang H, Yuan CA, Huang DS. Fast sequence analysis based on diamond sampling. PLoS One 2018; 13:e0198922. [PMID: 29953448 PMCID: PMC6023231 DOI: 10.1371/journal.pone.0198922] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/27/2018] [Accepted: 05/29/2018] [Indexed: 12/02/2022] Open
Abstract
Position weight matrices (PWMs) are an important method for modelling motifs in biological sequences, in both DNA and protein contexts. With the development of genome sequencing technology, the quantity of sequence data is increasing explosively, so faster search algorithms that can meet this growing need are required. In this paper, we propose a method for speeding up the search for candidate transcription factor binding sites (TFBS); users can specify the threshold p to get the desired trade-off between speed and sensitivity for a particular sequence analysis. Moreover, the proposed method can also be generalized to large-scale annotation and sequence projects.
Affiliation(s)
- Liangxin Gao
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
- Wenzhen Bao
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
- Hongbo Zhang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
- Chang-An Yuan
- Science Computing and Intelligent Information Processing of GuangXi Higher Education Key Laboratory, Guangxi Teachers Education University, Nanning, Guangxi, China
- De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
3
Afshar PT, Wong WH. COSINE: non-seeding method for mapping long noisy sequences. Nucleic Acids Res 2017; 45:e132. [PMID: 28586438 PMCID: PMC5737678 DOI: 10.1093/nar/gkx511] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 08/28/2016] [Revised: 05/16/2017] [Accepted: 06/04/2017] [Indexed: 11/20/2022] Open
Abstract
Third-generation sequencing (TGS) technologies are highly promising, but the long and noisy reads they produce are difficult to align using existing algorithms. Here, we present COSINE, a conceptually new method designed specifically for aligning long reads contaminated by a high level of errors. COSINE computes the context similarity of two stretches of nucleobases from the similarity between the distributions of their short k-mers (k = 3-4) along the sequences. The results on simulated and real data show that COSINE achieves high sensitivity and specificity under a wide range of read accuracies. When the error rate is high, COSINE can offer substantial advantages over existing alignment methods.
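The idea of comparing two stretches by the distributions of their short k-mers can be illustrated with a plain cosine similarity over k-mer count vectors. This is only a sketch under stated assumptions: the abstract does not specify how COSINE builds or compares its distributions along the sequences, and the helper names `kmer_counts`, `cosine_similarity`, and `context_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count every overlapping k-mer in seq (empty if seq is shorter than k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no k-mers in at least one stretch
    return dot / (norm_a * norm_b)


def context_similarity(x: str, y: str, k: int = 3) -> float:
    """Assumed illustration: similarity of two stretches via their k-mer counts."""
    return cosine_similarity(kmer_counts(x, k), kmer_counts(y, k))


if __name__ == "__main__":
    print(context_similarity("ACGTACGTAC", "ACGTTCGTAC", k=3))
```

Because the comparison uses k-mer composition rather than exact seed matches, a few substitutions or indels change the score gradually instead of destroying it, which is the intuition behind a non-seeding approach.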
Affiliation(s)
- Pegah Tootoonchi Afshar
- Department of Electrical Engineering, School of Engineering, Stanford University, Stanford, CA 94305, USA
- Wing Hung Wong
- Department of Statistics and Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA
4
Qiao W, Takayanagi K, Niu Q, Shofie M, Li YY. Long-term stability of thermophilic co-digestion submerged anaerobic membrane reactor encountering high organic loading rate, persistent propionate and detectable hydrogen in biogas. BIORESOURCE TECHNOLOGY 2013; 149:92-102. [PMID: 24090872 DOI: 10.1016/j.biortech.2013.09.023] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 07/23/2013] [Revised: 09/02/2013] [Accepted: 09/04/2013] [Indexed: 06/02/2023]
Abstract
The performance of thermophilic anaerobic co-digestion of coffee grounds and sludge in a membrane reactor was investigated for 148 days, out of a total research duration of 263 days. The OLR was increased from 2.2 to 33.7 kg-COD/(m³·d) and the HRT was shortened from 70 to 7 days. A significant irreversible drop in pH confirmed the overload of the reactor. Under a moderately high OLR of 23.6 kg-COD/(m³·d), with an HRT of 10 days and influent total solids of 150 g/L, the COD removal efficiency was 44.5%. Hydrogen in the biogas was around 100-200 ppm, which resulted in persistent propionate at 1.0-3.2 g/L. The VFA consumed approximately 60% of the total alkalinity. NH4HCO3 was supplemented to maintain alkalinity. The stability of the system relied on pH management under steady state. The 16S rDNA results showed that hydrogen-utilizing methanogens dominated the archaeal community, whereas the propionate-oxidizing bacteria in the bacterial community were insufficient.
Affiliation(s)
- Wei Qiao
- State Key Laboratory of Heavy Oil Processing, China University of Petroleum, PR China; Department of Civil and Environmental Engineering, Graduate School of Engineering, Tohoku University, Japan.
5
Kaznadzey A, Alexandrova N, Novichkov V, Kaznadzey D. PSimScan: algorithm and utility for fast protein similarity search. PLoS One 2013; 8:e58505. [PMID: 23505522 PMCID: PMC3591303 DOI: 10.1371/journal.pone.0058505] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 01/27/2012] [Accepted: 02/07/2013] [Indexed: 01/19/2023] Open
Abstract
In the era of metagenomics and diagnostics sequencing, the importance of protein comparison methods of boosted performance cannot be overstated. Here we present PSimScan (Protein Similarity Scanner), a flexible open source protein similarity search tool which provides a significant gain in speed compared to BLASTP at the price of controlled sensitivity loss. The PSimScan algorithm introduces a number of novel performance optimization methods that can be further used by the community to improve the speed and lower hardware requirements of bioinformatics software. The optimization starts at the lookup table construction, then the initial lookup table–based hits are passed through a pipeline of filtering and aggregation routines of increasing computational complexity. The first step in this pipeline is a novel algorithm that builds and selects ‘similarity zones’ aggregated from neighboring matches on small arrays of adjacent diagonals. PSimScan performs 5 to 100 times faster than the standard NCBI BLASTP, depending on chosen parameters, and runs on commodity hardware. Its sensitivity and selectivity at the slowest settings are comparable to the NCBI BLASTP’s and decrease with the increase of speed, yet stay at the levels reasonable for many tasks. PSimScan is most advantageous when used on large collections of query sequences. Comparing the entire proteome of Streptococcus pneumoniae (2,042 proteins) to the NCBI’s non-redundant protein database of 16,971,855 records takes 6.5 hours on a moderately powerful PC, while the same task with the NCBI BLASTP takes over 66 hours. We describe innovations in the PSimScan algorithm in considerable detail to encourage bioinformaticians to improve on the tool and to use the innovations in their own software development.
Affiliation(s)
- Anna Kaznadzey
- Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
- Natalia Alexandrova
- Genome Designs, Inc., Walnut Creek, California, United States of America
- Denis Kaznadzey
- DOE Joint Genome Institute, Walnut Creek, California, United States of America
6
Abstract
A hypercomplex representation of DNA is proposed to facilitate comparing DNA sequences with fuzzy composition. With the hypercomplex number representation, conventional sequence analysis methods, such as dot matrix analysis, dynamic programming, and the cross-correlation method, have been extended and improved to align DNA sequences with fuzzy composition. The hypercomplex dot matrix analysis can provide more control over the degree of alignment desired. A new scoring system has been proposed to accommodate the hypercomplex number representation of DNA and integrated with the dynamic programming alignment method. By using hypercomplex cross-correlation, the match and mismatch alignment information between two aligned DNA sequences is stored separately in the resultant real part and imaginary parts, respectively. The mismatch alignment information is very useful for refining consensus-sequence-based motif scanning.
Affiliation(s)
- JIAN-JUN SHU
- School of Mechanical and Aerospace Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
- YAJING LI
- School of Mechanical and Aerospace Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
7
Pizzi C, Rastas P, Ukkonen E. Finding significant matches of position weight matrices in linear time. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:69-79. [PMID: 21071798 DOI: 10.1109/tcbb.2009.35] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 05/30/2023]
Abstract
Position weight matrices are an important method for modeling signals or motifs in biological sequences, both in DNA and protein contexts. In this paper, we present fast algorithms for the problem of finding significant matches of such matrices. Our algorithms are of the online type, and they generalize classical multipattern matching, filtering, and superalphabet techniques of combinatorial string matching to the problem of weight matrix matching. Several variants of the algorithms are developed, including multiple matrix extensions that perform the search for several matrices in one scan through the sequence database. Experimental performance evaluation is provided to compare the new techniques against each other as well as against some other online and index-based algorithms proposed in the literature. Compared to the brute-force O(mn) approach, our solutions can be faster by a factor that is proportional to the matrix length m. Our multiple-matrix filtration algorithm had the best performance in the experiments. On a current PC, this algorithm finds significant matches (p = 0.0001) of the 123 JASPAR matrices in the human genome in about 18 minutes.
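For orientation, the brute-force O(mn) baseline mentioned in the abstract can be sketched as below: every window of length m in a sequence of length n is scored against the matrix column by column. This is not one of the paper's fast algorithms, and the function name `scan_pwm`, the dictionary-per-column matrix layout, and the toy scores are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def scan_pwm(
    sequence: str,
    pwm: List[Dict[str, float]],
    threshold: float,
) -> List[Tuple[int, float]]:
    """Brute-force O(mn) scan: return (position, score) for windows scoring >= threshold.

    pwm[j][c] is the score of symbol c at matrix column j. Symbols missing from a
    column score -inf, so any window containing them can never reach the threshold.
    """
    m = len(pwm)
    hits = []
    for i in range(len(sequence) - m + 1):  # n - m + 1 windows
        score = sum(
            pwm[j].get(sequence[i + j], float("-inf")) for j in range(m)
        )  # m lookups per window
        if score >= threshold:
            hits.append((i, score))
    return hits


if __name__ == "__main__":
    # Toy 3-column matrix that favours the motif "ACG" (values are made up).
    toy_pwm = [
        {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
        {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
        {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
    ]
    print(scan_pwm("TACGACG", toy_pwm, threshold=5.0))
```

The inner sum is what faster methods avoid repeating: filtering and multipattern techniques skip or share work across windows that cannot reach the threshold.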
Affiliation(s)
- Cinzia Pizzi
- Department of Information Engineering, Università degli Studi di Padova, via Gradenigo 6/b, 35131 Padova, Italy.
8
Ye J, Su LH, Chen CL, Hu S, Wang J, Yu J, Chiu CH. Analysis of pSC138, the multidrug resistance plasmid of Salmonella enterica serotype Choleraesuis SC-B67. Plasmid 2010; 65:132-40. [PMID: 21111756 DOI: 10.1016/j.plasmid.2010.11.007] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 10/03/2009] [Revised: 10/19/2010] [Accepted: 11/21/2010] [Indexed: 11/16/2022]
Abstract
Salmonella enterica serotype Choleraesuis (S. Choleraesuis) usually causes systemic infections in man and needs antimicrobial treatment. Multidrug resistance (MDR) in S. Choleraesuis is thus a great concern in the treatment of systemic non-typhoid salmonellosis. A large plasmid, pSC138, was identified in 2002 from a S. Choleraesuis strain SC-B67 that was resistant to all antimicrobial agents commonly used to treat salmonellosis, including ciprofloxacin and ceftriaxone. Complete DNA sequence of the plasmid had been determined previously (Chiu et al., 2005). In the present study, the sequence of pSC138 was reannotated in detail and compared with several newly sequenced plasmids. Some transposable elements and drug resistance genes were further delineated. Plasmid pSC138 was 138,742 bp in length and consisted of 177 open reading frames (ORFs). While 134 of the ORFs displayed significant identity levels to other plasmid and prokaryotic sequences, the remaining 43 ORFs have not been previously reported. Mobile elements, including two integrons, seven insertion sequences and eight transposons, and a truncated prophage together encompass at least 66,781 bp (48.1%) of the plasmid genome. The sequence of pSC138 consists of three major regions: a large composite transposable region Tn6088 with a Tn21-like backbone inserted by a variety of integrons or transposable elements; a transfer/maintenance region that contains a conserved ISEcp1-mediated transposon-like element Tn6092, carrying an AmpC gene, bla(CMY-2), that confers the ceftriaxone resistance; and a Rep_3 type of replication region. Another seven bacteremic strains of S. Choleraesuis that expressed the same MDR phenotype were identified during 2003-2008. The same Rep_3 type replicase and the bla(CMY-2)-containing, ISEcp1-mediated transposon-like element were found in the MDR isolates, suggesting a successful preservation and dissemination of the MDR plasmid. 
Comparison of pSC138 with other recently published plasmids revealed a high identity level between partial sequences of pSC138 and plasmids of the same or different incompatibility groups. The large MDR region found in pSC138 may provide a niche for the future evolution of the plasmid by acquisition of relevant resistance genes through the panoply of mobile elements and illegitimate recombination events.
Affiliation(s)
- Jiehua Ye
- James D. Watson Institute of Genome Sciences, Zhejiang University, Hangzhou, China
9
Beckstette M, Homann R, Giegerich R, Kurtz S. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics 2006; 7:389. [PMID: 16930469 PMCID: PMC1635428 DOI: 10.1186/1471-2105-7-389] [Citation(s) in RCA: 102] [Impact Index Per Article: 5.7] [Received: 04/20/2006] [Accepted: 08/24/2006] [Indexed: 11/10/2022] Open
Abstract
Background In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task. Results We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330. Conclusion Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than $|\mathcal{A}|^{m} + m - 1$, where $m$ is the length of the PSSM and $\mathcal{A}$ a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.
Affiliation(s)
- Michael Beckstette
- International NRW Graduate School in Bioinformatics and Genome Research, Center for Biotechnology (CeBITec), Bielefeld University, D-33594 Bielefeld, Germany
- Technische Fakultät, Universität Bielefeld, Postfach 100 131, D-33501 Bielefeld, Germany
- Robert Homann
- International NRW Graduate School in Bioinformatics and Genome Research, Center for Biotechnology (CeBITec), Bielefeld University, D-33594 Bielefeld, Germany
- Technische Fakultät, Universität Bielefeld, Postfach 100 131, D-33501 Bielefeld, Germany
- Robert Giegerich
- Technische Fakultät, Universität Bielefeld, Postfach 100 131, D-33501 Bielefeld, Germany
- Stefan Kurtz
- Zentrum für Bioinformatik, Universität Hamburg, 20146 Hamburg, Germany
10
Abstract
MOTIVATION Matching a biological sequence against a probabilistic pattern (or profile) is a common task in computational biology. A probabilistic profile, represented as a scoring matrix, is more suitable than a deterministic pattern to retain the peculiarities of a given segment of a family of biological sequences. Brute-force algorithms take O(NP) time to match a sequence of N characters against a profile of length P << N. RESULTS In this work, we exploit string compression techniques to speed up brute-force profile matching. We present two algorithms, based on run-length and LZ78 encodings, that reduce computational complexity by the compression factor of the encoding.
Affiliation(s)
- Valerio Freschi
- Information Science and Technology Institute, University of Urbino, 61029 Urbino, Italy
11
Katoh K, Misawa K, Kuma KI, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002; 30:3059-66. [PMID: 12136088 PMCID: PMC135756 DOI: 10.1093/nar/gkf436] [Citation(s) in RCA: 9373] [Impact Index Per Article: 426.0] [Indexed: 11/14/2022] Open
Abstract
A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i) Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue. (ii) We propose a simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length. Two different heuristics, the progressive method (FFT-NS-2) and the iterative refinement method (FFT-NS-i), are implemented in MAFFT. The performances of FFT-NS-2 and FFT-NS-i were compared with other methods by computer simulations and benchmark tests; the CPU time of FFT-NS-2 is drastically reduced as compared with CLUSTALW with comparable accuracy. FFT-NS-i is over 100 times faster than T-COFFEE when the number of input sequences exceeds 60, without sacrificing accuracy.
Affiliation(s)
- Kazutaka Katoh
- Department of Biophysics, Graduate School of Science, Kyoto University, Kyoto 606-8502, Japan