1
|
João M, Sena AC, Rebello VEF. On closing the inopportune gap with consistency transformation and iterative refinement. PLoS One 2023; 18:e0287483. [PMID: 37440507 PMCID: PMC10343097 DOI: 10.1371/journal.pone.0287483] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Accepted: 06/06/2023] [Indexed: 07/15/2023]
Abstract
The problem of aligning multiple biological sequences has fascinated scientists for a long time. Over the last four decades, tens of heuristic-based Multiple Sequence Alignment (MSA) tools have been proposed, the vast majority being built on the concept of Progressive Alignment. It is known, however, that this approach suffers from an inherent drawback regarding the inadvertent insertion of gaps when aligning sequences. Two well-known corrective solutions have frequently been adopted to help mitigate this: Consistency Transformation and Iterative Refinement. This paper takes a tool-independent technique-oriented look at the alignment quality benefits of these two strategies using problem instances from the HOMSTRAD and BAliBASE benchmarks. Eighty MSA aligners have been used to compare 4 classes of heuristics: Progressive Alignments, Iterative Alignments, Consistency-based Alignments, and Consistency-based Progressive Alignments with Iterative Refinement. Statistically, while both Consistency-based classes are better for alignments with low similarity, for sequences with higher similarity, the differences between the classes are less clear. Iterative Refinement has its own drawbacks resulting in there being statistically little advantage for Progressive Aligners to adopt this technique either with Consistency Transformation or without. Nevertheless, all 4 classes are capable of bettering each other, depending on the instance problem. This further motivates the development of MSA frameworks, such as the one being developed for this research, which simultaneously contemplate multiple classes and techniques in their attempt to uncover better solutions.
Affiliation(s)
- Mario João
  - Medical Sciences College, State University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
  - Institute of Computing, Fluminense Federal University, Niterói, Rio de Janeiro, Brazil
- Alexandre C Sena
  - Institute of Mathematics and Statistics, State University of Rio de Janeiro, Rio de Janeiro, Rio de Janeiro, Brazil
- Vinod E F Rebello
  - Institute of Computing, Fluminense Federal University, Niterói, Rio de Janeiro, Brazil
|
2
|
Daviet B, Fernandez R, Cabrera-Bosquet L, Pradal C, Fournier C. PhenoTrack3D: an automatic high-throughput phenotyping pipeline to track maize organs over time. Plant Methods 2022; 18:130. [PMID: 36482291 PMCID: PMC9730636 DOI: 10.1186/s13007-022-00961-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/20/2022] [Accepted: 11/22/2022] [Indexed: 06/17/2023]
Abstract
BACKGROUND High-throughput phenotyping platforms allow the study of the form and function of a large number of genotypes subjected to different growing conditions (GxE). A number of image acquisition and processing pipelines have been developed to automate this process, for micro-plots in the field and for individual plants in controlled conditions. Capturing shoot development requires extracting from images both the evolution of the 3D plant architecture as a whole, and a temporal tracking of the growth of its organs. RESULTS We propose PhenoTrack3D, a new pipeline to extract a 3D + t reconstruction of maize. It allows the study of plant architecture and individual organ development over time during the entire growth cycle. The method tracks the development of each organ from a time-series of plants whose organs have already been segmented in 3D using existing methods, such as Phenomenal [Artzet et al. in BioRxiv 1:805739, 2019] which was chosen in this study. First, a novel stem detection method based on deep-learning is used to locate precisely the point of separation between ligulated and growing leaves. Second, a new and original multiple sequence alignment algorithm has been developed to perform the temporal tracking of ligulated leaves, which have a consistent geometry over time and an unambiguous topological position. Finally, growing leaves are back-tracked with a distance-based approach. This pipeline is validated on a challenging dataset of 60 maize hybrids imaged daily from emergence to maturity in the PhenoArch platform (ca. 250,000 images). Stem tip was precisely detected over time (RMSE < 2.1 cm). 97.7% and 85.3% of ligulated and growing leaves respectively were assigned to the correct rank after tracking, on 30 plants × 43 dates. The pipeline made it possible to extract various development and architecture traits at organ level, with good correlation to manual observations overall, on random subsets of 10-355 plants.
CONCLUSIONS We developed a novel phenotyping method based on sequence alignment and deep-learning. It allows the development of maize architecture to be characterised at organ level, automatically and at high throughput. It has been validated on hundreds of plants during the entire development cycle, showing its applicability to GxE analyses of large maize datasets.
Affiliation(s)
- Benoit Daviet
  - LEPSE, Univ Montpellier, INRAE, Institut Agro, Montpellier, France
- Romain Fernandez
  - CIRAD, UMR AGAP Institut, 34398, Montpellier, France
  - CIRAD, INRAE, UMR AGAP Institut, Univ Montpellier, Institut Agro, 34398, Montpellier, France
- Christophe Pradal
  - CIRAD, UMR AGAP Institut, 34398, Montpellier, France
  - CIRAD, INRAE, UMR AGAP Institut, Univ Montpellier, Institut Agro, 34398, Montpellier, France
  - Inria & LIRMM, CNRS, Univ Montpellier, Montpellier, France
|
3
|
Zhang Y, Zhang Q, Zhou J, Zou Q. A survey on the algorithm and development of multiple sequence alignment. Brief Bioinform 2022; 23:6546258. [PMID: 35272347 DOI: 10.1093/bib/bbac069] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/09/2021] [Revised: 01/30/2022] [Accepted: 02/09/2022] [Indexed: 12/21/2022]
Abstract
Multiple sequence alignment (MSA) is an essential cornerstone in bioinformatics, which can reveal the potential information in biological sequences, such as function, evolution and structure. MSA is widely used in many bioinformatics scenarios, such as phylogenetic analysis, protein analysis and genomic analysis. However, MSA faces new challenges with the gradual increase in sequence scale and the increasing demand for alignment accuracy. Therefore, developing an efficient and accurate strategy for MSA has become one of the research hotspots in bioinformatics. In this work, we mainly summarize the algorithms for MSA and its applications in bioinformatics. To provide a structured and clear perspective, we systematically introduce MSA's knowledge, including background, database, metric and benchmark. Besides, we list the most common applications of MSA in the field of bioinformatics, including database searching, phylogenetic analysis, genomic analysis, metagenomic analysis and protein analysis. Furthermore, we categorize and analyze classical and state-of-the-art algorithms, divided into progressive alignment, iterative algorithm, heuristics, machine learning and divide-and-conquer. Moreover, we also discuss the challenges and opportunities of MSA in bioinformatics. Our work provides a comprehensive survey of MSA applications and their relevant algorithms. It could bring valuable insights for researchers to contribute their knowledge to MSA and relevant studies.
Affiliation(s)
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
  - School of Computer Science and Engineering, University of Electronic Science and Technology of China, 611731, Chengdu, China
- Qiang Zhang
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Jiliu Zhou
  - School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, China
|
4
|
Slipknot or Crystallographic Error: A Computational Analysis of the Plasmodium falciparum DHFR Structural Folds. Int J Mol Sci 2022; 23:ijms23031514. [PMID: 35163439 PMCID: PMC8835989 DOI: 10.3390/ijms23031514] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/24/2021] [Revised: 01/21/2022] [Accepted: 01/25/2022] [Indexed: 01/12/2023]
Abstract
The presence of protein structures with atypical folds in the Protein Data Bank (PDB) is rare and may result from naturally occurring knots or crystallographic errors. Proper characterisation of such folds is imperative to understanding the basis of naturally existing knots and correcting crystallographic errors. If left uncorrected, such errors can frustrate downstream experiments that depend on the structures containing them. An atypical fold has been identified in P. falciparum dihydrofolate reductase (PfDHFR) between residues 20–51 (loop 1) and residues 191–205 (loop 2). This enzyme is key to drug discovery efforts in the parasite, necessitating a thorough characterisation of these folds. Using multiple sequence alignments (MSA), a unique insert was identified in loop 1 that exacerbates the appearance of the atypical fold, giving it a slipknot-like topology. However, PfDHFR has not been deposited in the knotted proteins database, and processing its structure failed to identify any knots within its folds. The application of protein homology modelling and molecular dynamics simulations on the DHFR domain of P. falciparum and those of two other organisms (E. coli and M. tuberculosis) that were used as molecular replacement templates in solving the PfDHFR structure revealed plausible unentangled or open conformations of these loops. These results will serve as guides for crystallographic experiments to provide further insights into the atypical folds identified.
|
5
|
Biological sequence analysis. Bioinformatics 2022. [DOI: 10.1016/b978-0-323-89775-4.00003-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
|
6
|
Li Y. Sequence Alignment with Q-Learning Based on the Actor-Critic Model. ACM T ASIAN LOW-RESO 2021. [DOI: 10.1145/3433540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
Abstract
Multiple sequence alignment methods refer to a series of algorithmic solutions for the alignment of evolutionary-related sequences while taking into account evolutionary events such as mutations, insertions, deletions, and rearrangements under certain conditions. In this article, we propose a method with Q-learning based on the Actor-Critic model for sequence alignment. We transform the sequence alignment problem into an agent's autonomous learning process. In this process, the reward of the possible next action taken is calculated, and the cumulative reward of the entire process is calculated. The results show that the method we propose is better than the genetic algorithm and the dynamic programming method.
Affiliation(s)
- Yarong Li
  - The Experimental High School Attached to Beijing Normal University, Beijing, China
|
7
|
Hu T, Li J, Zhou H, Li C, Holmes EC, Shi W. Bioinformatics resources for SARS-CoV-2 discovery and surveillance. Brief Bioinform 2021; 22:631-641. [PMID: 33416890 PMCID: PMC7929396 DOI: 10.1093/bib/bbaa386] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 08/10/2020] [Revised: 11/10/2020] [Accepted: 11/27/2020] [Indexed: 12/22/2022]
Abstract
In early January 2020, the novel coronavirus (SARS-CoV-2) responsible for a pneumonia outbreak in Wuhan, China, was identified using next-generation sequencing (NGS) and readily available bioinformatics pipelines. In addition to virus discovery, these NGS technologies and bioinformatics resources are currently being employed for ongoing genomic surveillance of SARS-CoV-2 worldwide, tracking its spread, evolution and patterns of variation on a global scale. In this review, we summarize the bioinformatics resources used for the discovery and surveillance of SARS-CoV-2. We also discuss the advantages and disadvantages of these bioinformatics resources and highlight areas where additional technical developments are urgently needed. Solutions to these problems will be beneficial not only to the prevention and control of the current COVID-19 pandemic but also to infectious disease outbreaks of the future.
Affiliation(s)
- Tao Hu
  - Shandong First Medical University, China
- Juan Li
  - Shandong First Medical University, China
- Hong Zhou
  - Shandong First Medical University, China
- Cixiu Li
  - Shandong First Medical University, China
|
8
|
Paul L, Mudogo CN, Mtei KM, Machunda RL, Ntie-Kang F. A computer-based approach for developing linamarase inhibitory agents. Physical Sciences Reviews 2020. [DOI: 10.1515/psr-2019-0098] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/15/2022]
Abstract
Cassava is a strategic crop, especially for developing countries. However, the presence of cyanogenic compounds in cassava products limits proper nutrient utilization. The poor availability of structure discovery and elucidation in the Protein Data Bank is limiting the full understanding of the enzyme, how to inhibit it, and its applications in different fields. There is a need to solve the three-dimensional (3-D) structure of linamarase from cassava. The structural elucidation will allow the development of a competitive inhibitor and various industrial applications of the enzyme. The goal of this review is to summarize and present the available 3-D modeling structure of the linamarase enzyme using different computational strategies. This approach could help in determining the structure of linamarase and later guide the structure elucidation in silico and experimentally.
Affiliation(s)
- Lucas Paul
  - The Department of Materials and Energy Science & Engineering, The Nelson Mandela African Institution of Science and Technology, P.O. Box 447, Arusha, Tanzania
  - Department of Chemistry, Dar es Salaam University College of Education, P.O. Box 2329, 255 Dar es Salaam, Tanzania
- Celestin N. Mudogo
  - Biochemistry and Molecularbiology, University of Hamburg Institute of Biochemistry and Molecularbiology, Hamburg, Germany
  - Department of Basic Sciences, School of Medicine, University of Kinshasa, Kinshasa, Congo (Democratic Republic of the)
- Kelvin M. Mtei
  - The Department of Water and Environmental Science and Engineering, The Nelson Mandela African Institution of Science and Technology, P.O. Box 447, Arusha, Tanzania
- Revocatus L. Machunda
  - The Department of Water and Environmental Science and Engineering, The Nelson Mandela African Institution of Science and Technology, P.O. Box 447, Arusha, Tanzania
- Fidele Ntie-Kang
  - Department of Pharmaceutical Chemistry, Martin-Luther University Halle-Wittenberg, Wolfgang-Langenbeck Str. 4, Halle (Saale) 06120, Germany
  - Department of Informatics and Chemistry, University of Chemistry and Technology Prague, Technická 5, Prague 6, Dejvice 166 28, Czech Republic
  - Department of Chemistry, University of Buea, P.O. Box 63, Buea, Cameroon
|
9
|
A bi-objective function optimization approach for multiple sequence alignment using genetic algorithm. Soft comput 2020. [DOI: 10.1007/s00500-020-04917-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 01/04/2023]
|
10
|
|
11
|
Abstract
Phenotypic sequences are a type of multivariate trait organized structurally, such as teeth distributed along the dental arch, or temporally, such as the stages of an ontogenetic series. Unlike other multivariate traits, the elements of a phenotypic sequence are distributed along an ordered set, which allows for distinct evolutionary patterns between neighboring and distant positions. In fact, sequence traits share many characteristics with molecular sequences, although important distinctions pose challenges to current comparative methods. We implement an approach to estimate rates of trait evolution that explicitly incorporates the sequence organization of traits. We apply models to study the temporal pattern evolution of cricket calling songs. We test whether neighboring positions along a phenotypic sequence have correlated rates of evolution or whether rate variation is independent of sequence position. Our results show that cricket song evolution is strongly autocorrelated and that models perform well when used with sequence phenotypes even under small sample sizes. Our approach is flexible and can be applied to any multivariate trait with discrete units organized in a sequence-like structure.
|
12
|
Daoud M. The extension of the largest generalized-eigenvalue based distance metric $D_{ij}(\gamma_1)$ in arbitrary feature spaces to classify composite data points. Genomics Inform 2019; 17:e39. [PMID: 31896239 PMCID: PMC6944050 DOI: 10.5808/gi.2019.17.4.e39] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/31/2019] [Revised: 10/14/2019] [Accepted: 10/14/2019] [Indexed: 11/20/2022]
Abstract
Analyzing patterns in data points embedded in linear and non-linear feature spaces is considered as one of the common research problems among different research areas, for example: data mining, machine learning, pattern recognition, and multivariate analysis. In this paper, data points are heterogeneous sets of biosequences (composite data points). A composite data point is a set of ordinary data points (e.g., set of feature vectors). We theoretically extend the derivation of the largest generalized eigenvalue-based distance metric $D_{ij}(\gamma_1)$ in any linear and non-linear feature spaces. We prove that $D_{ij}(\gamma_1)$ is a metric under any linear and non-linear feature transformation function. We show the sufficiency and efficiency of using the decision rule $\bar{\delta}_{\Xi_i}$ (i.e., mean of $D_{ij}(\gamma_1)$) in classification of heterogeneous sets of biosequences compared with the decision rules $\min_{\Xi_i}$ and $\mathrm{median}_{\Xi_i}$. We analyze the impact of linear and non-linear transformation functions on classifying/clustering collections of heterogeneous sets of biosequences. The impact of the length of a sequence in a heterogeneous sequence-set generated by simulation on the classification and clustering results in linear and non-linear feature spaces is empirically shown in this paper. We propose a new concept: the limiting dispersion map of the existing clusters in heterogeneous sets of biosequences embedded in linear and nonlinear feature spaces, which is based on the limiting distribution of nucleotide compositions estimated from real data sets. Finally, the empirical conclusions and the scientific evidences are deduced from the experiments to support the theoretical side stated in this paper.
Affiliation(s)
- Mosaab Daoud
  - Independent Research Scientist, Toronto, ON M1S1G2, Canada
|
13
|
ERES: an extended regular expression signature for polymorphic worm detection. Journal of Computer Virology and Hacking Techniques 2019. [DOI: 10.1007/s11416-019-00330-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/26/2022]
|
14
|
Abstract
Background Protein sequence alignment analyses have become a crucial step for many bioinformatics studies during the past decades. Multiple sequence alignment (MSA) and pair-wise sequence alignment (PSA) are two major approaches in sequence alignment. Former benchmark studies revealed drawbacks of MSA methods on nucleotide sequence alignments. To test whether similar drawbacks also influence protein sequence alignment analyses, we propose a new benchmark framework for protein clustering based on cluster validity. This new framework directly reflects the biological ground truth of the application scenarios that adopt sequence alignments, and evaluates the alignment quality according to the achievement of the biological goal, rather than the comparison on sequence level only, which averts the biases introduced by alignment scores or manual alignment templates. Compared with former studies, we calculate the cluster validity score based on sequence distances instead of clustering results. This strategy could avoid the influence brought by different clustering methods thus make results more dependable. Results Results showed that PSA methods performed better than MSA methods on most of the BAliBASE benchmark datasets. Analyses on the 80 re-sampled benchmark datasets constructed by randomly choosing 90% of each dataset 10 times showed similar results. Conclusions These results validated that the drawbacks of MSA methods revealed in nucleotide level also existed in protein sequence alignment analyses and affect the accuracy of results. Electronic supplementary material The online version of this article (10.1186/s12859-018-2524-4) contains supplementary material, which is available to authorized users.
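The re-sampling step mentioned in the Results (randomly choosing 90% of each dataset 10 times) can be sketched as follows. This is a minimal illustration only: the function name `resample`, the fixed seed, and the toy dataset are assumptions made here and are not taken from the paper.

```python
import random

def resample(dataset, fraction=0.9, repeats=10, seed=0):
    """Draw `repeats` random subsets, each holding `fraction` of the dataset.

    Sampling is without replacement inside each subset. The seed is fixed
    only so that this sketch is reproducible (an assumption, not the
    authors' protocol).
    """
    rng = random.Random(seed)
    k = round(len(dataset) * fraction)
    return [rng.sample(dataset, k) for _ in range(repeats)]

# Toy stand-in for one benchmark dataset of 20 sequences (hypothetical IDs).
toy_dataset = [f"seq{i}" for i in range(20)]
subsets = resample(toy_dataset)
print(len(subsets), len(subsets[0]))
```

With 10 subsets per dataset, the 80 re-sampled datasets reported in the abstract would correspond to 8 source datasets.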
Affiliation(s)
- Yingying Wang
  - Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
- Hongyan Wu
  - Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
- Yunpeng Cai
  - Research Center for Biomedical Information Technology, Shenzhen Institutes of Advanced Technologies, Chinese Academy of Sciences, Shenzhen, China
|
15
|
Rubio-Largo Á, Vanneschi L, Castelli M, Vega-Rodríguez MA. Multiobjective characteristic-based framework for very-large multiple sequence alignment. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2017.06.022] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 10/19/2022]
|
16
|
Rubio-Largo Á, Castelli M, Vanneschi L, Vega-Rodríguez MA. A Parallel Multiobjective Metaheuristic for Multiple Sequence Alignment. J Comput Biol 2018; 25:1009-1022. [PMID: 29671616 DOI: 10.1089/cmb.2018.0031] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
Abstract
The alignment among three or more nucleotides/amino acids sequences at the same time is known as multiple sequence alignment (MSA), a nondeterministic polynomial time (NP)-hard optimization problem. The time complexity of finding an optimal alignment rises exponentially when the number of sequences to align increases. In this work, we deal with a multiobjective version of the MSA problem wherein the goal is to simultaneously optimize the accuracy and conservation of the alignment. A parallel version of the hybrid multiobjective memetic metaheuristics for MSA is proposed. To evaluate the parallel performance of our proposal, we have selected a pool of data sets with different numbers of sequences (up to 1000 sequences) and study its parallel performance against other well-known parallel metaheuristics published in the literature, such as MSAProbs, tree-based consistency objective function for alignment evaluation (T-Coffee), Clustal Ω, and multiple alignment using fast Fourier transform (MAFFT). The comparative study reveals that our parallel aligner obtains better results than MSAProbs, T-Coffee, Clustal Ω, and MAFFT. In addition, the parallel version is around 25 times faster than the sequential version with 32 cores, obtaining an efficiency around 80%.
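The closing figures can be cross-checked against the standard definition of parallel efficiency, speedup divided by the number of cores. The sketch below uses only the rounded values quoted in the abstract; the function name is chosen here for illustration.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Parallel efficiency = speedup / number of cores (standard definition)."""
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    return speedup / cores

# Rounded values quoted in the abstract: ~25x speedup on 32 cores.
efficiency = parallel_efficiency(25.0, 32)
print(efficiency)
```

The result, 25/32 ≈ 0.78, is consistent with the "around 80%" stated in the abstract.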
Affiliation(s)
- Mauro Castelli
  - NOVA IMS, Universidade Nova de Lisboa, Lisbon, Portugal
- Miguel A Vega-Rodríguez
  - Department of Computer and Communications Technologies, University of Extremadura, Caceres, Spain
|
17
|
Ali MO, El-Adl MA, Ibrahim HMM, Elseedy YY, Rizk MA, El-Khodery SA. Molecular characterization of the vitamin D receptor (VDR) gene in Holstein cows. Res Vet Sci 2018; 118:146-150. [PMID: 29433008 DOI: 10.1016/j.rvsc.2018.02.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/19/2016] [Revised: 01/31/2018] [Accepted: 02/03/2018] [Indexed: 11/28/2022]
Abstract
Vitamin D plays a vital role in calcium homeostasis, growth, and immunoregulation. Because little is known about the vitamin D receptor (VDR) gene in cattle, the aim of the present investigation was to present the molecular characterization of exons 5 and 6 of the VDR gene in Holstein cows. DNA extraction, genomic sequencing, phylogenetic analysis, synteny mapping and single nucleotide gene polymorphism analysis of the VDR gene were performed to assess blood samples collected from 50 clinically healthy Holstein cows. The results revealed the presence of a 450-base pair (bp) nucleotide sequence that resembled exons 5 and 6 with intron 5 enclosed between these exons. Sequence alignment and phylogenetic analysis revealed a close relationship between the sequenced VDR region and that found in Hereford cattle. A close association between this region and the corresponding region in small ruminants was also documented. Moreover, a single nucleotide polymorphism (SNP) that caused the replacement of a glutamate with an arginine in the deduced amino acid sequence was detected at position 7 of exon 5. In conclusion, Holstein and Hereford cattle differ with respect to exon 5 of the VDR gene. Phylogenetic analysis of the VDR gene based on nucleotide sequence produced different results from prior analyses based on amino acid sequence.
Affiliation(s)
- Mayar O Ali
  - Department of Animal Genetics, Faculty of Veterinary Medicine, Mansoura University, Mansoura 35516, Egypt
- Mohamed A El-Adl
  - Department of Biochemistry, Faculty of Veterinary Medicine, Mansoura University, Mansoura 35516, Egypt
- Hussam M M Ibrahim
  - Department of Internal Medicine and Infectious Diseases, Faculty of Veterinary Medicine, Mansoura University, Mansoura 35516, Egypt
- Youssef Y Elseedy
  - Department of Physiology, Faculty of Veterinary Medicine, Mansoura University, Mansoura 35516, Egypt
- Mohamed A Rizk
  - Department of Internal Medicine and Infectious Diseases, Faculty of Veterinary Medicine, Mansoura University, Mansoura 35516, Egypt
- Sabry A El-Khodery
  - Department of Internal Medicine and Infectious Diseases, Faculty of Veterinary Medicine, Mansoura University, Mansoura 35516, Egypt
|
18
|
Disease Sequences High-Accuracy Alignment Based on the Precision Medicine. BioMed Research International 2018; 2018:1718046. [PMID: 29682519 PMCID: PMC5842723 DOI: 10.1155/2018/1718046] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/22/2017] [Accepted: 01/18/2018] [Indexed: 11/18/2022]
Abstract
High-accuracy alignment of sequences with disease information contributes to disease treatment and prevention. The results of multiple sequence alignment depend on the parameters of the objective function, including gap open penalties (GOP), gap extension penalties (GEP), and substitution matrix (SM). Firstly, the theory parameter formulas relating to GOP, GEP, and SM are inferred, combining unaligned sequence length, number, and identity. Secondly, we tested the rationality of the theory parameter formulas, with experiments on the ClustalW and MAFFT programs. In addition, we obtained a group of MAFFT program parameters according to the formulas proposed. The results of all experiments show that the SPS (sum-of-pair score) obtained from theory parameters is better than the SPS obtained from the default parameters of ClustalW and MAFFT. In both theory and practice, our method to determine the parameters is feasible and efficient. These can provide high-accuracy alignment results for precision medicine.
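To make the role of the three parameters concrete, here is a minimal sketch of a sum-of-pairs objective with affine gap penalties. It is a generic textbook-style formulation, not the formulas derived in the paper: the match/mismatch values stand in for a real substitution matrix, the penalty values are arbitrary, and the convention that the first gap column costs `gop` and each further column costs `gep` is an assumption (tools differ on this).

```python
from itertools import combinations

def pair_score(row_a, row_b, match=1.0, mismatch=-1.0, gop=2.0, gep=1.0):
    """Score two aligned rows under a toy substitution model and affine gaps."""
    score = 0.0
    gap_side = None  # which row holds the currently open gap run, if any
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # gap-gap columns vanish in the pairwise projection
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Extending a run on the same side costs gep, otherwise gop.
            score -= gep if gap_side == side else gop
            gap_side = side
        else:
            score += match if x == y else mismatch
            gap_side = None
    return score

def sum_of_pairs(alignment, **params):
    """Sum the pairwise scores over every pair of rows in the alignment."""
    return sum(pair_score(a, b, **params) for a, b in combinations(alignment, 2))

toy_alignment = ["AC-GT", "ACGGT", "A--GT"]  # hypothetical example rows
total = sum_of_pairs(toy_alignment)
print(total)
```

Raising `gop` relative to `gep` favours fewer, longer gaps, which is why the choice of these parameters changes which alignment an aligner prefers.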
|
19
|
Abstract
With the number of sequenced genomes increasing rapidly, it is impractical to perform functional and structural analyses on all individual proteins. Phylogenetic analysis employs a combination of molecular and statistical approaches to infer or estimate relationships among individuals. It provides a credible method to explore the relationship between sequence similarity and function of proteins belonging to the same family. This chapter describes a standardized framework of phylogenetic analysis to study large protein families. Bioinformatic approaches and online tools used in phylogenetic analyses are presented.
Affiliation(s)
- Letian Song
  - Centre for Structural and Functional Genomics, Concordia University, 7141 Sherbrooke Street West, Montreal, H4B 1R6, Quebec, Canada
- Sherry Wu
  - Centre for Structural and Functional Genomics, Concordia University, 7141 Sherbrooke Street West, Montreal, H4B 1R6, Quebec, Canada
- Adrian Tsang
  - Centre for Structural and Functional Genomics, Concordia University, Montreal, QC, Canada
|
20
|
Rubio-Largo A, Vanneschi L, Castelli M, Vega-Rodriguez MA. A Characteristic-Based Framework for Multiple Sequence Aligners. IEEE Transactions on Cybernetics 2018; 48:41-51. [PMID: 27831898 DOI: 10.1109/tcyb.2016.2621129] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 06/06/2023]
Abstract
Multiple sequence alignment is a well-known bioinformatics problem that consists in the alignment of three or more biological sequences (protein or nucleic acid). In the literature, a number of tools have been proposed for dealing with this biological sequence alignment problem, such as progressive methods, consistency-based methods, or iterative methods, among others. These aligners often use a default parameter configuration for all the input sequences to align. However, the default configuration is not always the best choice: the alignment accuracy of the tool may be highly boosted if specific parameter configurations are used, depending on the biological characteristics of the input sequences. In this paper, we propose a characteristic-based framework for multiple sequence aligners. The idea of the framework is, given an input set of unaligned sequences, to extract its characteristics and run the aligner with the best parameter configuration found for another set of unaligned sequences with similar characteristics. In order to test the framework, we have used the well-known multiple sequence comparison by log-expectation (MUSCLE) v3.8 aligner with different benchmarks, such as benchmark alignments database v3.0, protein reference alignment benchmark v4.0, and sequence alignment benchmark v1.65. The results showed that the alignment accuracy and conservation of MUSCLE might be greatly improved with the proposed framework, especially in those scenarios with a low percentage of identity. The characteristic-based framework for multiple sequence aligners is freely available for downloading at http://arco.unex.es/arl/fwk-msa/cbf-msa.zip.
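The core idea described above (extract characteristics, then reuse the best configuration known for a similar input) can be sketched as a nearest-neighbour lookup. Everything specific here is an assumption made for illustration: the two features, the Euclidean distance, the stored entries, and the configuration labels are hypothetical and do not come from the framework or from MUSCLE.

```python
import math

def characteristics(seqs):
    """Hypothetical feature vector: (number of sequences, mean length)."""
    return (float(len(seqs)), sum(len(s) for s in seqs) / len(seqs))

def nearest_config(seqs, knowledge_base):
    """Return the stored configuration whose features are closest to the input.

    `knowledge_base` is a list of (feature_vector, configuration) pairs;
    closeness is plain Euclidean distance (an illustrative choice).
    """
    target = characteristics(seqs)
    _, config = min(knowledge_base, key=lambda entry: math.dist(entry[0], target))
    return config

# Hypothetical previously tuned inputs and their best-known configurations.
knowledge_base = [
    ((4.0, 100.0), "config_small_short"),
    ((50.0, 300.0), "config_large_long"),
]
query = ["ACGT" * 25] * 5  # five toy sequences of length 100
chosen = nearest_config(query, knowledge_base)
print(chosen)
```

In practice the features would need scaling before distances are compared, since sequence counts and lengths live on different ranges; this sketch leaves that out for brevity.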
|
21
|
Chowdhury B, Garai G. A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics 2017; 109:419-431. [PMID: 28669847 DOI: 10.1016/j.ygeno.2017.06.007] [Citation(s) in RCA: 45] [Impact Index Per Article: 6.4] [Received: 04/26/2017] [Revised: 05/27/2017] [Accepted: 06/27/2017] [Indexed: 01/04/2023]
Abstract
Sequence alignment is an active research area in the field of bioinformatics. It is also a crucial task as it guides many other tasks like phylogenetic analysis and function and/or structure prediction of biological macromolecules like DNA, RNA, and protein. Proteins are the building blocks of every living organism. Although the protein alignment problem has been studied for several decades, the available methods still produce different alignment results for the same alignment problem. Multiple sequence alignment is a problem of very high computational complexity. Many stochastic methods have therefore been considered for improving alignment accuracy. Among them, the genetic algorithm is frequently used. In this study, we present the different types of methods applied to alignment and the recent trends in multiobjective genetic algorithms for solving multiple sequence alignment. Many recent studies have demonstrated considerable progress in alignment accuracy.
Affiliation(s)
- Biswanath Chowdhury: Department of Biophysics, Molecular Biology and Bioinformatics, University of Calcutta, Kolkata, WB, 700009, India.
- Gautam Garai: Computational Sciences Division, Saha Institute of Nuclear Physics, Kolkata, WB 700064, India.
|
22
|
Guo D, Yuan E, Hu X, Wu X. Co-occurrence pattern mining based on a biological approximation scoring matrix. Pattern Anal Appl 2017. [DOI: 10.1007/s10044-017-0609-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/20/2022]
|
23
|
Abstract
Ernst Haeckel based his landmark Tree of Life on the supposed ontogenic recapitulation of phylogeny, i.e. that successive embryonic stages during the development of an organism re-trace the morphological forms of its ancestors over the course of evolution. Much of this idea has since been discredited. Today, phylogenies are often based on families of molecular sequences. The standard approach starts with a multiple sequence alignment, in which the sequences are arranged relative to each other in a way that maximises a measure of similarity position-by-position along their entire length. A tree (or sometimes a network) is then inferred. Rigorous multiple sequence alignment is computationally demanding, and evolutionary processes that shape the genomes of many microbes (bacteria, archaea and some morphologically simple eukaryotes) can add further complications. In particular, recombination, genome rearrangement and lateral genetic transfer undermine the assumptions that underlie multiple sequence alignment, and imply that a tree-like structure may be too simplistic. Here, using genome sequences of 143 bacterial and archaeal genomes, we construct a network of phylogenetic relatedness based on the number of shared k-mers (subsequences of fixed length k). Our findings suggest that the network captures not only key aspects of microbial genome evolution as inferred from a tree, but also features that are not treelike. The method is highly scalable, allowing for investigation of genome evolution across a large number of genomes. Instead of using specific regions or sequences from genome sequences, or indeed Haeckel’s idea of ontogeny, we argue that genome phylogenies can be inferred using k-mers from whole-genome sequences. Representing these networks dynamically allows biological questions of interest to be formulated and addressed quickly and in a visually intuitive manner.
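As a rough illustration of the idea of relating genomes by shared k-mers, the following Python sketch counts distinct k-mers common to each pair of sequences and keeps the pairs as weighted edges. The function names, the `min_shared` threshold and the use of a plain count as the edge weight are assumptions made for this example; the abstract does not specify the exact relatedness measure the authors used.

```python
def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_count(a, b, k):
    """Number of distinct k-mers that occur in both sequences."""
    return len(kmers(a, k) & kmers(b, k))


def relatedness_edges(genomes, k, min_shared=1):
    """Build weighted edges (name1, name2, shared) for every pair of
    genomes that shares at least min_shared distinct k-mers.

    genomes: dict mapping a (hypothetical) genome name to its sequence.
    """
    names = sorted(genomes)
    edges = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            shared = shared_kmer_count(genomes[x], genomes[y], k)
            if shared >= min_shared:
                edges.append((x, y, shared))
    return edges
```

For whole genomes a set-based count like this would likely need a more memory-efficient k-mer index, but the pairwise structure of the network is the same.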
Affiliation(s)
- Guillaume Bernard: Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
- Mark A Ragan: Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
- Cheong Xin Chan: Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
|
24
|
Rubio-Largo Á, Vega-Rodríguez MA, González-Álvarez DL. Hybrid multiobjective artificial bee colony for multiple sequence alignment. Appl Soft Comput 2016. [DOI: 10.1016/j.asoc.2015.12.034] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Indexed: 10/22/2022]
|
25
|
Potha N, Maragoudakis M, Lyras D. A biology-inspired, data mining framework for extracting patterns in sexual cyberbullying data. Knowl Based Syst 2016. [DOI: 10.1016/j.knosys.2015.12.021] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 12/23/2022]
|
26
|
Zhu H, He Z, Jia Y. A Novel Approach to Multiple Sequence Alignment Using Multiobjective Evolutionary Algorithm Based on Decomposition. IEEE J Biomed Health Inform 2016; 20:717-27. [DOI: 10.1109/jbhi.2015.2403397] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Indexed: 11/08/2022]
|
27
|
Al-Shatnawi M, Ahmad MO, Swamy MNS. MSAIndelFR: a scheme for multiple protein sequence alignment using information on indel flanking regions. BMC Bioinformatics 2015; 16:393. [PMID: 26597571 PMCID: PMC4657235 DOI: 10.1186/s12859-015-0826-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/09/2015] [Accepted: 11/14/2015] [Indexed: 11/16/2022] Open
Abstract
Background The alignment of multiple protein sequences is one of the most commonly performed tasks in bioinformatics. In spite of the considerable research effort recently devoted to improving the performance of multiple sequence alignment (MSA) algorithms, finding a highly accurate alignment between multiple protein sequences is still a challenging problem. Results We propose a novel and efficient algorithm called MSAIndelFR for multiple sequence alignment using the information on the predicted locations of IndelFRs and the computed average log-loss values obtained from IndelFR predictors, each of which is designed for a different protein fold. We demonstrate that the introduction of a new variable gap penalty function based on the predicted locations of the IndelFRs and the computed average log-loss values into the proposed algorithm substantially improves the protein alignment accuracy. This is illustrated by evaluating the performance of the algorithm in aligning sequences belonging to the protein folds for which the IndelFR predictors already exist and by using the reference alignments of the four popular benchmarks, BAliBASE 3.0, OXBENCH, PREFAB 4.0, and SABRE (SABmark 1.65). Conclusions We have proposed a novel and efficient algorithm, the MSAIndelFR algorithm, for multiple protein sequence alignment incorporating a new variable gap penalty function. It is shown that the performance of the proposed algorithm is superior to that of the most widely used alignment algorithms, Clustal W2, Clustal Omega, Kalign2, MSAProbs, MAFFT, MUSCLE, ProbCons and Probalign, in terms of both the sum-of-pairs and total column metrics. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0826-3) contains supplementary material, which is available to authorized users.
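The sum-of-pairs and total column metrics mentioned above compare a test alignment against a reference alignment. The Python sketch below shows one common way to compute them; it is an illustrative assumption rather than the scoring code used in the paper. It assumes both alignments list the same sequences in the same order, uses `-` as the gap symbol, scores every reference column (benchmarks often restrict scoring to core columns), and assumes the reference contains at least one aligned residue pair.

```python
from itertools import combinations


def columns(alignment, gap="-"):
    """For each column, record the residue index of every sequence
    (None where that sequence has a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for s, row in enumerate(alignment):
            if row[c] == gap:
                col.append(None)
            else:
                col.append(counters[s])
                counters[s] += 1
        cols.append(tuple(col))
    return cols


def aligned_pairs(cols):
    """Set of (seq_i, residue_i, seq_j, residue_j) placed in the same column."""
    pairs = set()
    for col in cols:
        for (i, a), (j, b) in combinations(enumerate(col), 2):
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs


def sp_and_tc(test, reference):
    """Return (sum-of-pairs score, total column score) of test vs reference."""
    ref_cols = columns(reference)
    test_cols = columns(test)
    ref_pairs = aligned_pairs(ref_cols)
    # SP: fraction of reference residue pairs reproduced by the test alignment.
    sp = len(ref_pairs & aligned_pairs(test_cols)) / len(ref_pairs)
    # TC: fraction of reference columns reproduced exactly.
    test_col_set = set(test_cols)
    tc = sum(col in test_col_set for col in ref_cols) / len(ref_cols)
    return sp, tc
```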
Affiliation(s)
- Mufleh Al-Shatnawi: Department of Electrical and Computer Engineering, Concordia University, 1455 De Maisonneuve Blvd. W., Montreal, H3G 1M8, Quebec, Canada.
- M Omair Ahmad: Department of Electrical and Computer Engineering, Concordia University, 1455 De Maisonneuve Blvd. W., Montreal, H3G 1M8, Quebec, Canada.
- M N S Swamy: Department of Electrical and Computer Engineering, Concordia University, 1455 De Maisonneuve Blvd. W., Montreal, H3G 1M8, Quebec, Canada.
|
28
|
Andreakis N, Høj L, Kearns P, Hall MR, Ericson G, Cobb RE, Gordon BR, Evans-Illidge E. Diversity of Marine-Derived Fungal Cultures Exposed by DNA Barcodes: The Algorithm Matters. PLoS One 2015; 10:e0136130. [PMID: 26308620 PMCID: PMC4550264 DOI: 10.1371/journal.pone.0136130] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 09/10/2014] [Accepted: 07/29/2015] [Indexed: 01/11/2023] Open
Abstract
Marine fungi are an understudied group of eukaryotic microorganisms characterized by unresolved genealogies and unstable classification. Whereas DNA barcoding via the nuclear ribosomal internal transcribed spacer (ITS) provides a robust and rapid tool for fungal species delineation, accurate classification of fungi is often arduous given the large number of partial or unknown barcodes and misidentified isolates deposited in public databases. This situation is perpetuated by a paucity of cultivable fungal strains available for phylogenetic research linked to these data sets. We analyze ITS barcodes produced from a subsample (290) of 1781 cultured isolates of marine-derived fungi in the Bioresources Library located at the Australian Institute of Marine Science (AIMS). Our analysis revealed high levels of under-explored fungal diversity. The majority of isolates were ascomycetes including representatives of the subclasses Eurotiomycetidae, Hypocreomycetidae, Sordariomycetidae, Pleosporomycetidae, Dothideomycetidae, Xylariomycetidae and Saccharomycetidae. The phylum Basidiomycota was represented by isolates affiliated with the genera Tritirachium and Tilletiopsis. BLAST searches revealed 26 unknown OTUs and 50 isolates corresponding to previously uncultured, unidentified fungal clones. This study makes a significant addition to the availability of barcoded, culturable marine-derived fungi for detailed future genomic and physiological studies. We also demonstrate the influence of commonly used alignment algorithms and genetic distance measures on the accuracy and comparability of estimating Operational Taxonomic Units (OTUs) by the automatic barcode gap finder (ABGD) method. Large scale biodiversity screening programs that combine datasets using algorithmic OTU delineation pipelines need to ensure compatible algorithms have been used because the algorithm matters.
Affiliation(s)
- Nikos Andreakis: Australian Institute of Marine Science, PMB 3, Townsville, Queensland, 4810, Australia
- Lone Høj: Australian Institute of Marine Science, PMB 3, Townsville, Queensland, 4810, Australia
- Philip Kearns: Australian Institute of Marine Science, PMB 3, Townsville, Queensland, 4810, Australia
- Michael R. Hall: Australian Institute of Marine Science, PMB 3, Townsville, Queensland, 4810, Australia
- Gavin Ericson: Australian Institute of Marine Science, PMB 3, Townsville, Queensland, 4810, Australia
- Rose E. Cobb: Australian Institute of Marine Science, PMB 3, Townsville, Queensland, 4810, Australia
- Benjamin R. Gordon: Australian Institute of Marine Science, PMB 3, Townsville, Queensland, 4810, Australia
|
29
|
Garai G, Chowdhury B. A cascaded pairwise biomolecular sequence alignment technique using evolutionary algorithm. Inf Sci (N Y) 2015. [DOI: 10.1016/j.ins.2014.11.009] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/24/2022]
|
30
|
Protein sectors: statistical coupling analysis versus conservation. PLoS Comput Biol 2015; 11:e1004091. [PMID: 25723535 PMCID: PMC4344308 DOI: 10.1371/journal.pcbi.1004091] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.8] [Received: 09/22/2014] [Accepted: 12/15/2014] [Indexed: 11/19/2022] Open
Abstract
Statistical coupling analysis (SCA) is a method for analyzing multiple sequence alignments that was used to identify groups of coevolving residues termed "sectors". The method applies spectral analysis to a matrix obtained by combining correlation information with sequence conservation. It has been asserted that the protein sectors identified by SCA are functionally significant, with different sectors controlling different biochemical properties of the protein. Here we reconsider the available experimental data and note that it involves almost exclusively proteins with a single sector. We show that in this case sequence conservation is the dominating factor in SCA, and can alone be used to make statistically equivalent functional predictions. Therefore, we suggest shifting the experimental focus to proteins for which SCA identifies several sectors. Correlations in protein alignments, which have been shown to be informative in a number of independent studies, would then be less dominated by sequence conservation.
|
31
|
Al-Shatnawi M, Ahmad MO, Swamy MNS. Prediction of Indel flanking regions in protein sequences using a variable-order Markov model. Bioinformatics 2015; 31:40-7. [PMID: 25178462 DOI: 10.1093/bioinformatics/btu556] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Insertion/deletion (indel) and amino acid substitution are two common events that lead to the evolution of and variations in protein sequences. Further, many of the human diseases and functional divergence between homologous proteins are more related to indel mutations, even though they occur less often than the substitution mutations do. A reliable identification of indels and their flanking regions is a major challenge in research related to protein evolution, structures and functions. RESULTS In this article, we propose a novel scheme to predict indel flanking regions in a protein sequence for a given protein fold, based on a variable-order Markov model. The proposed indel flanking region (IndelFR) predictors are designed based on prediction by partial match (PPM) and probabilistic suffix tree (PST), which are referred to as the PPM IndelFR and PST IndelFR predictors, respectively. The overall performance evaluation results show that the proposed predictors are able to predict IndelFRs in the protein sequences with a high accuracy and F1 measure. In addition, the results show that if one is interested only in predicting IndelFRs in protein sequences, it would be preferable to use the proposed predictors instead of HMMER 3.0 in view of the substantially superior performance of the former.
Affiliation(s)
- Mufleh Al-Shatnawi: Department of Electrical and Computer Engineering, Concordia University, QC H3G 2W1, Canada
- M Omair Ahmad: Department of Electrical and Computer Engineering, Concordia University, QC H3G 2W1, Canada
- M N S Swamy: Department of Electrical and Computer Engineering, Concordia University, QC H3G 2W1, Canada
|
32
|
Suplatov D, Voevodin V, Švedas V. Robust enzyme design: bioinformatic tools for improved protein stability. Biotechnol J 2014; 10:344-55. [PMID: 25524647 DOI: 10.1002/biot.201400150] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Received: 06/30/2014] [Revised: 09/30/2014] [Accepted: 11/04/2014] [Indexed: 01/22/2023]
Abstract
The ability of proteins and enzymes to maintain a functionally active conformation under adverse environmental conditions is an important feature of biocatalysts, vaccines, and biopharmaceutical proteins. From an evolutionary perspective, robust stability of proteins improves their biological fitness and allows for further optimization. Viewed from an industrial perspective, enzyme stability is crucial for the practical application of enzymes under the required reaction conditions. In this review, we analyze bioinformatic-driven strategies that are used to predict structural changes that can be applied to wild type proteins in order to produce more stable variants. The most commonly employed techniques can be classified into stochastic approaches, empirical or systematic rational design strategies, and design of chimeric proteins. We conclude that bioinformatic analysis can be efficiently used to study large protein superfamilies systematically as well as to predict particular structural changes which increase enzyme stability. Evolution has created a diversity of protein properties that are encoded in genomic sequences and structural data. Bioinformatics has the power to uncover this evolutionary code and provide a reproducible selection of hotspots - key residues to be mutated in order to produce more stable and functionally diverse proteins and enzymes. Further development of systematic bioinformatic procedures is needed to organize and analyze sequences and structures of proteins within large superfamilies and to link them to function, as well as to provide knowledge-based predictions for experimental evaluation.
Affiliation(s)
- Dmitry Suplatov: Belozersky Institute of Physicochemical Biology and Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
|
33
|
Three-dimensional protein structure prediction: Methods and computational strategies. Comput Biol Chem 2014; 53PB:251-276. [DOI: 10.1016/j.compbiolchem.2014.10.001] [Citation(s) in RCA: 121] [Impact Index Per Article: 12.1] [Received: 06/12/2014] [Revised: 10/03/2014] [Accepted: 10/07/2014] [Indexed: 01/01/2023]
|
34
|
Lyras DP, Metzler D. ReformAlign: improved multiple sequence alignments using a profile-based meta-alignment approach. BMC Bioinformatics 2014; 15:265. [PMID: 25099134 PMCID: PMC4133627 DOI: 10.1186/1471-2105-15-265] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 03/24/2014] [Accepted: 07/29/2014] [Indexed: 11/16/2022] Open
Abstract
Background Obtaining an accurate sequence alignment is fundamental for consistently analyzing biological data. Although this problem may be efficiently solved when only two sequences are considered, the exact inference of the optimal alignment easily becomes computationally intractable for the multiple sequence alignment case. To cope with the high computational expense, approximate heuristic methods have been proposed that address the problem indirectly by progressively aligning the sequences in pairs according to their relatedness. These methods, however, cannot change the alignment of an already aligned group of sequences in view of new data, which compromises the quality of the resulting alignment. In this paper we present ReformAlign, a novel meta-alignment approach that may significantly improve the quality of the alignments derived from popular aligners. We call ReformAlign a meta-aligner as it requires an initial alignment, for which a variety of alignment programs can be used. The main idea behind ReformAlign is quite straightforward: at first, an existing alignment is used to construct a standard profile which summarizes the initial alignment, and then all sequences are individually re-aligned against the formed profile. From each sequence-profile comparison, the alignment of each sequence against the profile is recorded and the final alignment is indirectly inferred by merging all the individual sub-alignments into a unified set. The employment of ReformAlign may often result in alignments which are significantly more accurate than the starting alignments. Results We evaluated the effect of ReformAlign on the alignments generated by ten leading alignment methods using real data of variable size and sequence identity. The experimental results suggest that the proposed meta-aligner approach may often lead to statistically significantly more accurate alignments. Furthermore, we show that ReformAlign results in more substantial improvement in cases where the starting alignment is of relatively inferior quality or when the input sequences are harder to align. Conclusions The proposed profile-based meta-alignment approach seems to be a promising and computationally efficient method that can be combined with practically all popular alignment methods and may lead to significant improvements in the generated alignments. Electronic supplementary material The online version of this article (doi:10.1186/1471-2105-15-265) contains supplementary material, which is available to authorized users.
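The first step described above, summarizing an existing alignment as a profile, can be sketched in Python as per-column symbol frequencies. This is only a minimal illustration of what a profile is: the function name `build_profile` is hypothetical, gaps are counted as an ordinary symbol, and the sketch does not reproduce ReformAlign's actual profile model, its sequence-to-profile re-alignment, or its merging step.

```python
from collections import Counter


def build_profile(alignment):
    """Summarize an alignment as one dict per column, mapping each symbol
    (gaps included) to its relative frequency in that column.

    alignment: list of equal-length gapped rows, one per sequence.
    """
    n = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({symbol: count / n for symbol, count in counts.items()})
    return profile
```

Each sequence would then be re-aligned against such a profile, typically by dynamic programming with column-specific scores.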
Affiliation(s)
- Dimitrios P Lyras: Faculty of Biology, Department II, Ludwig-Maximilians Universität München, Planegg-Martinsried 82152, Germany.
|
35
|
Kaya M, Sarhan A, Alhajj R. Multiple sequence alignment with affine gap by using multi-objective genetic algorithm. Computer Methods and Programs in Biomedicine 2014; 114:38-49. [PMID: 24534604 DOI: 10.1016/j.cmpb.2014.01.013] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 04/20/2013] [Revised: 11/29/2013] [Accepted: 01/12/2014] [Indexed: 06/03/2023]
Abstract
Multiple sequence alignment is of central importance to bioinformatics and computational biology. Although a large number of algorithms for computing a multiple sequence alignment have been designed, the efficient computation of highly accurate and statistically significant multiple alignments is still a challenge. In this paper, we propose an efficient method that uses a multi-objective genetic algorithm (MSAGMOGA) to discover optimal alignments with affine gap in multiple sequence data. The main advantage of our approach is that a large number of tradeoff (i.e., non-dominated) alignments can be obtained in a single run with respect to conflicting objectives: affine gap penalty minimization and similarity and support maximization. To the best of our knowledge, this is the first effort with three objectives in this direction. The proposed method can be applied to any data set with a sequential character. Furthermore, it allows any choice of similarity measure for finding alignments. By analyzing the obtained optimal alignments, the decision maker can understand the tradeoff between the objectives. We compared our method with three well-known multiple sequence alignment methods: MUSCLE, SAGA and MSA-GA. The first of them is a progressive method, and the other two are based on evolutionary algorithms. Experiments on the BAliBASE 2.0 database were conducted, and the results confirm that MSAGMOGA obtains results with better accuracy and statistical significance than the three well-known methods when aligning multiple sequences with affine gap. The proposed method also finds solutions faster than the other evolutionary approaches mentioned above.
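For readers unfamiliar with the affine gap objective, the Python sketch below computes an affine gap penalty for one aligned row: the first gap symbol in a run costs an opening penalty and each further symbol in the same run costs a smaller extension penalty. The parameter values and the convention that the opening penalty covers the first gap position are assumptions for illustration; the abstract does not state the values or the exact convention used by MSAGMOGA.

```python
def affine_gap_penalty(row, gap_open=10.0, gap_extend=0.5, gap="-"):
    """Affine gap penalty of one gapped row.

    Each maximal run of gaps of length L costs
    gap_open + (L - 1) * gap_extend (illustrative default values).
    """
    penalty = 0.0
    in_gap = False
    for symbol in row:
        if symbol == gap:
            # Opening a new run is expensive; extending one is cheap.
            penalty += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
    return penalty
```

Summing this quantity over all rows of an alignment gives one objective to minimize, which favours a few long gaps over many short ones.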
Affiliation(s)
- Mehmet Kaya: Department of Computer Engineering, Firat University, 23119 Elazig, Turkey.
- Abdullah Sarhan: Department of Computer Science, University of Calgary, Calgary, AB, Canada.
- Reda Alhajj: Department of Computer Science, University of Calgary, Calgary, AB, Canada; Department of Computer Science, Global University, Beirut, Lebanon.
|
36
|
Abstract
Background Sequence alignment has become an indispensable tool in modern molecular biology research, and probabilistic sequence alignment models have been shown to provide an effective framework for building accurate sequence alignment tools. One such example is the pair hidden Markov model (pair-HMM), which has been especially popular in comparative sequence analysis for several reasons, including their effectiveness in modeling and detecting sequence homology, model simplicity, and the existence of efficient algorithms for applying the model to sequence alignment problems. However, despite these advantages, pair-HMMs also have a number of practical limitations that may degrade their alignment performance or render them unsuitable for certain alignment tasks. Results In this work, we propose a novel scheme for comparing and aligning biological sequences that can effectively address the shortcomings of the traditional pair-HMMs. The proposed scheme is based on a simple message-passing approach, where messages are exchanged between neighboring symbol pairs that may be potentially aligned in the optimal sequence alignment. The message-passing process yields probabilistic symbol alignment confidence scores, which may be used for predicting the optimal alignment that maximizes the expected number of correctly aligned symbol pairs. Conclusions Extensive performance evaluation on protein alignment benchmark datasets shows that the proposed message-passing scheme clearly outperforms the traditional pair-HMM-based approach, in terms of both alignment accuracy and computational efficiency. Furthermore, the proposed scheme is numerically robust and amenable to massive parallelization.
|
37
|
Abstract
Sequence alignment remains a fundamental task in bioinformatics. The literature contains programs that employ a host of exact and heuristic strategies available in computer science. Probcons was the first program to construct maximum expected accuracy sequence alignments with hidden Markov models and at the time of its publication achieved the highest accuracies on standard protein multiple alignment benchmarks. Probalign followed this strategy except that it used a partition function approach instead of hidden Markov models. Several programs employing both strategies have been published since then. In this chapter we describe Probcons and Probalign.
Affiliation(s)
- Usman Roshan: Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
|
38
|
Réblová M, Réblová K. RNA secondary structure, an important bioinformatics tool to enhance multiple sequence alignment: a case study (Sordariomycetes, Fungi). Mycol Prog 2012. [DOI: 10.1007/s11557-012-0836-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 10/28/2022]
|
39
|
Plyusnin I, Holm L. Comprehensive comparison of graph based multiple protein sequence alignment strategies. BMC Bioinformatics 2012; 13:64. [PMID: 22540977 PMCID: PMC3375188 DOI: 10.1186/1471-2105-13-64] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 11/24/2011] [Accepted: 04/29/2012] [Indexed: 12/03/2022] Open
Abstract
Background Alignment of protein sequences (MPSA) is the starting point for a multitude of applications in molecular biology. Here, we present a novel MPSA program based on the SeqAn sequence alignment library. Our implementation has a strict modular structure, which allows different components of the alignment process to be swapped and, thus, their contribution to the alignment quality and computation time to be investigated. We systematically varied information sources, guiding trees, score transformations and iterative refinement options, and evaluated the resulting alignments on BAliBASE and SABmark. Results Our results indicate the optimal alignment strategy based on the choices compared. First, we show that pairwise global and local alignments contain sufficient information to construct a high quality multiple alignment. Second, single linkage clustering is almost invariably the best algorithm to build a guiding tree for progressive alignment. Third, triplet library extension, with introduction of new edges, is the most efficient consistency transformation of those compared. Alternatively, one can apply tree dependent partitioning as a post processing step, which was shown to be comparable with the best consistency transformation in both time and accuracy. Finally, propagating information beyond four transitive links introduces more noise than signal. Conclusions This is the first time multiple protein alignment strategies are comprehensively and clearly compared using a single implementation platform. In particular, we showed which of the existing consistency transformations and iterative refinement techniques are the most valid. Our implementation is freely available at http://ekhidna.biocenter.helsinki.fi/MMSA and as a supplementary file attached to this article (see Additional file 1).
Affiliation(s)
- Ilya Plyusnin: Institute of Biotechnology, University of Helsinki, P.O. Box 56, Viikinkaari 5, Helsinki, Finland.
|
40
|
Jagadeesh Chandra Bose R, van der Aalst WM. Process diagnostics using trace alignment: Opportunities, issues, and challenges. Inf Syst 2012. [DOI: 10.1016/j.is.2011.08.003] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Indexed: 10/17/2022]
|
41
|
Wang CK, Broder U, Weeratunga SK, Gasser RB, Loukas A, Hofmann A. SBAL: a practical tool to generate and edit structure-based amino acid sequence alignments. Bioinformatics 2012; 28:1026-7. [PMID: 22332239 DOI: 10.1093/bioinformatics/bts035] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Indexed: 11/14/2022]
Abstract
SUMMARY Both alignment generation and visualization are important processes for producing biologically meaningful sequence alignments. Computational tools that combine reliable, automated and semi-automated approaches to produce secondary structure-based alignments with an appropriate visualization of the results are rare. We have developed SBAL, a tool to generate and edit secondary structure-based sequence alignments. It is easy to install and provides a user-friendly interface. Sequence alignments are displayed, with secondary structure assignments mapped to their corresponding regions in the sequence by using a simple colour scheme. The algorithm implemented for automated and semi-automated secondary structure-based alignment calculations shows a comparable performance to existing software. AVAILABILITY AND IMPLEMENTATION SBAL has been implemented in Java to provide cross-platform compatibility. SBAL is freely available to academic users at http://www.structuralchemistry.org/pcsb/. Users will be asked for their name, institution and email address. A manual can also be downloaded from this site. The software, manual and test sets are also available as supplementary material. CONTACT conan.wang@griffith.edu.au SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Conan K Wang
- Structural Chemistry Program, Eskitis Institute for Cell and Molecular Therapies, Griffith University, Brisbane, Qld 4111, Australia.
|
42
|
|
43
|
Magrane M. UniProt Knowledgebase: a hub of integrated protein data. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2011; 2011:bar009. [PMID: 21447597 PMCID: PMC3070428 DOI: 10.1093/database/bar009] [Citation(s) in RCA: 1057] [Impact Index Per Article: 81.3] [Indexed: 12/21/2022]
Abstract
The UniProt Knowledgebase (UniProtKB) acts as a central hub of protein knowledge by providing a unified view of protein sequence and functional information. Manual and automatic annotation procedures are used to add data directly to the database while extensive cross-referencing to more than 120 external databases provides access to additional relevant information in more specialized data collections. UniProtKB also integrates a range of data from other resources. All information is attributed to its original source, allowing users to trace the provenance of all data. The UniProt Consortium is committed to using and promoting common data exchange formats and technologies, and UniProtKB data is made freely available in a range of formats to facilitate integration with other databases. Database URL: http://www.uniprot.org/
Affiliation(s)
- Michele Magrane
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
|
44
|
Zhang Z, Wang Y, Wang L, Gao P. The combined effects of amino acid substitutions and indels on the evolution of structure within protein families. PLoS One 2010; 5:e14316. [PMID: 21179197 PMCID: PMC3001449 DOI: 10.1371/journal.pone.0014316] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Received: 06/22/2010] [Accepted: 11/16/2010] [Indexed: 01/02/2023] Open
Abstract
BACKGROUND: In the process of protein evolution, sequence variations within protein families can cause changes in protein structures and functions. However, structures tend to be more conserved than sequences and functions. This leads to an intriguing question: what is the evolutionary mechanism by which sequence variations produce structural changes? To investigate this question, we focused on the most common types of sequence variations: amino acid substitutions and insertions/deletions (indels). Here their combined effects on protein structure evolution within protein families are studied. RESULTS: Sequence-structure correlation analysis on 75 homologous structure families (from SCOP) that contain 20 or more non-redundant structures shows that in most of these families there is, statistically, a bilinear correlation between the amount of substitutions and indels versus the degree of structure variations. Bilinear regression of percent sequence non-identity (PNI) and standardized number of gaps (SNG) versus RMSD was performed. The coefficients from the regression analysis could be used to estimate the structure changes caused by each unit of substitution (structural substitution sensitivity, SSS) and by each unit of indel (structural indel sensitivity, SIDS). An analysis of 52 families with high bilinear fitting multiple correlation coefficients and statistically significant regression coefficients showed that SSS is mainly constrained by disulfide bonds, which have almost no effect on SIDS. CONCLUSIONS: Structural changes in homologous protein families could be rationally explained by a bilinear model combining amino acid substitutions and indels. These results may further improve our understanding of the evolutionary mechanisms of protein structures.
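The bilinear regression described in this abstract can be illustrated with a minimal sketch. It assumes the model has the form RMSD ≈ SSS·PNI + SIDS·SNG + intercept and is fitted by ordinary least squares; the intercept term, the function name `fit_bilinear`, and the synthetic numbers are assumptions made for illustration and are not taken from the paper.

```python
def fit_bilinear(pni, sng, rmsd):
    """Least-squares fit of rmsd ~ sss * pni + sids * sng + intercept.

    Returns (sss, sids, intercept). Hypothetical helper, not the authors' code.
    """
    # Design matrix rows: (PNI, SNG, 1) for each structure pair.
    rows = [(p, g, 1.0) for p, g in zip(pni, sng)]
    # Build the normal equations (X^T X) beta = X^T y as an augmented 3x4 matrix.
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * y for r, y in zip(rows, rmsd))]
        for i in range(3)
    ]
    # Forward elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    beta = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        tail = sum(a[i][j] * beta[j] for j in range(i + 1, 3))
        beta[i] = (a[i][3] - tail) / a[i][i]
    return tuple(beta)


# Synthetic, noise-free data generated from known coefficients (illustrative only).
pni = [10.0, 20.0, 30.0, 40.0, 50.0]
sng = [0.1, 0.5, 0.2, 0.9, 0.4]
rmsd = [0.02 * p + 0.5 * g + 0.3 for p, g in zip(pni, sng)]
sss, sids, intercept = fit_bilinear(pni, sng, rmsd)
print(sss, sids, intercept)
```

In this reading, the fitted coefficient on PNI plays the role of SSS and the coefficient on SNG plays the role of SIDS; with real data the fit would also need the significance checks the abstract mentions.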
Affiliation(s)
- Zheng Zhang
- State Key Laboratory of Microbial Technology, Shandong University, Jinan, Shandong, China
- Yuxiao Wang
- State Key Laboratory of Microbial Technology, Shandong University, Jinan, Shandong, China
- Division of Basic Science, UT Southwestern, Dallas, Texas, United States of America
- Lushan Wang
- State Key Laboratory of Microbial Technology, Shandong University, Jinan, Shandong, China
- * E-mail: (LW); (PG)
- Peiji Gao
- State Key Laboratory of Microbial Technology, Shandong University, Jinan, Shandong, China
- * E-mail: (LW); (PG)
|
45
|
Abstract
The metrization of the space of neural responses is an ongoing research program seeking to find natural ways to describe, in geometrical terms, the sets of possible activities in the brain. One component of this program is spike metrics—notions of distance between two spike trains recorded from a neuron. Alignment spike metrics work by identifying “equivalent” spikes in both trains. We present an alignment spike metric having [Formula: see text] underlying geometrical structure; the [Formula: see text] version is Euclidean and is suitable for further embedding in Euclidean spaces by multidimensional scaling methods or related procedures. We show how to implement a fast algorithm for the computation of this metric based on bipartite graph matching theory.
Affiliation(s)
- Brad A. Seiler
- Center for Studies in Physics and Biology, Rockefeller University, New York, New York 10065, U.S.A., and Harvard University, Faculty of Arts and Sciences, Cambridge, MA 02138, U.S.A.
- Marcelo O. Magnasco
- Center for Studies in Physics and Biology, Rockefeller University, New York, New York 10065, U.S.A.
|
46
|
Sahraeian SME, Yoon BJ. PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences. Nucleic Acids Res 2010; 38:4917-28. [PMID: 20413579 PMCID: PMC2926610 DOI: 10.1093/nar/gkq255] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Received: 12/23/2009] [Revised: 03/25/2010] [Accepted: 03/26/2010] [Indexed: 11/13/2022] Open
Abstract
Accurate tools for multiple sequence alignment (MSA) are essential for comparative studies of the function and structure of biological sequences. However, it is very challenging to develop a computationally efficient algorithm that can consistently predict accurate alignments for various types of sequence sets. In this article, we introduce PicXAA (Probabilistic Maximum Accuracy Alignment), a probabilistic non-progressive alignment algorithm that aims to find protein alignments with maximum expected accuracy. PicXAA greedily builds up the multiple alignment from sequence regions with high local similarities, thereby yielding an accurate global alignment that effectively grasps the local similarities among sequences. Evaluations on several widely used benchmark sets show that PicXAA constantly yields accurate alignment results on a wide range of reference sets, with especially remarkable improvements over other leading algorithms on sequence sets with local similarities. PicXAA source code is freely available at: http://www.ece.tamu.edu/~bjyoon/picxaa/.
Affiliation(s)
- Byung-Jun Yoon
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
|
47
|
Di Lena P, Margara L. Optimal global alignment of signals by maximization of Pearson correlation. INFORM PROCESS LETT 2010. [DOI: 10.1016/j.ipl.2010.05.024] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Indexed: 10/19/2022]
|
48
|
Notredame C. Computing multiple sequence/structure alignments with the T-coffee package. Curr Protoc Bioinformatics 2010; Chapter 3:3.8.1-3.8.25. [PMID: 20205190 DOI: 10.1002/0471250953.bi0308s29] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Indexed: 01/21/2023]
Abstract
In this unit, we describe assembly of a multiple sequence alignment using the T-Coffee package. T-Coffee is much more flexible than most related methods (e.g., ClustalW) because it makes it possible to combine many alternative alignments into a single one, based on an estimate of consistency between these alignments. This strategy can be especially useful when one has to decide among the output produced by several alternative methods.
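The idea of weighting aligned residue pairs by their agreement across alternative alignments can be sketched as follows. This is a simplified illustration only, not T-Coffee's actual library-extension scheme: it assumes each alternative is a pairwise alignment of the same two sequences given as gapped strings, and the helper names `aligned_pairs` and `pair_support` are hypothetical.

```python
from collections import Counter


def aligned_pairs(row_a, row_b):
    """Return the set of (i, j) residue-index pairs aligned in two gapped rows."""
    pairs = set()
    i = j = 0  # positions in the ungapped sequences
    for char_a, char_b in zip(row_a, row_b):
        if char_a != "-" and char_b != "-":
            pairs.add((i, j))
        if char_a != "-":
            i += 1
        if char_b != "-":
            j += 1
    return pairs


def pair_support(alignments):
    """Map each aligned residue pair to the fraction of alignments containing it."""
    counts = Counter()
    for row_a, row_b in alignments:
        counts.update(aligned_pairs(row_a, row_b))
    total = len(alignments)
    return {pair: n / total for pair, n in counts.items()}


# Two alternative alignments of the toy sequences ACGT and AGT.
alternatives = [
    ("ACGT", "A-GT"),
    ("ACGT", "AG-T"),
]
for pair, support in sorted(pair_support(alternatives).items()):
    print(pair, support)
```

Pairs supported by every alternative receive a weight of 1.0, while pairs on which the alternatives disagree receive lower weights; a consistency-based aligner would then favour the highly supported pairs when building the final alignment.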
|
49
|
Kück P, Meusemann K, Dambach J, Thormann B, von Reumont BM, Wägele JW, Misof B. Parametric and non-parametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. Front Zool 2010; 7:10. [PMID: 20356385 PMCID: PMC2867768 DOI: 10.1186/1742-9994-7-10] [Citation(s) in RCA: 150] [Impact Index Per Article: 10.7] [Received: 12/09/2009] [Accepted: 03/31/2010] [Indexed: 12/16/2022] Open
Abstract
Background: Methods of alignment masking, which refers to the technique of excluding alignment blocks prior to tree reconstructions, have been successful in improving the signal-to-noise ratio in sequence alignments. However, the lack of formally well-defined methods to identify randomness in sequence alignments has prevented a routine application of alignment masking. In this study, we compared the effects on tree reconstructions of the most commonly used profiling method (GBLOCKS), which uses a predefined set of rules in combination with alignment masking, with a new profiling approach (ALISCORE) based on Monte Carlo resampling within a sliding window, using different data sets and alignment methods. While the GBLOCKS approach excludes variable sections above a certain threshold, the choice of which is left arbitrary, the ALISCORE algorithm is free of a priori rating of parameter space and therefore more objective. Results: ALISCORE was successfully extended to amino acids using a proportional model and empirical substitution matrices to score randomness in multiple sequence alignments. A complex bootstrap resampling leads to an even distribution of scores of randomly similar sequences to assess randomness of the observed sequence similarity. Testing performance on real data, both masking methods, GBLOCKS and ALISCORE, helped to improve tree resolution. The sliding window approach was less sensitive to different alignments of identical data sets and performed equally well on all data sets. Concurrently, ALISCORE is capable of dealing with different substitution patterns and heterogeneous base composition. ALISCORE and the most relaxed GBLOCKS gap parameter setting performed best on all data sets. Correspondingly, Neighbor-Net analyses showed the greatest decrease in conflict. Conclusions: Alignment masking improves signal-to-noise ratio in multiple sequence alignments prior to phylogenetic reconstruction. Given the robust performance of alignment profiling, alignment masking should routinely be used to improve tree reconstructions. Parametric methods of alignment profiling can be easily extended to more complex likelihood-based models of sequence evolution, which opens the possibility of further improvements.
Affiliation(s)
- Patrick Kück
- Zoologisches Forschungsmuseum A. Koenig, Adenauerallee 160, 53113 Bonn, Germany.
|
50
|
Yoo PD, Zhou BB, Zomaya AY. A modular kernel approach for integrative analysis of protein domain boundaries. BMC Genomics 2009; 10 Suppl 3:S21. [PMID: 19958485 PMCID: PMC2788374 DOI: 10.1186/1471-2164-10-s3-s21] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND: In this paper, we introduce a novel inter-range interaction integrated approach for protein domain boundary prediction. It involves (1) the design of a modular kernel algorithm, which is able to effectively exploit the information of non-local interactions in amino acids, and (2) the development of a novel profile that can provide suitable information to the algorithm. One of the key features of this profiling technique is the use of multiple structural alignments of remote homologues to create an extended sequence profile that combines the structural information with suitable chemical information that plays an important role in protein stability. This profile can capture the sequence characteristics of an entire structural superfamily and extend a range of profiles generated from sequence similarity alone. RESULTS: Our novel profile, which combines homology information with hydrophobicity from the SARAH1 scale, was successful in providing more structural and chemical information. In addition, the modular approach adopted in our algorithm proved to be effective in capturing information from non-local interactions. Our approach achieved 82.1%, 50.9% and 31.5% accuracies for one-domain, two-domain, and three- and more domain proteins respectively. CONCLUSION: The experimental results in this study are encouraging; however, more work is needed to extend it to a broader range of applications. We are currently developing a novel interactive (human in the loop) profiling that can provide information from more distantly related homology. This approach will further enhance the current study.
Affiliation(s)
- Paul D Yoo
- Advanced Networks Research Group, School of Information Technologies (J12), the University of Sydney, NSW 2006, Australia.
|