1
Tao Y, Ge S. A distribution-guided Mapper algorithm. BMC Bioinformatics 2025; 26:73. [PMID: 40045218 PMCID: PMC11881416 DOI: 10.1186/s12859-025-06085-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2024] [Accepted: 02/14/2025] [Indexed: 03/09/2025]
Abstract
BACKGROUND: The Mapper algorithm is an essential tool for exploring the shape of data in topological data analysis. With a dataset as input, the Mapper algorithm outputs a graph representing the topological features of the whole dataset. This graph is often regarded as an approximation of a Reeb graph of the dataset. The classic Mapper algorithm uses fixed interval lengths and overlapping ratios, which might fail to reveal subtle features of a dataset, especially when the underlying structure is complex.
RESULTS: In this work, we introduce a distribution-guided Mapper algorithm named D-Mapper, which uses the properties of a probability model and the intrinsic characteristics of the data to generate density-guided covers and provide enhanced topological features. Moreover, we introduce a metric accounting for both the quality of overlap clustering and extended persistent homology to measure the performance of Mapper-type algorithms. Our numerical experiments indicate that the D-Mapper outperforms the classic Mapper algorithm in various scenarios. We also apply the D-Mapper to a SARS-CoV-2 coronavirus RNA sequence dataset to explore the topological structure of different virus variants. The results indicate that the D-Mapper algorithm can reveal both the vertical and horizontal evolutionary processes of the viruses. Our code is available on GitHub (https://github.com/ShufeiGe/D-Mapper).
CONCLUSION: The D-Mapper algorithm can generate covers from data based on a probability model. This work demonstrates the power of fusing probabilistic models with Mapper algorithms.
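As a point of reference for the "fixed interval lengths and overlapping ratios" of the classic Mapper mentioned in this abstract, the following is a minimal sketch of a uniform overlapping cover of a one-dimensional filter range. The function name `uniform_cover` and its parameters are illustrative assumptions, not taken from the D-Mapper repository, and the density-guided cover proposed in the paper is not reproduced here.

```python
def uniform_cover(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with n_intervals equal-length intervals.

    `overlap` is the fraction of an interval's length shared with the
    next interval (an assumed convention; other definitions exist).
    """
    if n_intervals < 1:
        raise ValueError("n_intervals must be at least 1")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    # Choose the length so that the last interval ends exactly at `hi`.
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def members(values, interval):
    """Indices of the filter values that fall inside one cover interval."""
    a, b = interval
    return [i for i, v in enumerate(values) if a <= v <= b]
```

In a Mapper pipeline, each interval's members would then be clustered and clusters sharing points would be joined by edges; that step is omitted here.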
Affiliation(s)
- Yuyang Tao
  - Institute of Mathematical Sciences, ShanghaiTech University, 393 Middle Huaxia Road, 201210, Shanghai, China
- Shufei Ge
  - Institute of Mathematical Sciences, ShanghaiTech University, 393 Middle Huaxia Road, 201210, Shanghai, China
2
Hajieghrari B, Niazi A. Phylogenetic and Evolutionary Analysis of Plant Small RNA 2'-O-Methyltransferase (HEN1) Protein Family. J Mol Evol 2023:10.1007/s00239-023-10109-0. [PMID: 37191719 DOI: 10.1007/s00239-023-10109-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2022] [Accepted: 04/05/2023] [Indexed: 05/17/2023]
Abstract
HUA ENHANCER 1 (HEN1) is a pivotal mediator in protecting sRNAs from 3'-end uridylation and 3' to 5' exonuclease-mediated degradation in plants. Here, we investigated the evolutionary history of the HEN1 protein family and possible relationships among plant lineages using protein sequence analyses, conserved motif composition, functional domain identification and architecture, phylogenetic tree reconstruction, and evolutionary history inference. According to our results, HEN1 protein sequences bear several highly conserved motifs that plant species have retained during evolution from their ancestor. However, several motifs are present only in Gymnosperms and Angiosperms. A similar trend was observed for their domain architecture. At the same time, phylogenetic analysis revealed the grouping of the HEN1 proteins into three main superclades. In addition, the Neighbor-net network analysis shows that some nodes have multiple parents, indicating a few conflicting signals in the data that are not the consequence of sampling error, the effect of the selected model, or the estimation method. By reconciling the protein and species trees, we considered the gene duplications in several given species and found 170 duplication events in the evolution of HEN1 in the plant lineages. According to our analysis, the main HEN1 superclass mostly showed orthologous sequences, which illustrates the vertical transmission of HEN1 to the main lines. However, in both orthologs and paralogs, we predicted insignificant structural deviations. Our analysis implies that small local structural changes that occur continuously during the folds can moderate the changes created in the sequence. According to our results, we proposed a hypothetical model and evolutionary trajectory for the HEN1 protein family in the plant kingdom.
Affiliation(s)
- Behzad Hajieghrari
  - Department of Agricultural Biotechnology, College of Agriculture, Jahrom University, P.O. Box 74135-111, Jahrom, Islamic Republic of Iran
- Ali Niazi
  - Institute of Biotechnology, School of Agriculture, Shiraz University, Shiraz, Islamic Republic of Iran
3
Kille B, Balaji A, Sedlazeck FJ, Nute M, Treangen TJ. Multiple genome alignment in the telomere-to-telomere assembly era. Genome Biol 2022; 23:182. [PMID: 36038949 PMCID: PMC9421119 DOI: 10.1186/s13059-022-02735-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 10/20/2021] [Accepted: 07/21/2022] [Indexed: 01/22/2023]
Abstract
With the arrival of telomere-to-telomere (T2T) assemblies of the human genome comes the computational challenge of efficiently and accurately constructing multiple genome alignments at an unprecedented scale. By identifying nucleotides across genomes which share a common ancestor, multiple genome alignments commonly serve as the bedrock for comparative genomics studies. In this review, we provide an overview of the algorithmic template that most multiple genome alignment methods follow. We also discuss prospective areas of improvement of multiple genome alignment for keeping up with continuously arriving high-quality T2T assembled genomes and for unlocking clinically-relevant insights.
Affiliation(s)
- Bryce Kille
  - Department of Computer Science, Rice University, Houston, TX, USA
- Advait Balaji
  - Department of Computer Science, Rice University, Houston, TX, USA
- Fritz J Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Michael Nute
  - Department of Computer Science, Rice University, Houston, TX, USA
- Todd J Treangen
  - Department of Computer Science, Rice University, Houston, TX, USA
4
In silico analysis and expression profiling of Expansin A4, BURP domain protein RD22-like and E6-like genes associated with fiber quality in cotton. Mol Biol Rep 2022; 49:5521-5534. [PMID: 35553343 DOI: 10.1007/s11033-022-07432-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2022] [Accepted: 03/25/2022] [Indexed: 10/18/2022]
Abstract
BACKGROUND: To supply high-quality cotton fibre for the textile industry, the development of long, strong and fine fibre cotton varieties is imperative. An interlinked approach was used to comprehend the role of fibre genes by analyzing interspecific progenies of cotton species. Wild Gossypium species and races are a rich source of genetic polymorphism due to environmental dispersal and continuous natural selection. These genetic resources hold a mass of outstanding genes that can be used in cotton improvement breeding programs to exploit possible traits such as fibre quality, abiotic stress tolerance, and disease and insect resistance. Therefore, the use of new molecular techniques such as genomics, transcriptomics and bioinformatics is very important to utilize the genetic potential of wild species in cotton improvement programs.
METHODS: Interspecific lines and Gossypium species used in the study were grown at Central Cotton Research Institute (CCRI), Multan. After retrieving the DNA sequences of the genes from NCBI, the primers for gene expression and full-length gene sequence were designed. Expression profiling of Expansin A4, BURP Domain protein RD22-like and E6-like fibre genes was performed through Real Time PCR. BLAST and DNA sequence alignment were conducted for sequence comparison of interspecific lines and Gossypium species. Different in silico analyses were used for characterization of fibre genes and identification of cis-acting promoter elements in the promoter region.
RESULTS: Variable expression of genes related to fibre development was observed at different stages. BLAST and DNA sequence alignment demonstrated resemblance of interspecific lines with G. hirsutum. In silico analysis on the sequence data also confirmed the role of Expansin A4, BURP Domain protein RD22-like and E6-like fibre genes in fibre development. Transferring E6-like, Expansin A4 and BURP Domain RD22-like genes into local cotton cultivars through genetic engineering is also recommended. Similarly, several stress-tolerance and light-responsive cis-acting elements were identified through promoter analysis, which may contribute to fibre development in the breeding programs.
CONCLUSION: Expansin A4, BURP Domain RD22-like and E6-like have a positive role in fibre development, with variable expression at stages associated with fibre length and strength.
5
Lajevardy SA, Kargari M. Developing new genetic algorithm based on integer programming for multiple sequence alignment. Soft Comput 2022. [DOI: 10.1007/s00500-022-06790-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
6
Abstract
Molecular evolutionary analyses require computationally intensive steps such as aligning multiple sequences, optimizing substitution models, inferring evolutionary trees, testing phylogenies by bootstrap analysis, and estimating divergence times. With the rise of large genomic data sets, phylogenomics is imposing a big carbon footprint on the environment with consequences for the planet's health. Electronic waste and energy usage are large environmental issues. Fortunately, innovative methods and heuristics are available to shrink the carbon footprint, presenting researchers with opportunities to lower the environmental costs and make evolutionary computing greener. Green computing will also enable greater scientific rigor and encourage broader participation in big data analytics.
Affiliation(s)
- Sudhir Kumar
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
  - Department of Biology, Temple University, Philadelphia, PA, USA
7
Xia X. Post-Alignment Adjustment and Its Automation. Genes (Basel) 2021; 12:genes12111809. [PMID: 34828415 PMCID: PMC8623120 DOI: 10.3390/genes12111809] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/28/2021] [Revised: 11/13/2021] [Accepted: 11/16/2021] [Indexed: 11/16/2022]
Abstract
Multiple sequence alignment (MSA) is the basis for almost all sequence comparison and molecular phylogenetic inferences. Large-scale genomic analyses are typically associated with automated progressive MSA without subsequent manual adjustment, which itself is often error-prone because of the lack of a consistent and explicit criterion. Here, I outlined several commonly encountered alignment errors that cannot be avoided by progressive MSA for nucleotide, amino acid, and codon sequences. Methods that could be automated to fix such alignment errors were then presented. I emphasized the utility of position weight matrix as a new tool for MSA refinement and illustrated its usage by refining the MSA of nucleotide and amino acid sequences. The main advantages of the position weight matrix approach include (1) its use of information from all sequences, in contrast to other commonly used methods based on pairwise alignment scores and inconsistency measures, and (2) its speedy computation, making it suitable for a large number of long viral genomic sequences.
Affiliation(s)
- Xuhua Xia
  - Department of Biology, University of Ottawa, Marie-Curie Private, Ottawa, ON K1N 9A7, Canada; Tel.: +1-613-562-5718
  - Ottawa Institute of Systems Biology, University of Ottawa, Ottawa, ON K1H 8M5, Canada
8
Li Y. Sequence Alignment with Q-Learning Based on the Actor-Critic Model. ACM T ASIAN LOW-RESO 2021. [DOI: 10.1145/3433540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
Abstract
Multiple sequence alignment methods refer to a series of algorithmic solutions for the alignment of evolutionary-related sequences while taking into account evolutionary events such as mutations, insertions, deletions, and rearrangements under certain conditions. In this article, we propose a method with Q-learning based on the Actor-Critic model for sequence alignment. We transform the sequence alignment problem into an agent's autonomous learning process. In this process, the reward of the possible next action taken is calculated, and the cumulative reward of the entire process is calculated. The results show that the method we propose is better than the gene algorithm and the dynamic programming method.
Affiliation(s)
- Yarong Li
  - The Experimental High School Attached to Beijing Normal University, Beijing, China
9
Echevarría LY, De la Riva I, Venegas PJ, Rojas-Runjaic FJM, Dias IR, Castroviejo-Fisher S. Total evidence and sensitivity phylogenetic analyses of egg-brooding frogs (Anura: Hemiphractidae). Cladistics 2021; 37:375-401. [PMID: 34478194 DOI: 10.1111/cla.12447] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Accepted: 11/10/2020] [Indexed: 01/06/2023]
Abstract
We study the phylogenetic relationships of egg-brooding frogs, a group of 118 neotropical species, unique among anurans by having embryos with large bell-shaped gills and females carrying their eggs on the dorsum, exposed or inside a pouch. We assembled a total evidence dataset of published and newly generated data containing 51 phenotypic characters and DNA sequences of 20 loci for 143 hemiphractids and 127 outgroup terminals. We performed six analytical strategies combining different optimality criteria (parsimony and maximum likelihood), alignment methods (tree- and similarity-alignment), and three different indel coding schemes (fifth character state, unknown nucleotide, and presence/absence characters matrix). Furthermore, we analyzed a subset of the total evidence dataset to evaluate the impact of phenotypic characters on hemiphractid phylogenetic relationships. Our main results include: (i) monophyly of Hemiphractidae and its six genera for all our analyses, novel relationships among hemiphractid genera, and non-monophyly of Hemiphractinae according to our preferred phylogenetic hypothesis; (ii) non-monophyly of current supraspecific taxonomies of Gastrotheca, an updated taxonomy is provided; (iii) previous differences among studies were mainly caused by differences in analytical factors, not by differences in character/taxon sampling; (iv) optimality criteria, alignment method, and indel coding caused differences among optimal topologies, in that order of degree; (v) in most cases, parsimony analyses are more sensitive to the addition of phenotypic data than maximum likelihood analyses; (vi) adding phenotypic data resulted in an increase of shared clades for most analyses.
Affiliation(s)
- Lourdes Y Echevarría
  - Laboratório de Sistemática de Vertebrados, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Av. Ipiranga 6681, Porto Alegre, RS, 90619-900, Brazil
  - División de Herpetología-Centro de Ornitología y Biodiversidad (CORBIDI), Urb. Huertos de San Antonio, Santa Rita No. 105 Of. 202, Surco, Lima, Perú
- Ignacio De la Riva
  - Museo Nacional de Ciencias Naturales-CSIC, C/José Gutiérrez Abascal 2, Madrid, 28006, Spain
- Pablo J Venegas
  - División de Herpetología-Centro de Ornitología y Biodiversidad (CORBIDI), Urb. Huertos de San Antonio, Santa Rita No. 105 Of. 202, Surco, Lima, Perú
- Iuri R Dias
  - Graduate Program in Zoology, Universidade Estadual de Santa Cruz, Rodovia Jorge Amado, km 16, Ilhéus, Bahia, 45662-900, Brazil
- Santiago Castroviejo-Fisher
  - Laboratório de Sistemática de Vertebrados, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS), Av. Ipiranga 6681, Porto Alegre, RS, 90619-900, Brazil
  - Department of Herpetology, American Museum of Natural History, New York, NY, 10024, USA
10
McInerney TW, Fulton-Howard B, Patterson C, Paliwal D, Jermiin LS, Patel HR, Pa J, Swerdlow RH, Goate A, Easteal S, Andrews SJ. A globally diverse reference alignment and panel for imputation of mitochondrial DNA variants. BMC Bioinformatics 2021; 22:417. [PMID: 34470617 PMCID: PMC8409003 DOI: 10.1186/s12859-021-04337-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 01/22/2021] [Accepted: 08/16/2021] [Indexed: 12/27/2022]
Abstract
BACKGROUND: Variation in mitochondrial DNA (mtDNA) identified by genotyping microarrays or by sequencing only the hypervariable regions of the genome may be insufficient to reliably assign mitochondrial genomes to phylogenetic lineages or haplogroups. This lack of resolution can limit functional and clinical interpretation of a substantial body of existing mtDNA data. To address this limitation, we developed and evaluated a large, curated reference alignment of complete mtDNA sequences as part of a pipeline for imputing missing mtDNA single nucleotide variants (mtSNVs). We call our reference alignment and pipeline MitoImpute.
RESULTS: We aligned the sequences of 36,960 complete human mitochondrial genomes downloaded from GenBank, filtered and controlled for quality. These sequences were reformatted for use in the imputation software IMPUTE2. We assessed the imputation accuracy of MitoImpute by measuring haplogroup and genotype concordance in data from the 1000 Genomes Project and the Alzheimer's Disease Neuroimaging Initiative (ADNI). The mean improvement of haplogroup assignment in the 1000 Genomes samples was 42.7% (Matthews correlation coefficient = 0.64). In the ADNI cohort, we imputed missing single nucleotide variants.
CONCLUSION: These results show that our reference alignment and panel can be used to impute missing mtSNVs in existing data obtained using microarrays, thereby broadening the scope of functional and clinical investigation of mtDNA. This improvement may be particularly useful in studies where participants have been recruited over time and mtDNA data obtained using different methods, enabling better integration of early data collected using less accurate methods with more recent sequence data.
Affiliation(s)
- Tim W McInerney
  - John Curtin School of Medical Research, Australian National University, Australian Capital Territory, Canberra, Australia
- Brian Fulton-Howard
  - Genetics and Genomic Sciences, Ronald M. Loeb Center for Alzheimer's Disease, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY, 10029, USA
- Christopher Patterson
  - Keck School of Medicine, Mark and Mary Stevens Neuroimaging and Informatics Institute, University of Southern California, Los Angeles, CA, USA
  - Department of Neurology, Alzheimer's Disease Research Center, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Devashi Paliwal
  - John Curtin School of Medical Research, Australian National University, Australian Capital Territory, Canberra, Australia
- Lars S Jermiin
  - CSIRO Land and Water, Commonwealth Scientific Industrial and Research Organization, Acton, ACT, 2601, Australia
  - Research School of Biology, Australian National University, Canberra, ACT, 2601, Australia
  - School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland
  - Earth Institute, University College Dublin, Belfield, Dublin 4, Ireland
- Hardip R Patel
  - John Curtin School of Medical Research, Australian National University, Australian Capital Territory, Canberra, Australia
- Judy Pa
  - Keck School of Medicine, Mark and Mary Stevens Neuroimaging and Informatics Institute, University of Southern California, Los Angeles, CA, USA
  - Department of Neurology, Alzheimer's Disease Research Center, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Russell H Swerdlow
  - Department of Neurology, Alzheimer's Disease Center, University of Kansas, Fairway, KS, USA
- Alison Goate
  - Genetics and Genomic Sciences, Ronald M. Loeb Center for Alzheimer's Disease, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY, 10029, USA
- Simon Easteal
  - John Curtin School of Medical Research, Australian National University, Australian Capital Territory, Canberra, Australia
- Shea J Andrews
  - Genetics and Genomic Sciences, Ronald M. Loeb Center for Alzheimer's Disease, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY, 10029, USA
11
Sharma S, Kumar S. Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps. Nature Computational Science 2021; 1:573-577. [PMID: 34734192 PMCID: PMC8560003 DOI: 10.1038/s43588-021-00129-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 02/10/2021] [Accepted: 08/13/2021] [Indexed: 12/30/2022]
Abstract
Felsenstein's bootstrap approach is widely used to assess confidence in species relationships inferred from multiple sequence alignments. It resamples sites randomly with replacement to build alignment replicates of the same size as the original alignment and infers a phylogeny from each replicate dataset. The proportion of phylogenies recovering the same grouping of species is its bootstrap confidence limit. But, standard bootstrap imposes a high computational burden in applications involving long sequence alignments. Here, we introduce the bag of little bootstraps approach to phylogenetics, bootstrapping only a few little samples, each containing a small subset of sites. We report that the median bagging of bootstrap confidence limits from little samples produces confidence in inferred species relationships similar to standard bootstrap but in a fraction of computational time and memory. Therefore, the little bootstraps approach can potentially enhance the rigor, efficiency, and parallelization of big data phylogenomic analyses.
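The standard bootstrap procedure described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the alignment is a dictionary of equal-length strings, and `infer_clades` is a hypothetical placeholder for whatever tree-inference method is used, returning the set of clades (groups of taxon names) in the inferred tree. It does not implement the little bootstraps method the paper introduces.

```python
import random
from collections import Counter


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns (sites) with replacement.

    The replicate has the same number of sites as the original alignment.
    """
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def bootstrap_support(alignment, infer_clades, n_replicates=100, seed=0):
    """Proportion of replicate phylogenies that recover each clade.

    `infer_clades` is a hypothetical callable: alignment -> set of clades,
    each clade a frozenset of taxon names.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        counts.update(infer_clades(bootstrap_replicate(alignment, rng)))
    return {clade: k / n_replicates for clade, k in counts.items()}
```

The cost of this loop grows with alignment length because every replicate is as long as the original, which is the burden the abstract says the little bootstraps approach aims to reduce.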
Affiliation(s)
- Sudip Sharma
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA
  - Department of Biology, Temple University, Philadelphia, PA
- Sudhir Kumar
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA
  - Department of Biology, Temple University, Philadelphia, PA
  - Center for Excellence in Genome Medicine and Research, King Abdulaziz University, Jeddah, Saudi Arabia
12
Description and molecular analysis of an Italian population of Centrorhynchus globocaudatus (Zeder, 1800) Lühe, 1911 (Acanthocephala: Centrorhynchidae) from Falco tinnunculus (Falconidae) and Buteo buteo (Accipitridae). J Helminthol 2020; 94:e207. [PMID: 33118894 DOI: 10.1017/s0022149x20000887] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 01/09/2023]
Abstract
Centrorhynchus globocaudatus (Zeder, 1800) Lühe, 1911 (Centrorhynchidae) was reported in birds of prey. Our population from Falco tinnunculus Linnaeus (Falconidae) and Buteo buteo Linnaeus (Accipitridae) in northern Italy was morphologically distinct from others described elsewhere. The worms are elongate and cylindrical. Proboscis long, apically truncated and bare, with wider base and variably faint constriction at point of attachment of receptacle. Large anterior hooks well rooted; posterior spiniform hooks with reduced roots; transitional hooks with scutiform roots in-between. Four tubular cement glands extend into prominent ducts overlapping a large Saefftigen's pouch. Bursa large, with sensory plates. Vagina with laterally slit orifice in sub-ventral pit of globular terminal extension. Thick-shelled eggs ovoid without polar prolongation of fertilization membrane. In our specimens, proboscis hooks, receptacle, male reproductive system, and lemnisci especially in males varied in size from those from Ukraine, India, Egypt, Kyrgyzstan, Russia, Georgia, Armenia and Asian Soviet Republics. Our description of the Italian specimens includes new morphological information supported by scanning electron microscopy and microscope images, molecular analysis and energy dispersive X-ray analysis (EDXA) of hooks. Additional new details of proboscis hook roots, micropores and micropore distribution are described. Metal composition of hooks (EDXA) demonstrated high levels of calcium and phosphorus, and high levels of sulphur in core and cortical layers of eggs. The molecular profile based on sequences of 18S and cytochrome c oxidase 1 genes is also provided, as well as phylogenetic reconstructions including all available sequences of the family Centrorhynchidae, although further sequences are needed in order to clarify their phylogenetic relationships.
13
Noah KE, Hao J, Li L, Sun X, Foley B, Yang Q, Xia X. Major Revisions in Arthropod Phylogeny Through Improved Supermatrix, With Support for Two Possible Waves of Land Invasion by Chelicerates. Evol Bioinform Online 2020; 16:1176934320903735. [PMID: 32076367 PMCID: PMC7003163 DOI: 10.1177/1176934320903735] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 12/20/2019] [Accepted: 01/02/2020] [Indexed: 01/04/2023]
Abstract
Deep phylogeny involving arthropod lineages is difficult to recover because the erosion of phylogenetic signals over time leads to unreliable multiple sequence alignment (MSA) and subsequent phylogenetic reconstruction. One way to alleviate the problem is to assemble a large number of gene sequences to compensate for the weakness in each individual gene. Such an approach has led to many robustly supported but contradictory phylogenies. A close examination shows that the supermatrix approach often suffers from two shortcomings. The first is that MSA is rarely checked for reliability and, as will be illustrated, can be poor. The second is that, to alleviate the problem of homoplasy at the third codon position of protein-coding genes due to convergent evolution of nucleotide frequencies, phylogeneticists may remove or degenerate the third codon position but may do it improperly and introduce new biases. We performed extensive reanalysis of one of such "big data" sets to highlight these two problems, and demonstrated the power and benefits of correcting or alleviating these problems. Our results support a new group with Xiphosura and Arachnopulmonata (Tetrapulmonata + Scorpiones) as sister taxa. This favors a new hypothesis in which the ancestor of Xiphosura and the extinct Eurypterida (sea scorpions, of which many later forms lived in brackish or freshwater) returned to the sea after the initial chelicerate invasion of land. Our phylogeny is supported even with the original data but processed with a new "principled" codon degeneration. We also show that removing the 1673 codon sites with both AGN and UCN codons (encoding serine) in our alignment can partially reconcile discrepancies between nucleotide-based and AA-based tree, partly because two sequences, one with AGN and the other with UCN, would be identical at the amino acid level but quite different at the nucleotide level.
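The closing point of this abstract, that two serine codons from different families match at the amino acid level yet differ at the nucleotide level, can be checked with a small sketch. It assumes the standard genetic code, in which serine is encoded by the four UCN codons plus AGU and AGC; the names below are illustrative and do not come from the paper.

```python
# Serine codons under the standard genetic code (RNA alphabet); an assumption
# of this sketch, since other genetic codes assign some codons differently.
SERINE_CODONS = frozenset({"UCU", "UCC", "UCA", "UCG", "AGU", "AGC"})


def encodes_serine(codon):
    """True if the codon encodes serine under the standard genetic code."""
    return codon.upper() in SERINE_CODONS


def nucleotide_differences(codon_a, codon_b):
    """Number of positions at which two equal-length codons differ."""
    if len(codon_a) != len(codon_b):
        raise ValueError("codons must have the same length")
    return sum(a != b for a, b in zip(codon_a.upper(), codon_b.upper()))


# One codon from each serine family, differing at the first two positions.
pair = ("AGC", "UCC")
same_amino_acid = all(encodes_serine(c) for c in pair)
mismatches = nucleotide_differences(*pair)
```

A site holding such a pair contributes no difference to an amino-acid-based distance but two differences to a nucleotide-based one, which is the kind of discrepancy the abstract says is reduced by removing those sites.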
Affiliation(s)
- Jiasheng Hao
  - College of Life Sciences, Anhui Normal University, Wuhu, China
- Luyan Li
  - Nanjing Institute of Geology and Paleontology, Chinese Academy of Sciences, Nanjing, China
- Xiaoyan Sun
  - Nanjing Institute of Geology and Paleontology, Chinese Academy of Sciences, Nanjing, China
- Brian Foley
  - Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM, USA
- Qun Yang
  - Nanjing Institute of Geology and Paleontology, Chinese Academy of Sciences, Nanjing, China
- Xuhua Xia
  - Department of Biology, University of Ottawa, Ottawa, ON, Canada
  - Ottawa Institute of Systems Biology, University of Ottawa, Ottawa, ON, Canada
14
Cornetti L, Fields PD, Van Damme K, Ebert D. A fossil-calibrated phylogenomic analysis of Daphnia and the Daphniidae. Mol Phylogenet Evol 2019; 137:250-262. [DOI: 10.1016/j.ympev.2019.05.018] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 09/21/2018] [Revised: 05/03/2019] [Accepted: 05/20/2019] [Indexed: 11/16/2022]
15
Lambert MÈ, Arsenault J, Delisle B, Audet P, Poljak Z, D'Allaire S. Impact of alignment algorithm on the estimation of pairwise genetic similarity of porcine reproductive and respiratory syndrome virus (PRRSV). BMC Vet Res 2019; 15:135. [PMID: 31068211 PMCID: PMC6505299 DOI: 10.1186/s12917-019-1890-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 11/09/2018] [Accepted: 04/29/2019] [Indexed: 12/19/2022]
Abstract
Background: Porcine reproductive and respiratory syndrome (PRRS) is a major threat to the swine industry. It is caused by the PRRS virus (PRRSV). Determination and comparison of the nucleotide sequences of PRRSV strains provides useful information in support of control initiatives or epidemiological studies on transmission patterns. The alignment of sequences is the first step in analyzing sequence data, with multiple algorithms being available, but little is known about the impact of this methodological choice. Here, a study was conducted to evaluate the impact of different alignment algorithms on the resulting aligned sequence dataset and on practical issues when applied to a large field database of PRRSV open reading frame (ORF) 5 sequences collected in Quebec, Canada, from 2010 to 2014. Five multiple sequence alignment programs were compared: Clustal W, Clustal Omega, Muscle, T-Coffee and MAFFT.
Results: The resulting alignments showed very similar results in terms of average pairwise genetic similarity, proportion of pairwise comparisons having ≥97.5% genetic similarity and sum of pairs (SP) score, except for T-Coffee, where increased length of aligned datasets as well as limitations in handling large datasets were observed.
Conclusions: Based on efficiency at minimizing the number of gaps in different dataset sizes with default open gap values as well as the capability to handle a large number of sequences in a timely manner, the use of Clustal Omega might be recommended for the management of extensive PRRSV databases for both research and surveillance purposes.
Affiliation(s)
- Marie-Ève Lambert
- Laboratoire d'épidémiologie et de médecine porcine (LEMP), Faculty of Veterinary Medicine, Université de Montréal, St. Hyacinthe, Quebec, Canada; Swine and Poultry Infectious Diseases Research Center (CRIPA), Faculty of Veterinary Medicine, Université de Montréal, St. Hyacinthe, Quebec, Canada.
| | - Julie Arsenault
- Laboratoire d'épidémiologie et de médecine porcine (LEMP), Faculty of Veterinary Medicine, Université de Montréal, St. Hyacinthe, Quebec, Canada; Swine and Poultry Infectious Diseases Research Center (CRIPA), Faculty of Veterinary Medicine, Université de Montréal, St. Hyacinthe, Quebec, Canada.
| | - Benjamin Delisle
- Laboratoire d'épidémiologie et de médecine porcine (LEMP), Faculty of Veterinary Medicine, Université de Montréal, St. Hyacinthe, Quebec, Canada; Swine and Poultry Infectious Diseases Research Center (CRIPA), Faculty of Veterinary Medicine, Université de Montréal, St. Hyacinthe, Quebec, Canada.
| | - Pascal Audet
- Laboratoire d'épidémiologie et de médecine porcine (LEMP), Faculty of Veterinary Medicine, Université de Montréal, St. Hyacinthe, Quebec, Canada; Swine and Poultry Infectious Diseases Research Center (CRIPA), Faculty of Veterinary Medicine, Université de Montréal, St. Hyacinthe, Quebec, Canada.
| | - Zvonimir Poljak
- Department of Population Medicine, Ontario Veterinary College, University of Guelph, Guelph, Ontario, Canada.
| | - Sylvie D'Allaire
- Laboratoire d'épidémiologie et de médecine porcine (LEMP), Faculty of Veterinary Medicine, Université de Montréal, St. Hyacinthe, Quebec, Canada; Swine and Poultry Infectious Diseases Research Center (CRIPA), Faculty of Veterinary Medicine, Université de Montréal, St. Hyacinthe, Quebec, Canada.
| |
|
16
|
Effects of missing data and data type on phylotranscriptomic analysis of stony corals (Cnidaria: Anthozoa: Scleractinia). Mol Phylogenet Evol 2019; 134:12-23. [DOI: 10.1016/j.ympev.2019.01.012] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 06/29/2018] [Revised: 01/11/2019] [Accepted: 01/17/2019] [Indexed: 01/28/2023]
|
17
|
Abstract
Codon usage depends on mutation bias, tRNA-mediated selection, and the need for high efficiency and accuracy in translation. One codon in a synonymous codon family is often strongly over-used, especially in highly expressed genes, which often leads to a high dN/dS ratio because dS is very small. Many different codon usage indices have been proposed to measure codon usage and codon adaptation. Sense codons can be misread by release factors and stop codons by tRNAs, which also contributes to codon usage in rare cases. This chapter outlines the conceptual framework of codon evolution, illustrates codon-specific and gene-specific codon usage indices, and presents their applications. A new index for codon adaptation that accounts for background mutation bias (Index of Translation Elongation) is presented and contrasted with the codon adaptation index (CAI), which does not consider background mutation bias. Both are used to re-analyze data from a recent paper claiming that translation elongation efficiency matters little in protein production. The reanalysis disproves the claim.
|
18
|
Chowdhury B, Garai G. A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics 2017; 109:419-431. [PMID: 28669847 DOI: 10.1016/j.ygeno.2017.06.007] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.6] [Received: 04/26/2017] [Revised: 05/27/2017] [Accepted: 06/27/2017] [Indexed: 01/04/2023]
Abstract
Sequence alignment is an active research area in bioinformatics. It is also a crucial task, as it guides many other tasks such as phylogenetic analysis and function and/or structure prediction of biological macromolecules such as DNA, RNA, and protein. Proteins are the building blocks of every living organism. Although the protein alignment problem has been studied for several decades, available methods still produce different results for the same alignment problem. Multiple sequence alignment is a problem of very high computational complexity. Many stochastic methods have therefore been considered for improving alignment accuracy, and among them the genetic algorithm is frequently used. In this study, we review the different types of methods applied to alignment and the recent trends in multiobjective genetic algorithms for solving multiple sequence alignment. Many recent studies have demonstrated considerable progress in improving alignment accuracy.
Affiliation(s)
- Biswanath Chowdhury
- Department of Biophysics, Molecular Biology and Bioinformatics, University of Calcutta, Kolkata, WB, 700009, India.
| | - Gautam Garai
- Computational Sciences Division, Saha Institute of Nuclear Physics, Kolkata, WB 700064, India.
| |
|
19
|
Arribas-Gil A, Matias C. A time warping approach to multiple sequence alignment. Stat Appl Genet Mol Biol 2017; 16:133-144. [PMID: 28593899 DOI: 10.1515/sagmb-2016-0043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
We propose an approach for multiple sequence alignment (MSA) derived from the dynamic time warping viewpoint and recent techniques of curve synchronization developed in the context of functional data analysis. Starting from pairwise alignments of all the sequences (viewed as paths in a certain space), we construct a median path that represents the MSA we are looking for. We establish a proof of concept that our method could be an interesting ingredient to include in refined MSA techniques. We present a simple synthetic experiment as well as the study of a benchmark dataset, together with comparisons with two widely used MSA software packages.
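For readers unfamiliar with the dynamic time warping viewpoint mentioned in this abstract, the following is a minimal sketch of the classic DTW recurrence on two numeric series. It is not the authors' median-path construction; the function name `dtw_distance` and the absolute-difference cost are assumptions chosen for illustration.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost with absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest warping of a[:i] onto b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # extend the best of: repeat b[j-1], repeat a[i-1], or advance both
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# A repeated value in the second series is absorbed by warping at no cost.
stretched = dtw_distance([1, 2, 3, 4], [1, 2, 2, 3, 4])
shifted = dtw_distance([0, 0], [1, 1])
```

The warping path through the cost matrix plays the same role as an alignment path between two sequences, which is the analogy the abstract builds on.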
|
20
|
Ayad LAK, Pissis SP. MARS: improving multiple circular sequence alignment using refined sequences. BMC Genomics 2017; 18:86. [PMID: 28088189 PMCID: PMC5237495 DOI: 10.1186/s12864-016-3477-5] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 10/04/2016] [Accepted: 12/26/2016] [Indexed: 12/04/2022] Open
Abstract
Background A fundamental assumption of all widely-used multiple sequence alignment techniques is that the left- and right-most positions of the input sequences are relevant to the alignment. However, the position where a sequence starts or ends can be totally arbitrary due to a number of reasons: arbitrariness in the linearisation (sequencing) of a circular molecular structure; or inconsistencies introduced into sequence databases due to different linearisation standards. These scenarios are relevant, for instance, in the process of multiple sequence alignment of mitochondrial DNA, viroid, viral or other genomes, which have a circular molecular structure. A solution for these inconsistencies would be to identify a suitable rotation (cyclic shift) for each sequence; these refined sequences may in turn lead to improved multiple sequence alignments using the preferred multiple sequence alignment program. Results We present MARS, a new heuristic method for improving Multiple circular sequence Alignment using Refined Sequences. MARS was implemented in the C++ programming language as a program to compute the rotations (cyclic shifts) required to best align a set of input sequences. Experimental results, using real and synthetic data, show that MARS improves the alignments, with respect to standard genetic measures and the inferred maximum-likelihood-based phylogenies, and outperforms state-of-the-art methods both in terms of accuracy and efficiency. Our results show, among others, that the average pairwise distance in the multiple sequence alignment of a dataset of widely-studied mitochondrial DNA sequences is reduced by around 5% when MARS is applied before a multiple sequence alignment is performed. Conclusions Analysing multiple sequences simultaneously is fundamental in biological research and multiple sequence alignment has been found to be a popular method for this task. 
Conventional alignment techniques cannot be used effectively when the position where sequences start is arbitrary. We present here a method, which can be used in conjunction with any multiple sequence alignment program, to address this problem effectively and efficiently.
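The idea of choosing a rotation (cyclic shift) before alignment can be sketched with a brute-force search. This is not the MARS heuristic, which the abstract describes only at a high level; the sketch assumes two equal-length sequences and scores each rotation by Hamming distance, and the names `hamming` and `best_rotation` are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_rotation(reference, query):
    """Return (shift, rotated_query) minimising Hamming distance to reference."""
    if len(reference) != len(query):
        raise ValueError("this sketch only handles equal-length sequences")
    best_shift = min(
        range(len(query)),
        key=lambda s: hamming(reference, query[s:] + query[:s]),
    )
    return best_shift, query[best_shift:] + query[:best_shift]


reference = "ACGTTGCA"
query = "TGCAACGT"  # same circular molecule, linearised at a different position
shift, rotated = best_rotation(reference, query)
```

Trying every shift costs quadratic time per pair, so a tool intended for whole mitochondrial genomes would need something faster; the sketch only shows what a "refined sequence" is.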
Affiliation(s)
- Lorraine A K Ayad
- Department of Informatics, King's College London, Strand, London, WC2R 2LS, UK
| | - Solon P Pissis
- Department of Informatics, King's College London, Strand, London, WC2R 2LS, UK.
| |
|
21
|
Chiner-Oms A, González-Candelas F. EvalMSA: A Program to Evaluate Multiple Sequence Alignments and Detect Outliers. Evol Bioinform Online 2016; 12:277-284. [PMID: 27920488 PMCID: PMC5127606 DOI: 10.4137/ebo.s40583] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 07/18/2016] [Revised: 10/02/2016] [Accepted: 10/05/2016] [Indexed: 12/01/2022] Open
Abstract
We present EvalMSA, a software tool for evaluating and detecting outliers in multiple sequence alignments (MSAs). This tool allows the identification of divergent sequences in MSAs by scoring the contribution of each row in the alignment to its quality using a sum-of-pairs-based method and additional analyses. Our main goal is to provide users with objective data so that they can make informed decisions about the relevance and/or pertinence of including or retaining a particular sequence in an MSA. EvalMSA is written in standard Perl and also uses some routines from the statistical language R. Therefore, it is necessary to install the R-base package in order to get full functionality. Binary packages are freely available from http://sourceforge.net/projects/evalmsa/ for Linux and Windows.
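A per-row sum-of-pairs contribution of the kind described here can be sketched as follows. This is not EvalMSA's code (which is written in Perl and R); the scoring values (match 1, mismatch -1, gap -2) and the names `pair_score` and `row_contributions` are assumptions for illustration.

```python
def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows column by column; gap-gap columns are ignored."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def row_contributions(alignment):
    """For each row, sum its pair scores against every other row."""
    return [
        sum(pair_score(row, other) for j, other in enumerate(alignment) if j != i)
        for i, row in enumerate(alignment)
    ]


# Hypothetical toy alignment; the last row is deliberately divergent.
msa = ["ACGT", "ACGT", "ACGA", "TTTT"]
scores = row_contributions(msa)
lowest = scores.index(min(scores))
```

A row whose contribution is far below the others, as with the last row here, is a candidate outlier; deciding how far is "far" is the statistical part that this sketch leaves out.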
Affiliation(s)
- Alvaro Chiner-Oms
- Joint Research Unit "Infection and Public Health" FISABIO, Cavanilles Institute for Biodiversity and Evolutionary Biology, University of Valencia, Paterna, Valencia, Spain; CIBER in Epidemiology and Public Health, Madrid, Spain.
| | - Fernando González-Candelas
- Joint Research Unit "Infection and Public Health" FISABIO, Cavanilles Institute for Biodiversity and Evolutionary Biology, University of Valencia, Paterna, Valencia, Spain; CIBER in Epidemiology and Public Health, Madrid, Spain.
| |
|
22
|
Whittle CA, Extavour CG. Refuting the hypothesis that the acquisition of germ plasm accelerates animal evolution. Nat Commun 2016; 7:12637. [PMID: 27577604 PMCID: PMC5013649 DOI: 10.1038/ncomms12637] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 09/10/2015] [Accepted: 07/20/2016] [Indexed: 02/04/2023] Open
Abstract
Primordial germ cells (PGCs) give rise to the germ line in animals. PGCs are specified during embryogenesis either by an ancestral mechanism of cell-cell signalling (induction) or by a derived mechanism of maternally provided germ plasm (preformation). Recently, a hypothesis was set forth purporting that germ plasm liberates selective constraint and accelerates an organism's protein sequence evolution, especially for genes from early developmental stages, thereby leading to animal species radiations; empirical validation has been claimed in vertebrates. Here we present findings from global rates of protein evolution in vertebrates and invertebrates refuting this hypothesis. Contrary to assertions of the hypothesis, we find no effect of preformation on protein sequence evolution, the evolutionary rates of early-stage developmental genes, or on species diversification. We conclude that the hypothesis is mechanistically implausible, and our multi-faceted analysis shows no empirical support for any of its predictions.
Affiliation(s)
- Carrie A. Whittle
- Department of Organismic and Evolutionary Biology, Harvard University, 16 Divinity Avenue, Cambridge, Massachusetts 02138, USA
| | - Cassandra G. Extavour
- Department of Organismic and Evolutionary Biology, Harvard University, 16 Divinity Avenue, Cambridge, Massachusetts 02138, USA
- Department of Molecular and Cellular Biology, Harvard University, 16 Divinity Avenue, Cambridge, Massachusetts 02138, USA
| |
|
23
|
Coppola CJ, C Ramaker R, Mendenhall EM. Identification and function of enhancers in the human genome. Hum Mol Genet 2016; 25:R190-R197. [PMID: 27402881 DOI: 10.1093/hmg/ddw216] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Received: 06/27/2016] [Accepted: 06/30/2016] [Indexed: 12/31/2022] Open
Abstract
The study of gene regulation has rapidly advanced by leveraging next-generation sequencing to identify and characterize the cis and trans elements that are critical for defining cell identity. These advances have paralleled a movement towards whole genome sequencing in clinics. These two tracks have increasingly synergized to underscore the importance of cis-regulatory elements in development, as well as to produce countless studies implicating these elements in human disease. Other studies have emphasized the clinical phenotypes associated with variation or mutations in trans factors, including non-coding RNAs and chromatin regulators. These studies highlight the importance of obtaining a comprehensive understanding of mammalian gene regulation for predicting the impact of genetic variation on patient phenotypes. Currently lagging behind the generation of vast datasets and annotations is our ability to examine these putative elements in the dynamic context of a developing organism.
Affiliation(s)
| | - Ryne C Ramaker
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA; University of Alabama at Birmingham, Birmingham, AL, USA
| | - Eric M Mendenhall
- University of Alabama in Huntsville, Huntsville, AL, USA; HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
| |
|
24
|
Xia X. PhyPA: Phylogenetic method with pairwise sequence alignment outperforms likelihood methods in phylogenetics involving highly diverged sequences. Mol Phylogenet Evol 2016; 102:331-43. [PMID: 27377322 DOI: 10.1016/j.ympev.2016.07.001] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 04/07/2016] [Accepted: 07/01/2016] [Indexed: 11/30/2022]
Abstract
While pairwise sequence alignment (PSA) by dynamic programming is guaranteed to generate one of the optimal alignments, multiple sequence alignment (MSA) of highly divergent sequences often results in poorly aligned sequences, plaguing all subsequent phylogenetic analysis. One way to avoid this problem is to use only PSA to reconstruct phylogenetic trees, which can only be done with distance-based methods. I compared the accuracy of this new computational approach (named PhyPA for phylogenetics by pairwise alignment) against the maximum likelihood method using MSA (the ML+MSA approach), based on nucleotide, amino acid and codon sequences simulated with different topologies and tree lengths. I present a surprising discovery that the fast PhyPA method consistently outperforms the slow ML+MSA approach for highly diverged sequences even when all optimization options were turned on for the ML+MSA approach. Only when sequences are not highly diverged (i.e., when a reliable MSA can be obtained) does the ML+MSA approach outperform PhyPA. The true topologies are always recovered by ML with the true alignment from the simulation. However, with MSA derived from alignment programs such as MAFFT or MUSCLE, the recovered topology consistently has higher likelihood than that for the true topology. Thus, the failure of ML+MSA to recover the true topology is due not to an insufficient search of tree space but to the distortion of phylogenetic signal by MSA methods. I have implemented in DAMBE PhyPA and two approaches making use of multi-gene data sets to derive phylogenetic support for subtrees equivalent to resampling techniques such as bootstrapping and jackknifing.
Affiliation(s)
- Xuhua Xia
- Department of Biology, University of Ottawa, 30 Marie Curie, Ottawa K1N 6N5, Canada; Ottawa Institute of Systems Biology, 451 Smyth Road, Ottawa, ON K1H 8M5, Canada.
| |
|
25
|
Ezawa K. Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map. BMC Bioinformatics 2016; 17:133. [PMID: 26992851 PMCID: PMC4799563 DOI: 10.1186/s12859-016-0945-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 11/06/2015] [Accepted: 02/11/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Reconstruction of multiple sequence alignments (MSAs) is a crucial step in most homology-based sequence analyses, which constitute an integral part of computational biology. To improve the accuracy of this crucial step, it is essential to better characterize errors that state-of-the-art aligners typically make. For this purpose, we here introduce two tools: the complete-likelihood score and the position-shift map. RESULTS The logarithm of the total probability of a MSA under a stochastic model of sequence evolution along a time axis via substitutions, insertions and deletions (called the "complete-likelihood score" here) can serve as an ideal score of the MSA. A position-shift map, which maps the difference in each residue's position between two MSAs onto one of them, can clearly visualize where and how MSA errors occurred and help disentangle composite errors. To characterize MSA errors using these tools, we constructed three sets of simulated MSAs of selectively neutral mammalian DNA sequences, with small, moderate and large divergences, under a stochastic evolutionary model with an empirically common power-law insertion/deletion length distribution. Then, we reconstructed MSAs using MAFFT and Prank as representative state-of-the-art single-optimum-search aligners. About 40-99% of the hundreds of thousands of gapped segments were involved in alignment errors. In a substantial fraction, from about 1/4 to over 3/4, of erroneously reconstructed segments, reconstructed MSAs by each aligner showed complete-likelihood scores not lower than those of the true MSAs. Out of the remaining errors, a majority by an iterative option of MAFFT showed discrepancies between the aligner-specific score and the complete-likelihood score, and a majority by Prank seemed due to inadequate exploration of the MSA space. 
Analyses by position-shift maps indicated that true MSAs are in considerable neighborhoods of reconstructed MSAs in about 80-99% of the erroneous segments for small and moderate divergences, but in only a minority for large divergences. CONCLUSIONS The results of this study suggest that measures to further improve the accuracy of reconstructed MSAs would substantially differ depending on the types of aligners. They also re-emphasize the importance of obtaining a probability distribution of fairly likely MSAs, instead of just searching for a single optimum MSA.
Affiliation(s)
- Kiyoshi Ezawa
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Iizuka, 820-8502, Japan; Department of Biology and Biochemistry, University of Houston, Houston, TX, 77204-5001, USA.
| |
|
26
|
Over-represented pairwise 16S rRNA gene sequence distance levels among prokaryotes. ANN MICROBIOL 2016. [DOI: 10.1007/s13213-015-1107-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022] Open
|
27
|
Galpert D, del Río S, Herrera F, Ancede-Gallardo E, Antunes A, Agüero-Chapin G. An Effective Big Data Supervised Imbalanced Classification Approach for Ortholog Detection in Related Yeast Species. BIOMED RESEARCH INTERNATIONAL 2015; 2015:748681. [PMID: 26605337 PMCID: PMC4641943 DOI: 10.1155/2015/748681] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 04/07/2015] [Revised: 07/26/2015] [Accepted: 08/20/2015] [Indexed: 11/17/2022]
Abstract
Orthology detection requires more effective scaling algorithms. In this paper, a set of gene pair features based on similarity measures (alignment scores, sequence length, gene membership to conserved regions, and physicochemical profiles) are combined in a supervised pairwise ortholog detection approach to improve effectiveness considering low ortholog ratios in relation to the possible pairwise comparison between two genomes. In this scenario, big data supervised classifiers managing imbalance between ortholog and nonortholog pair classes allow for an effective scaling solution built from two genomes and extended to other genome pairs. The supervised approach was compared with RBH, RSD, and OMA algorithms by using the following yeast genome pairs: Saccharomyces cerevisiae-Kluyveromyces lactis, Saccharomyces cerevisiae-Candida glabrata, and Saccharomyces cerevisiae-Schizosaccharomyces pombe as benchmark datasets. Because of the large amount of imbalanced data, the building and testing of the supervised model were only possible by using big data supervised classifiers managing imbalance. Evaluation metrics taking low ortholog ratios into account were applied. From the effectiveness perspective, MapReduce Random Oversampling combined with Spark SVM outperformed RBH, RSD, and OMA, probably because of the consideration of gene pair features beyond alignment similarities combined with the advances in big data supervised classification.
Affiliation(s)
- Deborah Galpert
- Departamento de Ciencias de la Computación, Universidad Central “Marta Abreu” de Las Villas (UCLV), 54830 Santa Clara, Cuba
| | - Sara del Río
- Department of Computer Science and Artificial Intelligence, Research Center on Information and Communications Technology (CITIC-UGR), University of Granada, 18071 Granada, Spain
| | - Francisco Herrera
- Department of Computer Science and Artificial Intelligence, Research Center on Information and Communications Technology (CITIC-UGR), University of Granada, 18071 Granada, Spain
| | - Evys Ancede-Gallardo
- Centro de Bioactivos Químicos, Universidad Central “Marta Abreu” de Las Villas (UCLV), 54830 Santa Clara, Cuba
| | - Agostinho Antunes
- Centro Interdisciplinar de Investigação Marinha e Ambiental (CIMAR/CIIMAR), Universidade do Porto, Rua dos Bragas 177, 4050-123 Porto, Portugal
- Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 4169-007 Porto, Portugal
| | - Guillermin Agüero-Chapin
- Centro de Bioactivos Químicos, Universidad Central “Marta Abreu” de Las Villas (UCLV), 54830 Santa Clara, Cuba
- Centro Interdisciplinar de Investigação Marinha e Ambiental (CIMAR/CIIMAR), Universidade do Porto, Rua dos Bragas 177, 4050-123 Porto, Portugal
| |
|
28
|
Ibarra-Laclette E, Zamudio-Hernández F, Pérez-Torres CA, Albert VA, Ramírez-Chávez E, Molina-Torres J, Fernández-Cortes A, Calderón-Vázquez C, Olivares-Romero JL, Herrera-Estrella A, Herrera-Estrella L. De novo sequencing and analysis of Lophophora williamsii transcriptome, and searching for putative genes involved in mescaline biosynthesis. BMC Genomics 2015; 16:657. [PMID: 26330142 PMCID: PMC4557841 DOI: 10.1186/s12864-015-1821-9] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 11/24/2014] [Accepted: 08/07/2015] [Indexed: 12/04/2022] Open
Abstract
Background Lophophora williamsii (commonly named peyote) is a small, spineless cactus with psychoactive alkaloids, particularly mescaline. Peyote utilizes crassulacean acid metabolism (CAM), an alternative form of photosynthesis that exists in succulents such as cacti and other desert plants. Therefore, its transcriptome can be considered an important resource for future research focused on understanding how these plants make more efficient use of water in marginal environments and also for research focused on better understanding of the overall mechanisms leading to production of plant natural products and secondary metabolites. Results In this study, two cDNA libraries were generated from L. williamsii. These libraries, representing buttons (tops of stems) and roots were sequenced using different sequencing platforms (GS-FLX, GS-Junior and PGM, respectively). A total of 5,541,550 raw reads were generated, which were assembled into 63,704 unigenes with an average length of 564.04 bp. A total of 25,149 unigenes (62.19 %) was annotated using public databases. 681 unigenes were found to be differentially expressed when comparing the two libraries, where 400 were preferentially expressed in buttons and 281 in roots. Some of the major alkaloids, including mescaline, were identified by GC-MS and relevant metabolic pathways were reconstructed using the Kyoto encyclopedia of genes and genomes database (KEGG). Subsequently, the expression patterns of preferentially expressed genes putatively involved in mescaline production were examined and validated by qRT-PCR. Conclusions High throughput transcriptome sequencing (RNA-seq) analysis allowed us to efficiently identify candidate genes involved in mescaline biosynthetic pathway in L. williamsii; these included tyrosine/DOPA decarboxylase, hydroxylases, and O-methyltransferases. This study sets the theoretical foundation for bioassay design directed at confirming the participation of these genes in mescaline production. 
Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1821-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Enrique Ibarra-Laclette
- Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO), Centro de Investigación y Estudios Avanzados del IPN, 36500, Irapuato, Guanajuato, México; Red de Estudios Moleculares Avanzados, Instituto de Ecología A.C., 91070, Xalapa, Veracruz, México.
| | - Flor Zamudio-Hernández
- Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO), Centro de Investigación y Estudios Avanzados del IPN, 36500, Irapuato, Guanajuato, México.
| | - Claudia Anahí Pérez-Torres
- Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO), Centro de Investigación y Estudios Avanzados del IPN, 36500, Irapuato, Guanajuato, México; Red de Estudios Moleculares Avanzados, Instituto de Ecología A.C., 91070, Xalapa, Veracruz, México; Investigador Cátedra CONACyT, Instituto de Ecología A.C., 91070, Xalapa, Veracruz, México.
| | - Victor A Albert
- Department of Biological Sciences, University at Buffalo, Buffalo, New York, 14260, USA.
| | - Enrique Ramírez-Chávez
- Departamento de Biotecnología y Bioquímica, Unidad Irapuato, Centro de Investigación y de Estudios Avanzados del IPN, 36821, Irapuato, Guanajuato, México.
| | - Jorge Molina-Torres
- Departamento de Biotecnología y Bioquímica, Unidad Irapuato, Centro de Investigación y de Estudios Avanzados del IPN, 36821, Irapuato, Guanajuato, México.
| | - Araceli Fernández-Cortes
- Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO), Centro de Investigación y Estudios Avanzados del IPN, 36500, Irapuato, Guanajuato, México.
| | - Carlos Calderón-Vázquez
- Centro Interdisciplinario de Investigación para el Desarrollo Integral Regional (CIIDIR), Instituto Politécnico Nacional, 81000, Guasave, Sinaloa, México.
| | | | - Alfredo Herrera-Estrella
- Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO), Centro de Investigación y Estudios Avanzados del IPN, 36500, Irapuato, Guanajuato, México.
| | - Luis Herrera-Estrella
- Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO), Centro de Investigación y Estudios Avanzados del IPN, 36500, Irapuato, Guanajuato, México.
| |
|
29
|
Homology-independent metrics for comparative genomics. Comput Struct Biotechnol J 2015; 13:352-7. [PMID: 26029354 PMCID: PMC4446528 DOI: 10.1016/j.csbj.2015.04.005] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 02/04/2015] [Revised: 04/06/2015] [Accepted: 04/18/2015] [Indexed: 11/24/2022] Open
Abstract
A mainstream procedure to analyze the wealth of genomic data available nowadays is the detection of homologous regions shared across genomes, followed by the extraction of biological information from the patterns of conservation and variation observed in such regions. Although of pivotal importance, comparative genomic procedures that rely on homology inference are not applicable if no homologous regions are detectable. This fact excludes a considerable portion of “genomic dark matter” with no significant similarity (and, consequently, no inferred homology to any other known sequence) from several downstream comparative genomic methods. In this review we compile several sequence metrics that do not rely on homology inference and can be used to compare nucleotide sequences and extract biologically meaningful information from them. These metrics comprise several compositional parameters calculated from sequence data alone, such as GC content, dinucleotide odds ratio, and several codon bias metrics. They also share other interesting properties, such as pervasiveness (patterns persist on smaller scales) and phylogenetic signal. We also cite examples where these homology-independent metrics have been successfully applied to support several bioinformatics challenges, such as taxonomic classification of biological sequences without homology inference. They were also used to detect higher-order patterns of interactions in biological systems, ranging from detecting coevolutionary trends between the genomes of viruses and their hosts to characterization of gene pools of entire microbial communities. We argue that, if correctly understood and applied, homology-independent metrics can add important layers of biological information in comparative genomic studies without prior homology inference.
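Two of the compositional metrics named in this abstract, GC content and the dinucleotide odds ratio, can be computed from a sequence alone. The sketch below uses the common single-strand definition rho_xy = f_xy / (f_x * f_y); the review may use a strand-symmetrised variant, so treat this definition and the function names as assumptions.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def dinucleotide_odds_ratio(seq):
    """rho_xy = f_xy / (f_x * f_y) for every dinucleotide observed in seq."""
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = len(seq), len(seq) - 1
    return {
        pair: (count / n_di) / ((mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono))
        for pair, count in di.items()
    }


# Hypothetical short sequence; real analyses use whole genes or genomes.
example = "ATGCGCGCTA"
gc = gc_content(example)
rho = dinucleotide_odds_ratio(example)
```

Values of rho well above or below 1 indicate over- or under-representation of a dinucleotide relative to base composition. On a sequence as short as this example the ratios are dominated by sampling noise, so the numbers only illustrate the arithmetic.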
|
30
|
Affiliation(s)
| | - David Posada
- Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo 36310, Spain
| |
|
31
|
Abstract
Background Sequence alignment has become an indispensable tool in modern molecular biology research, and probabilistic sequence alignment models have been shown to provide an effective framework for building accurate sequence alignment tools. One such example is the pair hidden Markov model (pair-HMM), which has been especially popular in comparative sequence analysis for several reasons, including its effectiveness in modeling and detecting sequence homology, model simplicity, and the existence of efficient algorithms for applying the model to sequence alignment problems. However, despite these advantages, pair-HMMs also have a number of practical limitations that may degrade their alignment performance or render them unsuitable for certain alignment tasks. Results In this work, we propose a novel scheme for comparing and aligning biological sequences that can effectively address the shortcomings of traditional pair-HMMs. The proposed scheme is based on a simple message-passing approach, where messages are exchanged between neighboring symbol pairs that may be potentially aligned in the optimal sequence alignment. The message-passing process yields probabilistic symbol alignment confidence scores, which may be used for predicting the optimal alignment that maximizes the expected number of correctly aligned symbol pairs. Conclusions Extensive performance evaluation on protein alignment benchmark datasets shows that the proposed message-passing scheme clearly outperforms the traditional pair-HMM-based approach, in terms of both alignment accuracy and computational efficiency. Furthermore, the proposed scheme is numerically robust and amenable to massive parallelization.
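The final step described in this abstract, predicting the alignment that maximizes the expected number of correctly aligned symbol pairs from per-pair confidence scores, is commonly done with a simple dynamic program. The sketch below assumes the confidence scores are already available as a matrix `post`; how the message-passing scheme produces them is not shown, and the function name `mea_alignment` is hypothetical.

```python
def mea_alignment(post):
    """post[i][j]: confidence that symbol i of x aligns with symbol j of y.

    Returns (expected number of correct pairs, list of aligned index pairs).
    """
    n, m = len(post), len(post[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + post[i - 1][j - 1],  # align i with j
                score[i - 1][j],                           # leave i unaligned
                score[i][j - 1],                           # leave j unaligned
            )
    # Trace back through the table to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + post[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return score[n][m], pairs[::-1]


# Hypothetical confidence matrix for a 2-symbol and a 3-symbol sequence.
post = [[0.9, 0.1, 0.0],
        [0.1, 0.2, 0.7]]
expected_correct, aligned = mea_alignment(post)
```

Gaps carry no penalty in this formulation: the objective is the summed confidence of the pairs that are aligned, so low-confidence pairs are simply left out.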
|
32
|
Sahraeian SME, Yoon BJ. PicXAA: a probabilistic scheme for finding the maximum expected accuracy alignment of multiple biological sequences. Methods Mol Biol 2014; 1079:203-210. [PMID: 24170404 DOI: 10.1007/978-1-62703-646-7_13] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]
Abstract
PicXAA is a probabilistic nonprogressive alignment algorithm that finds protein (or DNA) multiple sequence alignments with maximum expected accuracy. PicXAA greedily builds up the alignment from sequence regions with high local similarity, thereby yielding an accurate global alignment that effectively captures the local similarities across sequences. PicXAA constantly yields accurate alignment results on a wide range of reference sets that have different characteristics, with especially remarkable improvements over other leading algorithms on sequence sets with high local similarities. In this chapter, we describe the overall alignment strategy used in PicXAA and discuss several important considerations for effective deployment of the algorithm.
|
33
|
Abstract
SUMMARY We developed PSAR-Align, a multiple sequence realignment tool that can refine a given multiple sequence alignment based on suboptimal alignments generated by probabilistic sampling. Our evaluation demonstrated that PSAR-Align is able to improve the results from various multiple sequence alignment tools. AVAILABILITY AND IMPLEMENTATION The PSAR-Align source code (implemented mainly in C++) is freely available for download at http://bioen-compbio.bioen.illinois.edu/PSAR-Align.
Affiliation(s)
- Jaebum Kim
- Department of Animal Biotechnology, UBITA Center for Biotechnology Research (CBRU), Konkuk University, Seoul 143-701, Korea, Department of Bioengineering and Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
|
34
|
Manning T, Sleator RD, Walsh P. Naturally selecting solutions: the use of genetic algorithms in bioinformatics. Bioengineered 2013; 4:266-78. [PMID: 23222169 PMCID: PMC3813526 DOI: 10.4161/bioe.23041] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2012] [Revised: 11/26/2012] [Accepted: 11/28/2012] [Indexed: 11/19/2022] Open
Abstract
For decades, computer scientists have looked to nature for biologically inspired solutions to computational problems, ranging from robotic control to scheduling optimization. Paradoxically, as we move deeper into the post-genomics era, the reverse is occurring, as biologists and bioinformaticians look to computational techniques to solve a variety of biological problems. Among the most common biologically inspired techniques are genetic algorithms (GAs), which take the Darwinian concept of natural selection as the driving force behind systems for solving real world problems, including those in the bioinformatics domain. Herein, we provide an overview of genetic algorithms and survey some of the most recent applications of this approach to bioinformatics-based problems.
Affiliation(s)
- Timmy Manning
- Department of Computer Science; Cork Institute of Technology; Cork, Ireland
- Roy D Sleator
- Department of Biological Sciences; Cork Institute of Technology; Cork, Ireland
- Paul Walsh
- Department of Computer Science; Cork Institute of Technology; Cork, Ireland
|
35
|
Warnow T. Large-Scale Multiple Sequence Alignment and Phylogeny Estimation. MODELS AND ALGORITHMS FOR GENOME EVOLUTION 2013. [DOI: 10.1007/978-1-4471-5298-9_6] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/30/2022]
|
36
|
Wang M, Ye Y, Tang H. A de Bruijn graph approach to the quantification of closely-related genomes in a microbial community. J Comput Biol 2012; 19:814-25. [PMID: 22697249 DOI: 10.1089/cmb.2012.0058] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023] Open
Abstract
The wide applications of next-generation sequencing (NGS) technologies in metagenomics have raised many computational challenges. One of the essential problems in metagenomics is to estimate the taxonomic composition of a microbial community, which can be approached by mapping shotgun reads acquired from the community to previously characterized microbial genomes followed by quantity profiling of these species based on the number of mapped reads. This procedure, however, is not as trivial as it appears at first glance. A shotgun metagenomic dataset often contains DNA sequences from many closely-related microbial species (e.g., within the same genus) or strains (e.g., within the same species), thus it is often difficult to determine which species/strain a specific read is sampled from when it can be mapped to a common region shared by multiple genomes at high similarity. Furthermore, high genomic variations are observed among individual genomes within the same species, which are difficult to be differentiated from the inter-species variations during reads mapping. To address these issues, a commonly used approach is to quantify taxonomic distribution only at the genus level, based on the reads mapped to all species belonging to the same genus; alternatively, reads are mapped to a set of representative genomes, each selected to represent a different genus. Here, we introduce a novel approach to the quantity estimation of closely-related species within the same genus by mapping the reads to their genomes represented by a de Bruijn graph, in which the common genomic regions among them are collapsed. 
Using simulated and real metagenomic datasets, we show the de Bruijn graph approach has several advantages over existing methods, including (1) it avoids redundant mapping of shotgun reads to multiple copies of the common regions in different genomes, and (2) it leads to more accurate quantification for the closely-related species (and even for strains within the same species).
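The idea of collapsing common genomic regions can be sketched with a toy de Bruijn graph in which each k-mer is an edge between its (k-1)-mer prefix and suffix, and each edge records which genomes contain it. This is a minimal illustration only: the function name, the toy sequences and the choice of k are assumptions, and the authors' read-mapping and quantification steps are not reproduced.

```python
from collections import defaultdict


def build_de_bruijn(genomes: dict[str, str], k: int) -> dict[tuple[str, str], set[str]]:
    """Map each k-mer edge (prefix, suffix) to the set of genomes containing it."""
    edges: dict[tuple[str, str], set[str]] = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # A k-mer shared by several genomes lands on the same edge,
            # so the common region is stored once.
            edges[(kmer[:-1], kmer[1:])].add(name)
    return dict(edges)


# Hypothetical toy "strains" that share a prefix and differ at the 3' end.
genomes = {"strain_a": "ACGTACGGA", "strain_b": "ACGTACGTT"}
edges = build_de_bruijn(genomes, k=4)

shared = {edge for edge, owners in edges.items() if len(owners) > 1}
unique = {edge: next(iter(owners)) for edge, owners in edges.items() if len(owners) == 1}
```

In this toy case the shared edges represent the region common to both strains, while the unique edges are the only places where a read could be assigned to one strain unambiguously.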
Affiliation(s)
- Mingjie Wang
- School of Informatics and Computing, Indiana University, Bloomington, IN 47405, USA
|
37
|
oPOSSUM-3: advanced analysis of regulatory motif over-representation across genes or ChIP-Seq datasets. G3-GENES GENOMES GENETICS 2012; 2:987-1002. [PMID: 22973536 PMCID: PMC3429929 DOI: 10.1534/g3.112.003202] [Citation(s) in RCA: 230] [Impact Index Per Article: 17.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/27/2012] [Accepted: 06/11/2012] [Indexed: 01/12/2023]
Abstract
oPOSSUM-3 is a web-accessible software system for identification of over-represented transcription factor binding sites (TFBS) and TFBS families in either DNA sequences of co-expressed genes or sequences generated from high-throughput methods, such as ChIP-Seq. Validation of the system with known sets of co-regulated genes and published ChIP-Seq data demonstrates the capacity for oPOSSUM-3 to identify mediating transcription factors (TF) for co-regulated genes or co-recovered sequences. oPOSSUM-3 is available at http://opossum.cisreg.ca.
|
38
|
Spatio-temporal dynamics of endophyte diversity in the canopy of European ash (Fraxinus excelsior). Mycol Prog 2012. [DOI: 10.1007/s11557-012-0835-9] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
39
|
Sun H, Buhler JD. PhyLAT: a phylogenetic local alignment tool. Bioinformatics 2012; 28:1336-44. [PMID: 22492645 PMCID: PMC3465089 DOI: 10.1093/bioinformatics/bts158] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2011] [Revised: 03/29/2012] [Accepted: 03/30/2012] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The expansion of DNA sequencing capacity has enabled the sequencing of whole genomes from a number of related species. These genomes can be combined in a multiple alignment that provides useful information about the evolutionary history at each genomic locus. One area in which evolutionary information can productively be exploited is in aligning a new sequence to a database of existing, aligned genomes. However, existing high-throughput alignment tools are not designed to work effectively with multiple genome alignments. RESULTS We introduce PhyLAT, the phylogenetic local alignment tool, to compute local alignments of a query sequence against a fixed multiple-genome alignment of closely related species. PhyLAT uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. It combines a probabilistic approach to alignment with seeding and expansion heuristics to accelerate discovery of significant alignments. We provide evidence, using alignments of human chromosome 22 against a five-species alignment from the UCSC Genome Browser database, that PhyLAT's alignments are more accurate than those of other commonly used programs, including BLAST, POY, MAFFT, MUSCLE and CLUSTAL. PhyLAT also identifies more alignments in coding DNA than does pairwise alignment alone. Finally, our tool determines the evolutionary relationship of query sequences to the database more accurately than do POY, RAxML, EPA or pplacer.
Affiliation(s)
- Hongtao Sun
- Department of Computer Science and Engineering, Washington University, Saint Louis, MO 63130, USA.
|
40
|
Unterseher M, Petzold A, Schnittler M. Xerotolerant foliar endophytic fungi of Populus euphratica from the Tarim River basin, Central China are conspecific to endophytic ITS phylotypes of Populus tremula from temperate Europe. FUNGAL DIVERS 2012. [DOI: 10.1007/s13225-012-0167-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
|
41
|
Erb I, González-Vallinas JR, Bussotti G, Blanco E, Eyras E, Notredame C. Use of ChIP-Seq data for the design of a multiple promoter-alignment method. Nucleic Acids Res 2012; 40:e52. [PMID: 22230796 PMCID: PMC3326335 DOI: 10.1093/nar/gkr1292] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023] Open
Abstract
We address the challenge of regulatory sequence alignment with a new method, Pro-Coffee, a multiple aligner specifically designed for homologous promoter regions. Pro-Coffee uses a dinucleotide substitution matrix estimated on alignments of functional binding sites from TRANSFAC. We designed a validation framework using several thousand families of orthologous promoters. This dataset was used to evaluate the accuracy for predicting true human orthologs among their paralogs. We found that whereas other methods achieve on average 73.5% accuracy, and 77.6% when trained on that same dataset, the figure goes up to 80.4% for Pro-Coffee. We then applied a novel validation procedure based on multi-species ChIP-seq data. Trained and untrained methods were tested for their capacity to correctly align experimentally detected binding sites. Whereas the average number of correctly aligned sites for two transcription factors is 284 for default methods and 316 for trained methods, Pro-Coffee achieves 331, 16.5% above the default average. We find a high correlation between a method's performance when classifying orthologs and its ability to correctly align proven binding sites. Not only has this interesting biological consequences, it also allows us to conclude that any method that is trained on the ortholog data set will result in functionally more informative alignments.
Affiliation(s)
- Ionas Erb
- Bioinformatics and Genomics program, Centre for Genomic Regulation and UPF, 08003 Barcelona, Spain
|
42
|
Dissanayake R, Oshida T. The systematics of the dusky striped squirrel, Funambulus sublineatus (Waterhouse, 1838) (Rodentia: Sciuridae) and its relationships to Layard's squirrel, Funambulus layardi Blyth, 1849. J NAT HIST 2012. [DOI: 10.1080/00222933.2011.626126] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/15/2022]
|
43
|
Löytynoja A. Alignment methods: strategies, challenges, benchmarking, and comparative overview. Methods Mol Biol 2012; 855:203-35. [PMID: 22407710 DOI: 10.1007/978-1-61779-582-4_7] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2023]
Abstract
Comparative evolutionary analyses of molecular sequences are solely based on the identities and differences detected between homologous characters. Errors in this homology statement, that is errors in the alignment of the sequences, are likely to lead to errors in the downstream analyses. Sequence alignment and phylogenetic inference are tightly connected and many popular alignment programs use the phylogeny to divide the alignment problem into smaller tasks. They then neglect the phylogenetic tree, however, and produce alignments that are not evolutionarily meaningful. The use of phylogeny-aware methods reduces the error but the resulting alignments, with evolutionarily correct representation of homology, can challenge the existing practices and methods for viewing and visualising the sequences. The inter-dependency of alignment and phylogeny can be resolved by joint estimation of the two; methods based on statistical models allow for inferring the alignment parameters from the data and correctly take into account the uncertainty of the solution but remain computationally challenging. Widely used alignment methods are based on heuristic algorithms and unlikely to find globally optimal solutions. The whole concept of one correct alignment for the sequences is questionable, however, as there typically exist vast numbers of alternative, roughly equally good alignments that should also be considered. This uncertainty is hidden by many popular alignment programs and is rarely correctly taken into account in the downstream analyses. The quest for finding and improving the alignment solution is complicated by the lack of suitable measures of alignment goodness. The difficulty of comparing alternative solutions also affects benchmarks of alignment methods and the results strongly depend on the measure used. As the effects of alignment error cannot be predicted, comparing the alignments' performance in downstream analyses is recommended.
Affiliation(s)
- Ari Löytynoja
- European Bioinformatics Institute (EMBL), Hinxton, UK.
|
44
|
Abstract
Hepatitis C virus (HCV) is a Flavivirus with a positive-sense, single-stranded RNA genome of about 9,600 nucleotides. It is a major cause of liver disease, infecting almost 200 million people all over the world. Similarly to most RNA viruses, HCV displays very high levels of genetic diversity which have been used to differentiate six major genotypes and about 80 subtypes. Although the different genotypes and subtypes share basic biological and pathogenic features they differ in clinical outcomes, response to treatment and epidemiology. The first HCV recombinant strain, in which different genome segments derived from parentals of different genotypes, was described in St. Petersburg (Russia) in 2002. Since then, there have been only a few more than a dozen reports including descriptions of HCV recombinants at all levels: between genotypes, between subtypes of the same genotype and even between strains of the same subtype. Here, we review the literature considering the reasons underlying the difficulties for unequivocally establishing recombination in this virus along with the analytical methods necessary to do it. Finally, we analyze the potential consequences, especially in clinical practice, of HCV recombination in light of the coming new therapeutic approaches against this virus.
|
45
|
Kumar S, Filipski AJ, Battistuzzi FU, Kosakovsky Pond SL, Tamura K. Statistics and truth in phylogenomics. Mol Biol Evol 2011; 29:457-72. [PMID: 21873298 DOI: 10.1093/molbev/msr202] [Citation(s) in RCA: 176] [Impact Index Per Article: 12.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023] Open
Abstract
Phylogenomics refers to the inference of historical relationships among species using genome-scale sequence data and to the use of phylogenetic analysis to infer protein function in multigene families. With rapidly decreasing sequencing costs, phylogenomics is becoming synonymous with evolutionary analysis of genome-scale and taxonomically densely sampled data sets. In phylogenetic inference applications, this translates into very large data sets that yield evolutionary and functional inferences with extremely small variances and high statistical confidence (P value). However, reports of highly significant P values are increasing even for contrasting phylogenetic hypotheses depending on the evolutionary model and inference method used, making it difficult to establish true relationships. We argue that the assessment of the robustness of results to biological factors, that may systematically mislead (bias) the outcomes of statistical estimation, will be a key to avoiding incorrect phylogenomic inferences. In fact, there is a need for increased emphasis on the magnitude of differences (effect sizes) in addition to the P values of the statistical test of the null hypothesis. On the other hand, the amount of sequence data available will likely always remain inadequate for some phylogenomic applications, for example, those involving episodic positive selection at individual codon positions and in specific lineages. Again, a focus on effect size and biological relevance, rather than the P value, may be warranted. Here, we present a theoretical overview and discuss practical aspects of the interplay between effect sizes, bias, and P values as it relates to the statistical inference of evolutionary truth in phylogenomics.
Affiliation(s)
- Sudhir Kumar
- Center for Evolutionary Medicine and Informatics, Biodesign Institute, Arizona State University, Arizona, USA.
|
46
|
Te Velthuis AJW, Bagowski CP. Linking fold, function and phylogeny: a comparative genomics view on protein (domain) evolution. Curr Genomics 2011; 9:88-96. [PMID: 19440449 PMCID: PMC2674803 DOI: 10.2174/138920208784139537] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2008] [Revised: 03/20/2008] [Accepted: 03/25/2008] [Indexed: 11/22/2022] Open
Abstract
Domains are the building blocks of all globular proteins and present one of the most useful levels at which protein function can be understood. Through recombination and duplication of a limited set of domains, proteomes evolved and the collection of protein superfamilies in an organism formed. As such, the presence of a shared domain can be regarded as an indicator of similar function and evolutionary history, but it does not necessarily imply it since convergent evolution may give rise to similar gene functions as well as architectures. Through the wealth of sequences and annotation data brought about by genomics, evolutionary links can be sought for via homology relationships and comparative genomics, structural modeling and phylogenetics. The goal hereby is not only to predict the function of newly discovered proteins, but also to spell out their pathway of evolution and, possibly, identify their most likely origin. This can ultimately help to understand protein function and functional relationships of protein families. Additionally, through comparison with transcriptional data, evolutionary data can be linked to gene (and genome) activity and thus allow for the identification of common principles behind fast evolving proteins and relatively stable ones. In this review, we describe the basic principles of studying protein (domain) evolution and illustrate recent developments in molecular evolution and give valuable new insights in the field of comparative genomics. As an example, we include here molecular models of the multiple PDZ domain protein MUPP-1 and present a simple comparative genomic view on its structural course of evolution.
Affiliation(s)
- Aartjan J W Te Velthuis
- Institute of Biology, Department of Molecular Virology, Leiden University Medical Centre, Albinusdreef 2, 2333 ZA Leiden, The Netherlands
|
47
|
Kornobis E, Pálsson S, Sidorov DA, Holsinger JR, Kristjánsson BK. Molecular taxonomy and phylogenetic affinities of two groundwater amphipods, Crangonyx islandicus and Crymostygius thingvallensis, endemic to Iceland. Mol Phylogenet Evol 2010; 58:527-39. [PMID: 21195201 DOI: 10.1016/j.ympev.2010.12.010] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2010] [Revised: 12/17/2010] [Accepted: 12/20/2010] [Indexed: 11/16/2022]
Abstract
The amphipod superfamily Crangonyctoidea is distributed exclusively in freshwater habitats worldwide and is characteristic of subterranean habitats. Two members of the family, Crangonyx islandicus and Crymostygius thingvallensis, are endemic to Iceland and were recently discovered in groundwater underneath lava fields. Crangonyx islandicus belongs to a well-known genus with representatives both in North America and in Eurasia. Crymostygius thingvallensis defines a new family, Crymostygidae. Considering the incongruences observed recently between molecular and morphological taxonomy within subterranean species, we aim to assess the taxonomical status of the two species using molecular data. Additionally, the study contributes to the phylogenetic relationships among several crangonyctoidean species and specifically among species from four genera of the family Crangonyctidae. Given the available data we consider how the two Icelandic species could have colonized Iceland, by comparing geographical origin of the species with the phylogeny. Regions of two nuclear (18S and 28S rRNA) and two mitochondrial genes (16S rRNA and COI) for 20 different species of three families of the Crangonyctoidea were sequenced. Four different methods were used to align the RNA gene sequences and phylogenetic trees were constructed using bayesian and maximum likelihood analysis. The Crangonyctidae monophyly is supported. Crangonyx islandicus appeared more closely related to species from the Nearctic region. Crymostygius thingvallensis is clearly divergent from the other species of Crangonyctoidea. Crangonyx and Synurella genera are clearly polyphyletic and showed a geographical association, being split into a Nearctic and a Palearctic group. This research confirms that the studied species of Crangonyctidae share a common ancestor, which was probably widespread in the Northern hemisphere well before the break up of Laurasia. 
The Icelandic species are of particular interest since Iceland emerged after the separation of Eurasia and North America, is geographically isolated and has repeatedly been covered by glaciers during the Ice Age. The close relation between Crangonyx islandicus and North American species supports the hypothesis of the Trans-Atlantic land bridge between Greenland and Iceland which might have persisted until 6 million years ago. The status of the family Crymostygidae is supported, whereas Crangonyx islandicus might represent a new genus. As commonly observed in subterranean animals, molecular and morphological taxonomy led to different conclusions, probably due to convergent evolution of morphological traits. Our molecular analysis suggests that the family Crangonyctidae needs taxonomic revisions.
Affiliation(s)
- Etienne Kornobis
- Department of Biology, University of Iceland, Reykjavik, Iceland.
|
48
|
Unterseher M, Schnittler M. Species richness analysis and ITS rDNA phylogeny revealed the majority of cultivable foliar endophytes from beech (Fagus sylvatica). FUNGAL ECOL 2010. [DOI: 10.1016/j.funeco.2010.03.001] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
49
|
Sahraeian SME, Yoon BJ. PicXAA: greedy probabilistic construction of maximum expected accuracy alignment of multiple sequences. Nucleic Acids Res 2010; 38:4917-28. [PMID: 20413579 PMCID: PMC2926610 DOI: 10.1093/nar/gkq255] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2009] [Revised: 03/25/2010] [Accepted: 03/26/2010] [Indexed: 11/13/2022] Open
Abstract
Accurate tools for multiple sequence alignment (MSA) are essential for comparative studies of the function and structure of biological sequences. However, it is very challenging to develop a computationally efficient algorithm that can consistently predict accurate alignments for various types of sequence sets. In this article, we introduce PicXAA (Probabilistic Maximum Accuracy Alignment), a probabilistic non-progressive alignment algorithm that aims to find protein alignments with maximum expected accuracy. PicXAA greedily builds up the multiple alignment from sequence regions with high local similarities, thereby yielding an accurate global alignment that effectively grasps the local similarities among sequences. Evaluations on several widely used benchmark sets show that PicXAA constantly yields accurate alignment results on a wide range of reference sets, with especially remarkable improvements over other leading algorithms on sequence sets with local similarities. PicXAA source code is freely available at: http://www.ece.tamu.edu/~bjyoon/picxaa/.
Affiliation(s)
- Byung-Jun Yoon
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
|
50
|
Bagowski CP, Bruins W, te Velthuis AJ. The nature of protein domain evolution: shaping the interaction network. Curr Genomics 2010; 11:368-76. [PMID: 21286315 PMCID: PMC2945003 DOI: 10.2174/138920210791616725] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2010] [Revised: 06/04/2010] [Accepted: 06/13/2010] [Indexed: 11/30/2022] Open
Abstract
The proteomes that make up the collection of proteins in contemporary organisms evolved through recombination and duplication of a limited set of domains. These protein domains are essentially the main components of globular proteins and are the most principal level at which protein function and protein interactions can be understood. An important aspect of domain evolution is their atomic structure and biochemical function, which are both specified by the information in the amino acid sequence. Changes in this information may bring about new folds, functions and protein architectures. With the present and still increasing wealth of sequences and annotation data brought about by genomics, new evolutionary relationships are constantly being revealed, unknown structures modeled and phylogenies inferred. Such investigations not only help predict the function of newly discovered proteins, but also assist in mapping unforeseen pathways of evolution and reveal crucial, co-evolving inter- and intra-molecular interactions. In turn this will help us describe how protein domains shaped cellular interaction networks and the dynamics with which they are regulated in the cell. Additionally, these studies can be used for the design of new and optimized protein domains for therapy. In this review, we aim to describe the basic concepts of protein domain evolution and illustrate recent developments in molecular evolution that have provided valuable new insights in the field of comparative genomics and protein interaction networks.
Affiliation(s)
- Christoph P Bagowski
- German University Cairo, Faculty of Pharmacy and Biotechnology, New Cairo City, Egypt
- Wouter Bruins
- Institute of Biology, Leiden University, 2333 AL Leiden, The Netherlands
- Aartjan J.W te Velthuis
- Department of Medical Microbiology, Molecular Virology Laboratory, Leiden University Medical Center, Albinusdreef 2, 2333 ZA Leiden, The Netherlands
- Department of Bionanoscience, Delft University of Technology, Lorentzweg 1, 2628 CJ, Delft, The Netherlands
|