1
Fang L, Liu T, Li M, Dong X, Han Y, Xu C, Li S, Zhang J, He X, Zhou Q, Luo D, Liu Z. MODMS: a multi-omics database for facilitating biological studies on alfalfa (Medicago sativa L.). HORTICULTURE RESEARCH 2024; 11:uhad245. [PMID: 38239810 PMCID: PMC10794946 DOI: 10.1093/hr/uhad245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Accepted: 11/13/2023] [Indexed: 01/22/2024]
Abstract
Alfalfa (Medicago sativa L.) is a globally important forage crop. It also serves as a vegetable and medicinal herb because of its excellent nutritional quality and significant economic value. Multi-omics data on alfalfa continue to accumulate owing to recent advances in high-throughput techniques, and integrating this information holds great potential for expediting genetic research and facilitating advances in alfalfa agronomic traits. Therefore, we developed a comprehensive database named MODMS (multi-omics database of M. sativa) that incorporates multiple reference genomes, annotations, comparative genomics, transcriptomes, high-quality genomic variants, proteomics, and metabolomics. This report describes our continuously evolving database, which provides researchers with several convenient tools and extensive omics data resources, facilitating the expansion of alfalfa research. Further details regarding the MODMS database are available at https://modms.lzu.edu.cn/.
Affiliation(s)
- Longfa Fang
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou 730020, China
- Tao Liu
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou 730020, China
- Mingyu Li
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou 730020, China
- XueMing Dong
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou 730020, China
- Yuling Han
- Tropical Crops Genetic Resources Institute, Chinese Academy of Tropical Agricultural Sciences, Haikou 571101, China
- Congzhuo Xu
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou 730020, China
- Siqi Li
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou 730020, China
- Jia Zhang
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou 730020, China
- Xiaojuan He
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou 730020, China
- Qiang Zhou
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou 730020, China
- Dong Luo
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou 730020, China
- Zhipeng Liu
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystems, College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou 730020, China
2
Ayad LAK, Chikhi R, Pissis SP. Seedability: optimizing alignment parameters for sensitive sequence comparison. BIOINFORMATICS ADVANCES 2023; 3:vbad108. [PMID: 37621456 PMCID: PMC10444664 DOI: 10.1093/bioadv/vbad108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Revised: 08/02/2023] [Accepted: 08/10/2023] [Indexed: 08/26/2023]
Abstract
Motivation Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences. Results The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments. Availability and implementation https://github.com/lorrainea/Seedability (distributed under GPL v3.0).
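To illustrate why the choice of k matters for seed-based alignment, the following minimal Python sketch counts the distinct exact k-mers shared by two short sequences for several values of k. It is not the Seedability algorithm and does not use Minimap2; the function names `kmers` and `shared_seed_counts` and the toy sequences are assumptions made for illustration only.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_seed_counts(a, b, k_values):
    """Map each k to the number of distinct k-mers present in both a and b."""
    return {k: len(kmers(a, k) & kmers(b, k)) for k in k_values}


# Toy sequences (hypothetical) that differ by one substitution in the middle.
a = "ACGTACGTGACCTGA"
b = "ACGTACGAGACCTGA"
counts = shared_seed_counts(a, b, [3, 5, 8])
print(counts)
```

In this toy pair every 8-mer spans the mismatch, so no seed of length 8 is shared, while shorter k-mers still provide anchors. This is the kind of sensitivity loss on short, divergent sequences that motivates tuning k.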
Affiliation(s)
- Lorraine A K Ayad
- Department of Computer Science, Brunel University London, London UB8 3PH, UK
- Rayan Chikhi
- G5 Sequence Bioinformatics, Institut Pasteur, Université Paris Cité, 75015 Paris, France
- Solon P Pissis
- Networks & Optimization, CWI, 1098 XG Amsterdam, The Netherlands
- Department of Computer Science, Vrije Universiteit, 1081 HV Amsterdam, The Netherlands
3
Zhang Z, Cui M, Chen P, Li J, Mao Z, Mao Y, Li Z, Guo Q, Wang C, Liao X, Liu H. Insight into the phylogeny and metabolic divergence of Monascus species (M. pilosus, M. ruber, and M. purpureus) at the genome level. Front Microbiol 2023; 14:1199144. [PMID: 37303795 PMCID: PMC10249731 DOI: 10.3389/fmicb.2023.1199144] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/03/2023] [Accepted: 05/09/2023] [Indexed: 06/13/2023]
Abstract
Background Species of the genus Monascus are economically important and widely used in the production of food colorants and monacolin K. However, they have also been known to produce the mycotoxin citrinin. Currently, taxonomic knowledge of this species at the genome level is insufficient. Methods This study presents genomic similarity analyses through the analysis of the average nucleic acid identity of the genomic sequence and the whole genome alignment. Subsequently, the study constructed a pangenome of Monascus by reannotating all the genomes and identifying a total of 9,539 orthologous gene families. Two phylogenetic trees were constructed based on 4,589 single copy orthologous protein sequences and all the 5,565 orthologous proteins, respectively. In addition, carbohydrate active enzymes, secretome, allergic proteins, as well as secondary metabolite gene clusters were compared among the included 15 Monascus strains. Results The results clearly revealed a high homology between M. pilosus and M. ruber, and their distant relationship with M. purpureus. Accordingly, all the included 15 Monascus strains should be classified into two distinctly evolutionary clades, namely the M. purpureus clade and the M. pilosus-M. ruber clade. Moreover, gene ontology enrichment showed that the M. pilosus-M. ruber clade had more orthologous genes involved with environmental adaptation than the M. purpureus clade. Compared to Aspergillus oryzae, all the Monascus species had a substantial gene loss of carbohydrate active enzymes. Potential allergenic and fungal virulence factor proteins were also found in the secretome of Monascus. Furthermore, this study identified the pigment synthesis gene clusters present in all included genomes, but with multiple nonessential genes inserted in the gene cluster of M. pilosus and M. ruber compared to M. purpureus. The citrinin gene cluster was found to be intact and highly conserved only among M. purpureus genomes. 
The monacolin K gene cluster was found only in the genomes of M. pilosus and M. ruber, but the sequence was more conserved in M. ruber. Conclusion This study provides a paradigm for phylogenetic analysis of the genus Monascus, and it is believed that this report will lead to a better understanding of these food microorganisms in terms of classification, metabolic differentiation, and safety.
Affiliation(s)
- Zhiyu Zhang
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Tianjin, China
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Ministry of Education, Tianjin, China
- Mengfei Cui
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Tianjin, China
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Ministry of Education, Tianjin, China
- Panting Chen
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Tianjin, China
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Ministry of Education, Tianjin, China
- Juxing Li
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Tianjin, China
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Ministry of Education, Tianjin, China
- Zhitao Mao
- Biodesign Center, Key Laboratory of Engineering Biology for Low-Carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Yufeng Mao
- Biodesign Center, Key Laboratory of Engineering Biology for Low-Carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Zhenjing Li
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Tianjin, China
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Ministry of Education, Tianjin, China
- Qingbin Guo
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Tianjin, China
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Ministry of Education, Tianjin, China
- Changlu Wang
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Tianjin, China
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Ministry of Education, Tianjin, China
- Xiaoping Liao
- Biodesign Center, Key Laboratory of Engineering Biology for Low-Carbon Manufacturing, Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, China
- Haihe Laboratory of Synthetic Biology, Tianjin, China
- Huanhuan Liu
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Tianjin, China
- State Key Laboratory of Food Nutrition and Safety, Tianjin University of Science and Technology, Ministry of Education, Tianjin, China
4
Abstract
The distinction between orthologs and paralogs, genes that started diverging by speciation versus duplication, is relevant in a wide range of contexts, most notably phylogenetic tree inference and protein function annotation. In this chapter, we provide an overview of the methods used to infer orthology and paralogy. We survey both graph-based approaches (and their various grouping strategies) and tree-based approaches, which solve the more general problem of gene/species tree reconciliation. We discuss conceptual differences among the various orthology inference methods and databases and examine the difficult issue of verifying and benchmarking orthology predictions. Finally, we review typical applications of orthologous genes, groups, and reconciled trees and conclude with thoughts on future methodological developments.
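As a concrete illustration of a graph-based approach, the sketch below implements reciprocal best hits, one widely used pairwise heuristic for proposing ortholog pairs between two genomes. The abstract does not single out this method, so treat it as an assumed example; the gene names, the score values, and the function names `best_hits` and `reciprocal_best_hits` are hypothetical.

```python
def best_hits(scores):
    """For each query gene, return the target gene with the highest score.

    scores maps (query, target) pairs to a similarity score.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_to_b, b_to_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(a_to_b)
    best_ba = best_hits(b_to_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical similarity scores between genes of genome A and genome B.
a_to_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 90}
b_to_a = {("b1", "a1"): 195, ("b2", "a1"): 160, ("b2", "a2"): 85}
pairs = reciprocal_best_hits(a_to_b, b_to_a)
print(pairs)
```

Here a1 and b1 select each other and form a candidate ortholog pair, while a2 is left unpaired because b2 scores higher against a1. Cases like this are where grouping strategies and tree-based reconciliation become relevant.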
5
Khelik K, Lagesen K, Sandve GK, Rognes T, Nederbragt AJ. NucDiff: in-depth characterization and annotation of differences between two sets of DNA sequences. BMC Bioinformatics 2017; 18:338. [PMID: 28701187 PMCID: PMC5508607 DOI: 10.1186/s12859-017-1748-z] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 09/23/2016] [Accepted: 07/04/2017] [Indexed: 12/05/2022]
Abstract
Background Comparing sets of sequences is a situation frequently encountered in bioinformatics, examples being comparing an assembly to a reference genome, or two genomes to each other. The purpose of the comparison is usually to find where the two sets differ, e.g. to find where a subsequence is repeated or deleted, or where insertions have been introduced. Such comparisons can be done using whole-genome alignments. Several tools for making such alignments exist, but none of them 1) provides detailed information about the types and locations of all differences between the two sets of sequences, 2) enables visualisation of alignment results at different levels of detail, and 3) carefully takes genomic repeats into consideration. Results We here present NucDiff, a tool aimed at locating and categorizing differences between two sets of closely related DNA sequences. NucDiff is able to deal with very fragmented genomes, repeated sequences, and various local differences and structural rearrangements. NucDiff determines differences by a rigorous analysis of alignment results obtained by the NUCmer, delta-filter and show-snps programs in the MUMmer sequence alignment package. All differences found are categorized according to a carefully defined classification scheme covering all possible differences between two sequences. Information about the differences is made available as GFF3 files, thus enabling visualisation using genome browsers as well as usage of the results as a component in an analysis pipeline. NucDiff was tested with varying parameters for the alignment step and compared with existing alternatives, called QUAST and dnadiff. Conclusions We have developed a whole genome alignment difference classification scheme together with the program NucDiff for finding such differences. The proposed classification scheme is comprehensive and can be used by other tools. 
NucDiff performs comparably to QUAST and dnadiff but gives much more detailed results that can easily be visualized. NucDiff is freely available on https://github.com/uio-cels/NucDiff under the MPL license.
Affiliation(s)
- Ksenia Khelik
- Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Karin Lagesen
- Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Norwegian Veterinary Institute, PO Box 750 Sentrum, 0106, Oslo, Norway
- Geir Kjetil Sandve
- Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Torbjørn Rognes
- Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Department of Microbiology, Oslo University Hospital, Rikshospitalet, PO Box 4950 Nydalen, 0424, Oslo, Norway
- Alexander Johan Nederbragt
- Biomedical Informatics Research Group, Department of Informatics, University of Oslo, PO Box 1080, 0316, Oslo, Norway
- Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, PO Box 1066 Blindern, 0316, Oslo, Norway
6
Goryunov DV, Nagaev BE, Nikolaev MY, Alexeevski AV, Troitsky AV. Moss Phylogeny Reconstruction Using Nucleotide Pangenome of Complete Mitogenome Sequences. BIOCHEMISTRY (MOSCOW) 2016; 80:1522-7. [PMID: 26615445 DOI: 10.1134/s0006297915110152] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 01/25/2023]
Abstract
Stability of composition and sequence of genes was shown earlier in 13 mitochondrial genomes of mosses (Rensing, S. A., et al. (2008) Science, 319, 64-69). It is of interest to study the evolution of mitochondrial genomes not only at the gene level, but also on the level of nucleotide sequences. To do this, we have constructed a "nucleotide pangenome" for mitochondrial genomes of 24 moss species. The nucleotide pangenome is a set of aligned nucleotide sequences of orthologous genome fragments covering the totality of all genomes. The nucleotide pangenome was constructed using specially developed new software, NPG-explorer (NPGe). The stable part of the mitochondrial genome (232 stable blocks) is shown to be, on average, 45% of its length. In the joint alignment of stable blocks, 82% of positions are conserved. The phylogenetic tree constructed with the NPGe program is in good correlation with other phylogenetic reconstructions. With the NPGe program, 30 blocks have been identified with repeats no shorter than 50 bp. The maximal length of a block with repeats is 140 bp. Duplications in the mitochondrial genomes of mosses are rare. On average, the genome contains about 500 bp in large duplications. The total length of insertions and deletions was determined in each genome. The losses and gains of DNA regions are rather active in mitochondrial genomes of mosses, and such rearrangements presumably can be used as additional markers in the reconstruction of phylogeny.
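The figure of 82% conserved positions refers to columns of the joint alignment of stable blocks. As a rough sketch of how such a fraction can be computed, the Python below counts alignment columns in which every sequence carries the same non-gap base. This is an assumed, simplified definition of "conserved" and is not taken from NPGe; the function name `conserved_fraction` and the toy block are hypothetical.

```python
def conserved_fraction(alignment):
    """Fraction of columns where all rows share the same non-gap character.

    alignment is a list of equal-length strings; '-' marks a gap.
    """
    if not alignment or not alignment[0]:
        raise ValueError("alignment must contain at least one non-empty row")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    conserved = sum(
        1
        for column in zip(*alignment)
        if column[0] != "-" and len(set(column)) == 1
    )
    return conserved / length


# Toy aligned block (hypothetical): one substitution column, one gap column.
block = ["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC", "ACGTAC-TAC"]
fraction = conserved_fraction(block)
print(fraction)
```

Treating any column with a gap as non-conserved is a design choice made here for simplicity; a tool could reasonably score gapped columns differently.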
Affiliation(s)
- D V Goryunov
- Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, 119991, Russia.
7
Abstract
The number of large-scale genomics projects is increasing due to the availability of affordable high-throughput sequencing (HTS) technologies. The use of HTS for bacterial infectious disease research is attractive because one whole-genome sequencing (WGS) run can replace multiple assays for bacterial typing, molecular epidemiology investigations, and more in-depth pathogenomic studies. The computational resources and bioinformatics expertise required to accommodate and analyze the large amounts of data pose new challenges for researchers embarking on genomics projects for the first time. Here, we present a comprehensive overview of a bacterial genomics project from beginning to end, with a particular focus on the planning and computational requirements for HTS data, and provide a general understanding of the analytical concepts to develop a workflow that will meet the objectives and goals of HTS projects.
8
Sharma V, Elghafari A, Hiller M. Coding exon-structure aware realigner (CESAR) utilizes genome alignments for accurate comparative gene annotation. Nucleic Acids Res 2016; 44:e103. [PMID: 27016733 PMCID: PMC4914097 DOI: 10.1093/nar/gkw210] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Received: 02/02/2016] [Revised: 03/04/2016] [Accepted: 03/18/2016] [Indexed: 12/03/2022]
Abstract
Identifying coding genes is an essential step in genome annotation. Here, we utilize existing whole genome alignments to detect conserved coding exons and then map gene annotations from one genome to many aligned genomes. We show that genome alignments contain thousands of spurious frameshifts and splice site mutations in exons that are truly conserved. To overcome these limitations, we have developed CESAR (Coding Exon-Structure Aware Realigner) that realigns coding exons, while considering reading frame and splice sites of each exon. CESAR effectively avoids spurious frameshifts in conserved genes and detects 91% of shifted splice sites. This results in the identification of thousands of additional conserved exons and 99% of the exons that lack inactivating mutations match real exons. Finally, to demonstrate the potential of using CESAR for comparative gene annotation, we applied it to 188 788 exons of 19 865 human genes to annotate human genes in 99 other vertebrates. These comparative gene annotations are available as a resource (http://bds.mpi-cbg.de/hillerlab/CESAR/). CESAR (https://github.com/hillerlab/CESAR/) can readily be applied to other alignments to accurately annotate coding genes in many other vertebrate and invertebrate genomes.
Affiliation(s)
- Virag Sharma
- Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstr. 108, 01307 Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Nöthnitzer Str. 38, 01187 Dresden, Germany
- Anas Elghafari
- Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstr. 108, 01307 Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Nöthnitzer Str. 38, 01187 Dresden, Germany
- Technical University, 01069 Dresden, Germany
- Michael Hiller
- Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstr. 108, 01307 Dresden, Germany
- Max Planck Institute for the Physics of Complex Systems, Nöthnitzer Str. 38, 01187 Dresden, Germany
9
Frith MC, Kawaguchi R. Split-alignment of genomes finds orthologies more accurately. Genome Biol 2015; 16:106. [PMID: 25994148 PMCID: PMC4464727 DOI: 10.1186/s13059-015-0670-9] [Citation(s) in RCA: 65] [Impact Index Per Article: 6.5] [Received: 04/06/2015] [Accepted: 05/08/2015] [Indexed: 04/29/2023]
Abstract
We present a new pair-wise genome alignment method, based on a simple concept of finding an optimal set of local alignments. It gains accuracy by not masking repeats, and by using a statistical model to quantify the (un)ambiguity of each alignment part. Compared to previous animal genome alignments, it aligns thousands of locations differently and with much higher similarity, strongly suggesting that the previous alignments are non-orthologous. The previous methods suffer from an overly-strong assumption of long un-rearranged blocks. The new alignments should help find interesting and unusual features, such as fast-evolving elements and micro-rearrangements, which are confounded by alignment errors.
Affiliation(s)
- Martin C Frith
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan.
- Risa Kawaguchi
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan.
- Department of Computational Biology, Faculty of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba, 277-8561, Japan.
10
11
Kehr B, Trappe K, Holtgrewe M, Reinert K. Genome alignment with graph data structures: a comparison. BMC Bioinformatics 2014; 15:99. [PMID: 24712884 PMCID: PMC4020321 DOI: 10.1186/1471-2105-15-99] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 02/27/2013] [Accepted: 03/28/2014] [Indexed: 12/21/2022]
Abstract
Background Recent advances in rapid, low-cost sequencing have opened up the opportunity to study complete genome sequences. The computational approach of multiple genome alignment allows investigation of evolutionarily related genomes in an integrated fashion, providing a basis for downstream analyses such as rearrangement studies and phylogenetic inference. Graphs have proven to be a powerful tool for coping with the complexity of genome-scale sequence alignments. The potential of graphs to intuitively represent all aspects of genome alignments led to the development of graph-based approaches for genome alignment. These approaches construct a graph from a set of local alignments, and derive a genome alignment through identification and removal of graph substructures that indicate errors in the alignment. Results We compare the structures of commonly used graphs in terms of their abilities to represent alignment information. We describe how the graphs can be transformed into each other, and identify and classify graph substructures common to one or more graphs. Based on previous approaches, we compile a list of modifications that remove these substructures. Conclusion We show that crucial pieces of alignment information, associated with inversions and duplications, are not visible in the structure of all graphs. If we neglect vertex or edge labels, the graphs differ in their information content. Still, many ideas are shared among all graph-based approaches. Based on these findings, we outline a conceptual framework for graph-based genome alignment that can assist in the development of future genome alignment tools.
Affiliation(s)
- Birte Kehr
- Department of Computer Science, Freie Universität Berlin, Takustr. 9, 14195 Berlin, Germany.
12
Harris SR, Okoro CK. Whole-Genome Sequencing for Rapid and Accurate Identification of Bacterial Transmission Pathways. J Microbiol Methods 2014. [DOI: 10.1016/bs.mim.2014.07.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/30/2022]
13
Abstract
SATé is a method for estimating multiple sequence alignments and trees that has been shown to produce highly accurate results for datasets with large numbers of sequences. Running SATé using its default settings is very simple, but improved accuracy can be obtained by modifying its algorithmic parameters. We provide a detailed introduction to the algorithmic approach used by SATé, and instructions for running a SATé analysis using the GUI under default settings. We also provide a discussion of how to modify these settings to obtain improved results, and how to use SATé in a phylogenetic analysis pipeline.
Affiliation(s)
- Kevin Liu
- Department of Computer Science, Rice University, Houston, TX, USA
14
Paris M, Kaplan T, Li XY, Villalta JE, Lott SE, Eisen MB. Extensive divergence of transcription factor binding in Drosophila embryos with highly conserved gene expression. PLoS Genet 2013; 9:e1003748. [PMID: 24068946 PMCID: PMC3772039 DOI: 10.1371/journal.pgen.1003748] [Citation(s) in RCA: 74] [Impact Index Per Article: 6.2] [Received: 03/04/2013] [Accepted: 07/10/2013] [Indexed: 11/19/2022]
Abstract
To better characterize how variation in regulatory sequences drives divergence in gene expression, we undertook a systematic study of transcription factor binding and gene expression in blastoderm embryos of four species, which sample much of the diversity in the 40 million-year old genus Drosophila: D. melanogaster, D. yakuba, D. pseudoobscura and D. virilis. We compared gene expression, measured by mRNA-seq, to the genome-wide binding, measured by ChIP-seq, of four transcription factors involved in early anterior-posterior patterning. We found that mRNA levels are much better conserved than individual transcription factor binding events, and that changes in a gene's expression were poorly explained by changes in adjacent transcription factor binding. However, highly bound sites, sites in regions bound by multiple factors and sites near genes are conserved more frequently than other binding, suggesting that a considerable amount of transcription factor binding is weakly or non-functional and not subject to purifying selection.
Affiliation(s)
- Mathilde Paris
- Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, California, United States of America
- Tommy Kaplan
- Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, California, United States of America
- School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel
- Xiao Yong Li
- Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, California, United States of America
- Howard Hughes Medical Institute, University of California Berkeley, Berkeley, California, United States of America
- Susan E. Lott
- Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, California, United States of America
- Department of Evolution and Ecology, University of California, Davis, California, United States of America
- Michael B. Eisen
- Department of Molecular and Cell Biology, University of California Berkeley, Berkeley, California, United States of America
- School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel
- Howard Hughes Medical Institute, University of California Berkeley, Berkeley, California, United States of America
15
Abstract
The distinction between orthologs and paralogs, genes that started diverging by speciation versus duplication, is relevant in a wide range of contexts, most notably phylogenetic tree inference and protein function annotation. In this chapter, we provide an overview of the methods used to infer orthology and paralogy. We survey both graph-based approaches (and their various grouping strategies) and tree-based approaches, which solve the more general problem of gene/species tree reconciliation. We discuss conceptual differences among the various orthology inference methods and databases, and examine the difficult issue of verifying and benchmarking orthology predictions. Finally, we review typical applications of orthologous genes, groups, and reconciled trees and conclude with thoughts on future methodological developments.
16
Hatje K, Kollmar M. A phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method. FRONTIERS IN PLANT SCIENCE 2012; 3:192. [PMID: 22952468 PMCID: PMC3429886 DOI: 10.3389/fpls.2012.00192] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Received: 05/25/2012] [Accepted: 08/06/2012] [Indexed: 05/06/2023]
Abstract
Phylogenetic analyses reveal the evolutionary derivation of species. A phylogenetic tree can be inferred from multiple sequence alignments of proteins or genes. The alignment of whole genome sequences of higher eukaryotes is a computationally intensive and ambitious task, as is the computation of phylogenetic trees based on these alignments. To overcome these limitations, we here used an alignment-free method to compare genomes of the Brassicales clade. For each nucleotide sequence a Chaos Game Representation (CGR) can be computed, which represents each nucleotide of the sequence as a point in a square defined by the four nucleotides as vertices. Each CGR is therefore a unique fingerprint of the underlying sequence. If the CGRs are divided by grid lines, each grid square denotes the occurrence of oligonucleotides of a specific length in the sequence (Frequency Chaos Game Representation, FCGR). Here, we used distance measures between FCGRs to infer phylogenetic trees of Brassicales species. Three types of data were analyzed because of their different characteristics: (A) Whole genome assemblies as far as available for species belonging to the Malvidae taxon. (B) EST data of species of the Brassicales clade. (C) Mitochondrial genomes of the Rosids branch, a supergroup of the Malvidae. The trees reconstructed based on the Euclidean distance method are in general agreement with single gene trees. The Fitch-Margoliash and Neighbor joining algorithms resulted in similar to identical trees. Here, for the first time we have applied the bootstrap re-sampling concept to trees based on FCGRs to determine the support of the branchings. FCGRs have the advantage that they are fast to calculate, and can be used as additional information to alignment based data and morphological characteristics to improve the phylogenetic classification of species in ambiguous cases.
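The FCGR idea described in this abstract can be sketched in a few lines of Python. The sketch builds a 2^k by 2^k grid of relative k-mer frequencies and compares two grids with the Euclidean distance named in the abstract. The corner assignment for A, C, G and T, the normalization to relative frequencies, and the function names `fcgr` and `euclidean` are assumptions for illustration, not details taken from the paper. The grid cell is derived with integer bit operations rather than by iterating floating-point midpoints, which gives the same cell for each k-mer while avoiding rounding problems on long homopolymer runs.

```python
import math

# Corner of the unit square assigned to each nucleotide (assumed convention).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k):
    """Return a 2^k x 2^k matrix of relative k-mer frequencies."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if any(base not in CORNERS for base in word):
            continue  # skip k-mers containing ambiguous symbols such as N
        x = y = 0
        # In a CGR the most recent nucleotide selects the coarsest quadrant,
        # so the last base of the k-mer supplies the highest bit.
        for j, base in enumerate(word):
            cx, cy = CORNERS[base]
            x |= cx << j
            y |= cy << j
        grid[y][x] += 1
        total += 1
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [[count / total for count in row] for row in grid]


def euclidean(m1, m2):
    """Euclidean distance between two equally sized FCGR matrices."""
    return math.sqrt(
        sum((a - b) ** 2 for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))
    )


# Toy sequences (hypothetical): s2 is s1 with one substitution, s3 is unrelated.
s1 = "ACGTACGTACGTACGT"
s2 = "ACGTACGTACGAACGT"
s3 = "TTTTGGGGTTTTGGGG"
d12 = euclidean(fcgr(s1, 2), fcgr(s2, 2))
d13 = euclidean(fcgr(s1, 2), fcgr(s3, 2))
print(d12, d13)
```

In this toy example s2 ends up closer to s1 than s3 does. A matrix of such pairwise distances is the kind of input that distance-based tree methods such as Neighbor joining take.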
Affiliation(s)
- Klas Hatje
- Abteilung NMR-Basierte Strukturbiologie, Max-Planck-Institut für Biophysikalische Chemie, Göttingen, Germany
- Martin Kollmar
- Abteilung NMR-Basierte Strukturbiologie, Max-Planck-Institut für Biophysikalische Chemie, Göttingen, Germany
- *Correspondence: Martin Kollmar, Abteilung NMR-Basierte Strukturbiologie, Max-Planck-Institut für Biophysikalische Chemie, Am Fassberg 11, D-37077 Göttingen, Germany. e-mail:
17
Löytynoja A. Alignment methods: strategies, challenges, benchmarking, and comparative overview. Methods Mol Biol 2012; 855:203-35. [PMID: 22407710 DOI: 10.1007/978-1-61779-582-4_7] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 01/27/2023]
Abstract
Comparative evolutionary analyses of molecular sequences are solely based on the identities and differences detected between homologous characters. Errors in this homology statement, that is, errors in the alignment of the sequences, are likely to lead to errors in the downstream analyses. Sequence alignment and phylogenetic inference are tightly connected, and many popular alignment programs use the phylogeny to divide the alignment problem into smaller tasks. They then neglect the phylogenetic tree, however, and produce alignments that are not evolutionarily meaningful. The use of phylogeny-aware methods reduces the error, but the resulting alignments, with evolutionarily correct representation of homology, can challenge the existing practices and methods for viewing and visualising the sequences. The inter-dependency of alignment and phylogeny can be resolved by joint estimation of the two; methods based on statistical models allow for inferring the alignment parameters from the data and correctly take into account the uncertainty of the solution but remain computationally challenging. Widely used alignment methods are based on heuristic algorithms and are unlikely to find globally optimal solutions. The whole concept of one correct alignment for the sequences is questionable, however, as there typically exist vast numbers of alternative, roughly equally good alignments that should also be considered. This uncertainty is hidden by many popular alignment programs and is rarely correctly taken into account in the downstream analyses. The quest for finding and improving the alignment solution is complicated by the lack of suitable measures of alignment goodness. The difficulty of comparing alternative solutions also affects benchmarks of alignment methods, and the results strongly depend on the measure used. As the effects of alignment error cannot be predicted, comparing the alignments' performance in downstream analyses is recommended.
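The point that many alternative alignments can be equally good can be made concrete with a small Python sketch. It is an illustration only, not a method taken from the chapter. It assumes pairwise global alignment with a simple linear gap penalty (Needleman-Wunsch style scoring), and the function name `count_optimal_alignments` and the default scores are hypothetical. The function returns the best score together with the number of distinct alignments that reach it.

```python
def count_optimal_alignments(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best_score, number_of_co_optimal_global_alignments).

    Standard dynamic programming over prefixes of a and b, with a second
    table that counts how many alignments achieve the best score in each
    cell. Scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    ways = [[0] * (m + 1) for _ in range(n + 1)]
    ways[0][0] = 1

    # Aligning a prefix against the empty string: gaps only, one way.
    for i in range(1, n + 1):
        score[i][0] = i * gap
        ways[i][0] = 1
    for j in range(1, m + 1):
        score[0][j] = j * gap
        ways[0][j] = 1

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + substitution, ways[i - 1][j - 1]),
                (score[i - 1][j] + gap, ways[i - 1][j]),  # gap in b
                (score[i][j - 1] + gap, ways[i][j - 1]),  # gap in a
            ]
            best = max(s for s, _ in candidates)
            score[i][j] = best
            # Sum the counts of every predecessor that ties for the best.
            ways[i][j] = sum(w for s, w in candidates if s == best)

    return score[n][m], ways[n][m]
```

Under these assumed scores, `count_optimal_alignments("AT", "TA")` returns `(-1, 2)`: two different gapped alignments tie for the best score, and a program that reports only one of them hides that ambiguity.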
Affiliation(s)
- Ari Löytynoja
- European Bioinformatics Institute (EMBL), Hinxton, UK.