1. Ijaz AZ, Ali RH, Sarwar A, Ali Khan T, Baig MM. Importance of Synteny in Homology Inference. 2022 17TH INTERNATIONAL CONFERENCE ON EMERGING TECHNOLOGIES (ICET) 2022. [DOI: 10.1109/icet56601.2022.10004649] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
Affiliation(s)
- Ali Zeeshan Ijaz
  - AI Research Group, GIK Institute of Engg. Sciences & Tech., Faculty of Computer Science & Engg., Topi, Khyber Pakhtunkhwa, Pakistan
- Raja Hashim Ali
  - AI Research Group, GIK Institute of Engg. Sciences & Tech., Faculty of Computer Science & Engg., Topi, Khyber Pakhtunkhwa, Pakistan
- Asima Sarwar
  - AI Research Group, GIK Institute of Engg. Sciences & Tech., Faculty of Computer Science & Engg., Topi, Khyber Pakhtunkhwa, Pakistan
- Talha Ali Khan
  - Univ. of Europe of Applied Sciences, Dept. of Tech & Software Engg., Berlin, Germany
- Muhammad Muneeb Baig
  - AI Research Group, GIK Institute of Engg. Sciences & Tech., Faculty of Computer Science & Engg., Topi, Khyber Pakhtunkhwa, Pakistan
2. Gauthier CH, Cresawn SG, Hatfull GF. PhaMMseqs: a new pipeline for constructing phage gene phamilies using MMseqs2. G3 (BETHESDA, MD.) 2022; 12:6717792. [PMID: 36161315 PMCID: PMC9635663 DOI: 10.1093/g3journal/jkac233] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/01/2022] [Accepted: 08/30/2022] [Indexed: 06/09/2023]
Abstract
The diversity and mosaic architecture of phage genomes present challenges for whole-genome phylogenies and comparative genomics. There are no universally conserved core genes, ∼70% of phage genes are of unknown function, and phage genomes are replete with small (<500 bp) open reading frames. Assembling sequence-related genes into "phamilies" ("phams") based on amino acid sequence similarity simplifies comparative phage genomics and facilitates representations of phage genome mosaicism. With the rapid and substantial increase in the numbers of sequenced phage genomes, computationally efficient pham assembly is needed, together with strategies for including newly sequenced phage genomes. Here, we describe the Python package PhaMMseqs, which uses MMseqs2 for pham assembly, and we evaluate the key parameters for optimal pham assembly of sequence- and functionally related proteins. PhaMMseqs runs efficiently with only modest hardware requirements and integrates with the pdm_utils package for simple genome entry and export of datasets for evolutionary analyses and phage genome map construction.
Affiliation(s)
- Christian H Gauthier
  - Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Steven G Cresawn
  - Department of Biology, James Madison University, Harrisonburg, VA 22807, USA
- Graham F Hatfull
  - Corresponding author: Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260, USA.
3. In-Silico Evaluation of a New Gene From Wheat Reveals the Divergent Evolution of the CAP160 Homologous Genes Into Monocots. J Mol Evol 2019; 88:151-163. [PMID: 31820048 DOI: 10.1007/s00239-019-09920-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/27/2019] [Accepted: 11/19/2019] [Indexed: 10/25/2022]
Abstract
This study reports the evolutionary history and in-silico functional characterization of a novel water-deficit and ABA-responsive gene in wheat. This gene has remote sequence similarity to known abiotic stress-related genes in different plants, including CAP160 in Spinacia oleracea, RD29B in Arabidopsis thaliana, and CDeT11-24 in Craterostigma plantagineum. The study investigated if these genes form a close homologous relationship or if they are a result of convergent evolutionary processes. The results indicated a closely shared homologous relationship between these genes. Bayesian phylogenetic analysis of the protein sequences of the remotely related CAP160 proteins from various plant species indicated the presence of three distinct clades. Further analyses indicated that CAP160 homologous genes have predominantly evolved through neutral processes, with multiple regions experiencing signatures of purifying selection, while others were indicated to be the result of episodic diversifying selection events. Functional predictions revealed that these genes might share at least two functions related to abiotic stress conditions: one similar to the cryoprotective function of LEA protein, and the other a signalling molecule with phosphatidic acid binding specificity. Studies focused on the identification of cold-responsive genes are essential for the development of cold-tolerant crop plants, if we are to increase agricultural productivity throughout temperate regions.
4. Li L, Bansal MS. An Integrated Reconciliation Framework for Domain, Gene, and Species Level Evolution. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:63-76. [PMID: 29994126 DOI: 10.1109/tcbb.2018.2846253] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 06/08/2023]
Abstract
The majority of genes in eukaryotes consists of one or more protein domains that can be independently lost or gained during evolution. This gain and loss of protein domains, through domain duplications, transfers, or losses, has important evolutionary and functional consequences. Yet, even though it is well understood that domains evolve inside genes and genes inside species, there do not exist any computational frameworks to simultaneously model the evolution of domains, genes, and species and account for their inter-dependency. Here, we develop an integrated model of domain evolution that explicitly captures the interdependence of domain-, gene-, and species-level evolution. Our model extends the classical phylogenetic reconciliation framework, which infers gene family evolution by comparing gene trees and species trees, by explicitly considering domain-level evolution and decoupling domain-level events from gene-level events. In this paper, we (i) introduce the new integrated reconciliation framework, (ii) prove that the associated optimization problem is NP-hard, (iii) devise an efficient heuristic solution for the problem, (iv) apply our algorithm to a large biological dataset, and (v) demonstrate the impact of using our new computational framework compared to existing approaches. The implemented software is freely available from http://compbio.engr.uconn.edu/software/seadog/.
5. GenFamClust: an accurate, synteny-aware and reliable homology inference algorithm. BMC Evol Biol 2016; 16:120. [PMID: 27260514 PMCID: PMC4893229 DOI: 10.1186/s12862-016-0684-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 12/02/2015] [Accepted: 05/12/2016] [Indexed: 11/24/2022]
Abstract
Background: Homology inference is pivotal to evolutionary biology and is primarily based on significant sequence similarity, which, in general, is a good indicator of homology. Algorithms have also been designed to utilize conservation in gene order as an indication of homologous regions. We have developed GenFamClust, a method based on quantification of both gene order conservation and sequence similarity.
Results: In this study, we validate GenFamClust by comparing it to well-known homology inference algorithms on a synthetic dataset. We applied several popular clustering algorithms on homologs inferred by GenFamClust and other algorithms on a metazoan dataset and studied the outcomes. Accuracy, similarity, dependence, and other characteristics were investigated for gene families yielded by the clustering algorithms. GenFamClust was also applied to genes from a set of complete fungal genomes and gene families were inferred using clustering. The resulting gene families were compared with a manually curated gold standard of pillars from the Yeast Gene Order Browser. We found that the gene-order component of GenFamClust is simple, yet biologically realistic, and captures local synteny information for homologs.
Conclusions: The study shows that GenFamClust is a more accurate, informed, and comprehensive pipeline to infer homologs and gene families than other commonly used homology and gene-family inference methods.
Electronic supplementary material: The online version of this article (doi:10.1186/s12862-016-0684-2) contains supplementary material, which is available to authorized users.
6. Bitard-Feildel T, Kemena C, Greenwood JM, Bornberg-Bauer E. Domain similarity based orthology detection. BMC Bioinformatics 2015; 16:154. [PMID: 25968113 PMCID: PMC4443542 DOI: 10.1186/s12859-015-0570-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 10/01/2014] [Accepted: 04/10/2015] [Indexed: 11/10/2022]
Abstract
Background: Orthologous protein detection software mostly uses pairwise comparisons of amino-acid sequences to assert whether two proteins are orthologous or not. Accordingly, when the number of sequences for comparison increases, the number of comparisons to compute grows quadratically. A current challenge of bioinformatic research, especially when taking into account the increasing number of sequenced organisms available, is to make this ever-growing number of comparisons computationally feasible in a reasonable amount of time. We propose to speed up the detection of orthologous proteins by using strings of domains to characterize the proteins.
Results: We present two new protein similarity measures, a cosine and a maximal weight matching score based on domain content similarity, and new software, named porthoDom. The qualities of the cosine and the maximal weight matching similarity measures are compared against curated datasets. The measures show that domain content similarities are able to correctly group proteins into their families. Accordingly, the cosine similarity measure is used inside porthoDom, the wrapper developed for proteinortho. porthoDom makes use of domain content similarity measures to group proteins together before searching for orthologs. By using domains instead of amino acid sequences, the reduction of the search space decreases the computational complexity of an all-against-all sequence comparison.
Conclusion: We demonstrate that representing and comparing proteins as strings of discrete domains, i.e. as a concatenation of their unique identifiers, allows a drastic simplification of search space. porthoDom has the advantage of speeding up orthology detection while maintaining a degree of accuracy similar to proteinortho. The implementation of porthoDom is released in Python and C++ and is available under the GNU GPL licence 3 at http://www.bornberglab.org/pages/porthoda.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0570-8) contains supplementary material, which is available to authorized users.
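The cosine measure on domain content described in this abstract can be illustrated with a minimal sketch. This is not the porthoDom implementation: the function name `domain_cosine`, the use of raw domain counts as vector components, and the Pfam-style identifiers are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def domain_cosine(domains_a, domains_b):
    """Cosine similarity between two proteins given as lists of domain identifiers.

    Each protein is treated as a vector of domain counts (an assumed
    weighting; the published tool may weight domains differently).
    """
    counts_a, counts_b = Counter(domains_a), Counter(domains_b)
    # Dot product over the domains both proteins share.
    dot = sum(counts_a[d] * counts_b[d] for d in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a protein without annotated domains matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical domain strings for three proteins.
protein_x = ["PF00069", "PF07714", "PF00069"]
protein_y = ["PF00069", "PF07714"]
protein_z = ["PF00400"]

score_xy = domain_cosine(protein_x, protein_y)  # shared domains, high score
score_xz = domain_cosine(protein_x, protein_z)  # no shared domains, 0.0
```

Because the score depends only on short lists of identifiers, proteins can be grouped cheaply before any amino-acid alignment is attempted, which is the search-space reduction the abstract refers to.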
Affiliation(s)
- Tristan Bitard-Feildel
  - Institute for Evolution and Biodiversity, University of Münster, Hüfferstr. 1, Münster, Germany.
- Carsten Kemena
  - Institute for Evolution and Biodiversity, University of Münster, Hüfferstr. 1, Münster, Germany.
- Jenny M Greenwood
  - Institute for Evolution and Biodiversity, University of Münster, Hüfferstr. 1, Münster, Germany.
- Erich Bornberg-Bauer
  - Institute for Evolution and Biodiversity, University of Münster, Hüfferstr. 1, Münster, Germany.
7. Doerr D, Stoye J, Böcker S, Jahn K. Identifying gene clusters by discovering common intervals in indeterminate strings. BMC Genomics 2015; 15 Suppl 6:S2. [PMID: 25571793 PMCID: PMC4274641 DOI: 10.1186/1471-2164-15-s6-s2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
Background: Comparative analyses of chromosomal gene orders are successfully used to predict gene clusters in bacterial and fungal genomes. Present models for detecting sets of co-localized genes in chromosomal sequences require prior knowledge of gene family assignments of genes in the dataset of interest. These families are often computationally predicted on the basis of sequence similarity or higher order features of gene products. Errors introduced in this process amplify in subsequent gene order analyses and thus may deteriorate gene cluster prediction.
Results: In this work, we present a new dynamic model and efficient computational approaches for gene cluster prediction suitable in scenarios ranging from traditional gene family-based gene cluster prediction, via multiple conflicting gene family annotations, to gene family-free analysis, in which gene clusters are predicted solely on the basis of a pairwise similarity measure of the genes of different genomes. We evaluate our gene family-free model against a gene family-based model on a dataset of 93 bacterial genomes.
Conclusions: Our model is able to detect gene clusters that would be also detected with well-established gene family-based approaches. Moreover, we show that it is able to detect conserved regions which are missed by gene family-based methods due to wrong or deficient gene family assignments.
8. Massive fungal biodiversity data re-annotation with multi-level clustering. Sci Rep 2014; 4:6837. [PMID: 25355642 PMCID: PMC4213798 DOI: 10.1038/srep06837] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 07/21/2014] [Accepted: 10/10/2014] [Indexed: 11/08/2022]
Abstract
With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identification of large groups of sequences sharing common traits. Hence, there is a need for clustering tools for automatic knowledge extraction enabling the curation of large-scale databases. Current sophisticated approaches on sequence clustering are based on pairwise similarity matrices. This is impractical for databases of hundreds of thousands of sequences as such a similarity matrix alone would exceed the available memory. In this paper, a new approach called MultiLevel Clustering (MLC) is proposed which avoids a majority of sequence comparisons, and therefore, significantly reduces the total runtime for clustering. An implementation of the algorithm allowed clustering of all 344,239 ITS (Internal Transcribed Spacer) fungal sequences from GenBank utilizing only a normal desktop computer within 22 CPU-hours whereas the greedy clustering method took up to 242 CPU-hours.
9. Zheng C, Kononenko A, Leebens-Mack J, Lyons E, Sankoff D. Gene families as soft cliques with backbones: Amborella contrasted with other flowering plants. BMC Genomics 2014; 15 Suppl 6:S8. [PMID: 25572777 PMCID: PMC4240082 DOI: 10.1186/1471-2164-15-s6-s8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/22/2022]
Abstract
Background: Chaining is a major problem in constructing gene families.
Results: We define a new kind of cluster on graphs with strong and weak edges: soft cliques with backbones (SCWiB). This differs from other definitions in how it controls the "chaining effect", by ensuring clusters satisfy a tolerant edge density criterion that takes into account cluster size. We implement algorithms for decomposing a graph of similarities into SCWiBs. We compare examples of output from SCWiB and the Markov Cluster Algorithm (MCL), and also compare some curated Arabidopsis thaliana gene families with the results of automatic clustering. We apply our method to 44 published angiosperm genomes with annotation, and discover that Amborella trichopoda is distinct from all the others in having substantially and systematically smaller proportions of moderate- and large-size gene families.
Conclusions: We offer several possible evolutionary explanations for this result.
Affiliation(s)
- Chunfang Zheng
  - Department of Mathematics and Statistics, University of Ottawa, 585 King Edward Avenue, Ottawa, Canada, K1N 6N5
- Alexey Kononenko
  - Department of Mathematics and Statistics, University of Ottawa, 585 King Edward Avenue, Ottawa, Canada, K1N 6N5
- Jim Leebens-Mack
  - Department of Plant Biology, University of Georgia, Athens, GA 30602-7271, USA
- Eric Lyons
  - The School of Plant Sciences, University of Arizona, Tucson, AZ 85721 USA
- David Sankoff
  - Department of Mathematics and Statistics, University of Ottawa, 585 King Edward Avenue, Ottawa, Canada, K1N 6N5
10. Ali RH, Muhammad S, Khan M, Arvestad L. Quantitative synteny scoring improves homology inference and partitioning of gene families. BMC Bioinformatics 2014; 14 Suppl 15:S12. [PMID: 24564516 PMCID: PMC3852004 DOI: 10.1186/1471-2105-14-s15-s12] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
Background: Clustering sequences into families has long been an important step in characterization of genes and proteins. There are many algorithms developed for this purpose, most of which are based on either direct similarity between gene pairs or some sort of network structure, where weights on edges of constructed graphs are based on similarity. However, conserved synteny is an important signal that can help distinguish homology and it has not been utilized to its fullest potential.
Results: Here, we present GenFamClust, a pipeline that combines the network properties of sequence similarity and synteny to assess homology relationships and merge known homologs into groups of gene families. GenFamClust identifies homologs in a more informed and accurate manner as compared to similarity-based approaches. We tested our method against the Neighborhood Correlation method on two diverse datasets consisting of fully sequenced genomes of eukaryotes and synthetic data.
Conclusions: The results obtained from both datasets confirm that synteny helps determine homology and that GenFamClust improves on the Neighborhood Correlation method. The accuracy as well as the definition of synteny scores is the most valuable contribution of GenFamClust.
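To make the idea of a quantitative synteny score concrete, here is a minimal sketch of one simple way to score local gene-order conservation. It is an illustrative assumption, not the scoring defined by GenFamClust: for two genes, it takes the best similarity found among pairs of their chromosomal neighbours within a window of `k` positions. The function name, the window size, and the toy similarity values are hypothetical.

```python
def synteny_score(genome_a, idx_a, genome_b, idx_b, similarity, k=2):
    """Best similarity among neighbour pairs within k positions of two genes.

    genome_a / genome_b: gene identifiers in chromosomal order.
    similarity: dict mapping frozenset({gene1, gene2}) to a score in [0, 1];
    missing pairs count as 0.0 (an assumption of this sketch).
    """
    def neighbours(genome, idx):
        lo, hi = max(0, idx - k), min(len(genome), idx + k + 1)
        # The gene itself is excluded: only its surroundings are compared.
        return [genome[i] for i in range(lo, hi) if i != idx]

    best = 0.0
    for gene_a in neighbours(genome_a, idx_a):
        for gene_b in neighbours(genome_b, idx_b):
            best = max(best, similarity.get(frozenset((gene_a, gene_b)), 0.0))
    return best


# Toy data: two short gene orders and pairwise similarity scores.
genome_a = ["a1", "a2", "a3"]
genome_b = ["b1", "b2", "b3"]
similarity = {
    frozenset(("a1", "b1")): 0.9,
    frozenset(("a2", "b2")): 0.4,
    frozenset(("a3", "b3")): 0.7,
}

# a2 and b2 are only weakly similar themselves, but their flanking genes
# match strongly, so the neighbourhood supports homology.
middle = synteny_score(genome_a, 1, genome_b, 1, similarity)
```

A score of this kind can be combined with the direct similarity of the two genes, so that a weak sequence match in a conserved neighbourhood is treated differently from the same match in unrelated surroundings.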
11. Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences. PLoS One 2013; 8:e75458. [PMID: 24069417 PMCID: PMC3771926 DOI: 10.1371/journal.pone.0075458] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 01/08/2013] [Accepted: 08/19/2013] [Indexed: 11/19/2022]
Abstract
Identifying shared sequence segments along amino acid sequences generally requires a collection of closely related proteins, most often curated manually from the sequence datasets to suit the purpose at hand. Currently developed statistical methods are strained, however, when the collection contains remote sequences with poor alignment to the rest, or sequences containing multiple domains. In this paper, we propose a completely unsupervised and automated method to identify the shared sequence segments observed in a diverse collection of protein sequences including those present in a smaller fraction of the sequences in the collection, using a combination of sequence alignment, residue conservation scoring and graph-theoretical approaches. Since shared sequence fragments often imply conserved functional or structural attributes, the method produces a table of associations between the sequences and the identified conserved regions that can reveal previously unknown protein families as well as new members to existing ones. We evaluated the biological relevance of the method by clustering the proteins in gold standard datasets and assessing the clustering performance in comparison with previous methods from the literature. We have then applied the proposed method to a genome wide dataset of 17793 human proteins and generated a global association map to each of the 4753 identified conserved regions. Investigations on the major conserved regions revealed that they corresponded strongly to annotated structural domains. This suggests that the method can be useful in predicting novel domains on protein sequences.
12. Ramos-Silva P, Kaandorp J, Huisman L, Marie B, Zanella-Cléon I, Guichard N, Miller DJ, Marin F. The skeletal proteome of the coral Acropora millepora: the evolution of calcification by co-option and domain shuffling. Mol Biol Evol 2013; 30:2099-112. [PMID: 23765379 PMCID: PMC3748352 DOI: 10.1093/molbev/mst109] [Citation(s) in RCA: 122] [Impact Index Per Article: 10.2] [Indexed: 11/29/2022]
Abstract
In corals, biocalcification is a major function that may be drastically affected by ocean acidification (OA). Scleractinian corals grow by building up aragonitic exoskeletons that provide support and protection for soft tissues. Although this process has been extensively studied, the molecular basis of biocalcification is poorly understood. Notably lacking is a comprehensive catalog of the skeleton-occluded proteins, the skeletal organic matrix proteins (SOMPs), that are thought to regulate the mineral deposition. Using a combination of proteomics and transcriptomics, we report the first survey of such proteins in the staghorn coral Acropora millepora. The organic matrix (OM) extracted from the coral skeleton was analyzed by mass spectrometry and bioinformatics, enabling the identification of 36 SOMPs. These results provide novel insights into the molecular basis of coral calcification and the macroevolution of metazoan calcifying systems, while establishing a platform for studying the impact of OA at the molecular level. Besides secreted proteins, extracellular regions of transmembrane proteins are also present, suggesting a close control of aragonite deposition by the calicoblastic epithelium. In addition to the expected SOMPs (Asp/Glu-rich, galaxins), the skeletal repertoire included several proteins containing known extracellular matrix domains. From an evolutionary perspective, the number of coral-specific proteins is low, many SOMPs having counterparts in the noncalcifying cnidarians. Extending the comparison with the skeletal OM proteomes of other metazoans allowed the identification of a pool of functional domains shared between phyla. These data suggest that co-option and domain shuffling may be general mechanisms by which the trait of calcification has evolved.
Affiliation(s)
- Paula Ramos-Silva
  - UMR 6282 CNRS, Biogéosciences, Université de Bourgogne, Dijon, France
13. Lopez FJ, Bernabeu M, Fernandez-Becerra C, del Portillo HA. A new computational approach redefines the subtelomeric vir superfamily of Plasmodium vivax. BMC Genomics 2013; 14:8. [PMID: 23324551 PMCID: PMC3566924 DOI: 10.1186/1471-2164-14-8] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Received: 08/23/2012] [Accepted: 01/02/2013] [Indexed: 01/20/2023]
Abstract
Background: Subtelomeric multigene families of malaria parasites encode virulent determinants. The published genome sequence of Plasmodium vivax revealed the largest subtelomeric multigene family of human malaria parasites, the vir super-family, presently composed of 346 vir genes subdivided into 12 different subfamilies based on sequence homologies detected by BLAST.
Results: A novel computational approach was used to redefine vir genes. First, a protein-weighted graph was built based on BLAST alignments. This graph was processed to ensure that edge weights are not exclusively based on the BLAST score between the two corresponding proteins, but strongly dependent on their graph neighbours and their associations. Then the Markov Clustering Algorithm was applied to the protein graph. Next, the Homology Block concept was used to further validate this clustering approach. Finally, proteome-wide analysis was carried out to predict new VIR members. Results showed that (i) three previous subfamilies can no longer be classified as vir genes; (ii) most previously unclustered vir genes were clustered into vir subfamilies; (iii) 39 hypothetical proteins were predicted as VIR proteins; (iv) many of these findings are supported by several lines of structural and functional evidence, sub-cellular localization studies, gene expression analysis and chromosome localization; (v) this approach can be used to study other multigene families in malaria.
Conclusions: This methodology, resource and new classification of vir genes will contribute to a new structural framing of this multigene family and other multigene families of malaria parasites, facilitating the design of experiments to understand their role in pathology, which in turn may help furthering vaccine development.
Affiliation(s)
- Francisco Javier Lopez
  - Barcelona Centre for International Health Research, CRESIB, Hospital Clínic-Universitat de Barcelona, Roselló 153, 1a planta CEK Building, 08036 Barcelona, Spain
14. Miele V, Penel S, Duret L. Ultra-fast sequence clustering from similarity networks with SiLiX. BMC Bioinformatics 2011; 12:116. [PMID: 21513511 PMCID: PMC3095554 DOI: 10.1186/1471-2105-12-116] [Citation(s) in RCA: 212] [Impact Index Per Article: 15.1] [Received: 10/28/2010] [Accepted: 04/22/2011] [Indexed: 01/04/2023]
Abstract
Background: The number of gene sequences that are available for comparative genomics approaches is increasing extremely quickly. A current challenge is to be able to handle this huge amount of sequences in order to build families of homologous sequences in a reasonable time.
Results: We present the software package SiLiX that implements a novel method which reconsiders single linkage clustering with a graph theoretical approach. A parallel version of the algorithms is also presented. As a demonstration of the ability of our software, we clustered more than 3 million sequences from about 2 billion BLAST hits in 7 minutes, with a high clustering quality, both in terms of sensitivity and specificity.
Conclusions: Comparing state-of-the-art software, SiLiX presents the best up-to-date capabilities to face the problem of clustering large collections of sequences. SiLiX is freely available at http://lbbe.univ-lyon1.fr/SiLiX.
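Single linkage clustering viewed as a graph problem amounts to finding connected components of the similarity network: two sequences share a family if any chain of accepted hits joins them. The sketch below shows that idea with a union-find structure. It is a generic illustration and not the SiLiX code; the function name is hypothetical, and the hits are assumed to have already been filtered by whatever similarity criteria the user applies.

```python
def single_linkage_families(sequence_ids, hits):
    """Group sequences into families as connected components of the hit graph.

    sequence_ids: iterable of sequence identifiers.
    hits: iterable of (id_a, id_b) pairs that passed the similarity filter.
    Returns a list of sets, ordered by their sorted member names.
    """
    sequence_ids = list(sequence_ids)
    parent = {seq: seq for seq in sequence_ids}

    def find(seq):
        # Follow parent links to the root, halving the path on the way.
        while parent[seq] != seq:
            parent[seq] = parent[parent[seq]]
            seq = parent[seq]
        return seq

    for id_a, id_b in hits:
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two families

    families = {}
    for seq in sequence_ids:
        families.setdefault(find(seq), set()).add(seq)
    return sorted(families.values(), key=sorted)


# Toy example: s1-s2-s3 are chained by hits, s4-s5 form a second family.
ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
hits = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]
families = single_linkage_families(ids, hits)
```

Each hit is processed once and never stored as a full similarity matrix, which is why this formulation scales to very large hit lists; note that s1 and s3 end up together without a direct hit between them, which is the defining property of single linkage.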
Affiliation(s)
- Vincent Miele
  - Laboratoire Biométrie et Biologie Evolutive, Université de Lyon, Université Lyon 1, CNRS, INRIA, UMR5558, Villeurbanne, France.
15. Baumbach J. On the power and limits of evolutionary conservation--unraveling bacterial gene regulatory networks. Nucleic Acids Res 2010; 38:7877-84. [PMID: 20699275 PMCID: PMC3001071 DOI: 10.1093/nar/gkq699] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Indexed: 11/12/2022]
Abstract
The National Center for Biotechnology Information (NCBI) recently announced ‘1000 prokaryotic genomes are now completed and available in the Genome database’. The increasing trend will provide us with thousands of sequenced microbial organisms over the next years. However, this is only the first step in understanding how cells survive, reproduce and adapt their behavior while being exposed to changing environmental conditions. One major control mechanism is transcriptional gene regulation. Here, striking is the direct juxtaposition of the handful of bacterial model organisms to the 1000 prokaryotic genomes. Next-generation sequencing technologies will further widen this gap drastically. However, several computational approaches have proven to be helpful. The main idea is to use the known transcriptional regulatory network of reference organisms as template in order to unravel evolutionarily conserved gene regulations in newly sequenced species. This transfer essentially depends on the reliable identification of several types of conserved DNA sequences. We decompose this problem into three computational processes, review the state of the art and illustrate future perspectives.
Affiliation(s)
- Jan Baumbach
  - Algorithms Group, International Computer Science Institute, Berkeley, USA.