1
Benítez-Hidalgo A, Aldana-Montes JF, Navas-Delgado I, Roldán-García MDM. SALON ontology for the formal description of sequence alignments. BMC Bioinformatics 2023; 24:69. [PMID: 36849882 PMCID: PMC9972671 DOI: 10.1186/s12859-023-05190-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2021] [Accepted: 02/15/2023] [Indexed: 03/01/2023] Open
Abstract
BACKGROUND Information provided by high-throughput sequencing platforms allows the collection of content-rich data about biological sequences and their context. Sequence alignment is a bioinformatics approach to identifying regions of similarity in DNA, RNA, or protein sequences. However, there is no consensus on a common terminology and representation for sequence alignments. Thus, automatically linking the extensive existing knowledge about the sequences with the alignments is challenging. RESULTS The Sequence Alignment Ontology (SALON) defines a helpful vocabulary for representing and semantically annotating pairwise and multiple sequence alignments. SALON is an OWL 2 ontology that supports automated reasoning for alignment validation and for retrieving complementary information from public databases under the Open Linked Data approach. This will reduce the effort needed by scientists to interpret sequence alignment results. CONCLUSIONS SALON defines a full range of controlled terminology in the domain of sequence alignments. It can be used as a mediated schema to integrate data from different sources and validate acquired knowledge.
Affiliation(s)
- Antonio Benítez-Hidalgo
- Departamento de Lenguajes y Ciencias de la Computación, University of Málaga, Málaga, Spain; University of Málaga, ITIS Software, Ada Byron Research Building, Málaga, Spain; Instituto de Investigación Biomédica de Málaga - IBIMA, Málaga, Spain
- José F. Aldana-Montes
- Departamento de Lenguajes y Ciencias de la Computación, University of Málaga, Málaga, Spain; University of Málaga, ITIS Software, Ada Byron Research Building, Málaga, Spain; Instituto de Investigación Biomédica de Málaga - IBIMA, Málaga, Spain
- Ismael Navas-Delgado
- Departamento de Lenguajes y Ciencias de la Computación, University of Málaga, Málaga, Spain; University of Málaga, ITIS Software, Ada Byron Research Building, Málaga, Spain; Instituto de Investigación Biomédica de Málaga - IBIMA, Málaga, Spain
- María del Mar Roldán-García
- Departamento de Lenguajes y Ciencias de la Computación, University of Málaga, Málaga, Spain; University of Málaga, ITIS Software, Ada Byron Research Building, Málaga, Spain; Instituto de Investigación Biomédica de Málaga - IBIMA, Málaga, Spain
2
Zhan Q, Fu Y, Jiang Q, Liu B, Peng J, Wang Y. SpliVert: A Protein Multiple Sequence Alignment Refinement Method Based on Splitting-Splicing Vertically. Protein Pept Lett 2020; 27:295-302. [PMID: 31385760 DOI: 10.2174/0929866526666190806143959] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/06/2019] [Revised: 04/26/2019] [Accepted: 06/14/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Multiple Sequence Alignment (MSA) is a fundamental task in bioinformatics and is required for many biological analysis tasks. The more accurate the alignments are, the more credible the downstream analyses. Most protein MSA algorithms refine an alignment by dividing it horizontally into two groups and then realigning the two groups. However, this strategy does not consider that different regions of the sequences have different conservation; this property may lead to incorrect residue-residue or residue-gap pairs, which cannot be corrected by this strategy. OBJECTIVE In this article, our motivation is to develop a novel refinement method based on splitting-splicing vertically. METHODS Here, we present a novel refinement method based on splitting-splicing vertically, called SpliVert. For an alignment, we split it vertically into 3 parts, remove the gap characters in the middle part, realign the middle part alone, and splice the realigned middle part with the other two initial pieces to obtain a refined alignment. In the realignment procedure of our method, the aligner focuses only on a certain part, ignoring the disturbance of the other parts, which could help fix the incorrect pairs. RESULTS We tested our refinement strategy for 2 leading MSA tools on 3 standard benchmarks, according to the commonly used average SP (and TC) score. The results show that, given appropriate proportions to split the initial alignment, the average scores after using our method are comparable or slightly higher. We also compared the alignments refined by our method with alignments directly refined by the original alignment tools. The results suggest that using our SpliVert method to refine alignments can also outperform direct use of the original alignment tools. CONCLUSION The results reveal that splitting vertically and realigning part of the alignment is a good strategy for the refinement of protein multiple sequence alignments.
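The split, realign and splice steps described in the METHODS part of this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the default split fractions, and the `pad_realign` stand-in (which only right-pads with gaps instead of calling a real aligner) are all assumptions made for the example.

```python
def split_vertically(alignment, left_frac, right_frac):
    """Cut every aligned row at the same two columns: left, middle, right."""
    width = len(alignment[0])
    a = int(width * left_frac)
    b = int(width * right_frac)
    left = [row[:a] for row in alignment]
    middle = [row[a:b] for row in alignment]
    right = [row[b:] for row in alignment]
    return left, middle, right


def strip_gaps(block):
    """Remove gap characters so the block can be realigned from scratch."""
    return [row.replace("-", "") for row in block]


def pad_realign(seqs):
    """Hypothetical placeholder for a real MSA tool: pads rows to equal width."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, "-") for s in seqs]


def refine(alignment, left_frac=1 / 3, right_frac=2 / 3, realign=pad_realign):
    """Realign only the middle block, then splice it back between the flanks."""
    left, middle, right = split_vertically(alignment, left_frac, right_frac)
    new_middle = realign(strip_gaps(middle))
    return [l + m + r for l, m, r in zip(left, new_middle, right)]
```

In practice `realign` would wrap an external alignment tool; the only contract assumed here is that it takes unaligned sequences and returns rows of equal length in the same order.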
Affiliation(s)
- Qing Zhan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Yilei Fu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Qinghua Jiang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Bo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
3
Zhan Q, Wang N, Jin S, Tan R, Jiang Q, Wang Y. ProbPFP: a multiple sequence alignment algorithm combining hidden Markov model optimized by particle swarm optimization with partition function. BMC Bioinformatics 2019; 20:573. [PMID: 31760933 PMCID: PMC6876095 DOI: 10.1186/s12859-019-3132-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND When conducting multiple sequence alignment, it is essential to use the substitution score of pairwise alignment. To compute adaptive scores for alignment, researchers usually use a hidden Markov model (HMM) or probabilistic consistency methods such as the partition function. Recent studies show that optimizing the parameters of the HMM, as well as integrating the HMM with the partition function, can raise the accuracy of alignment. However, these studies did not consider combining the partition function with an optimized HMM, which could further improve alignment accuracy. RESULTS A novel algorithm for MSA called ProbPFP is presented in this paper. It integrates an HMM optimized by particle swarm optimization (PSO) with the partition function. PSO was applied to optimize the HMM's parameters. The posterior probability obtained by the HMM was then combined with the one obtained by the partition function to calculate an integrated substitution score for alignment. To evaluate the effectiveness of ProbPFP, we compared it with 13 outstanding or classic MSA methods. The results demonstrate that the alignments obtained by ProbPFP achieved the highest mean TC scores and mean SP scores on two benchmark datasets, SABmark and OXBench, and the second highest mean TC scores and mean SP scores on the benchmark dataset BAliBASE. ProbPFP was also compared with 4 other outstanding methods by reconstructing the phylogenetic trees for six protein families extracted from the database TreeFam, based on the alignments obtained by these 5 methods. The results indicate that the phylogenetic trees reconstructed from the alignments obtained by ProbPFP are closer to the reference trees than those of the other methods. CONCLUSIONS We propose a new multiple sequence alignment method combining an optimized HMM and the partition function. The results show that this method can substantially improve alignment accuracy.
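The SP (sum-of-pairs) and TC (total-column) scores used in this evaluation compare a test alignment against a reference alignment of the same sequences. The sketch below follows the usual benchmark definitions as an assumption, since the abstract does not spell them out: SP is the fraction of reference residue pairs that the test alignment also aligns, and TC is the fraction of reference columns reproduced exactly. Counting only reference columns with at least two residues for TC is a further simplifying assumption, and all function names are illustrative.

```python
from itertools import combinations


def column_members(alignment):
    """For each column, list the (row, residue_index) entries that are not gaps."""
    counters = [0] * len(alignment)
    columns = []
    for col in range(len(alignment[0])):
        members = []
        for row, seq in enumerate(alignment):
            if seq[col] != "-":
                members.append((row, counters[row]))
                counters[row] += 1
        columns.append(tuple(members))
    return columns


def aligned_pairs(alignment):
    """All residue pairs that share a column."""
    pairs = set()
    for members in column_members(alignment):
        pairs.update(combinations(members, 2))
    return pairs


def sp_score(test, reference):
    """Fraction of reference residue pairs also aligned in the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


def tc_score(test, reference):
    """Fraction of multi-residue reference columns reproduced exactly."""
    ref_cols = [c for c in column_members(reference) if len(c) > 1]
    if not ref_cols:
        return 0.0
    test_cols = set(column_members(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)
```

Both functions assume the test and reference alignments list the same sequences in the same row order.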
Affiliation(s)
- Qing Zhan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
- Nan Wang
- Department of Mathematics, Harbin Institute of Technology, Harbin, 150001, China
- Shuilin Jin
- Department of Mathematics, Harbin Institute of Technology, Harbin, 150001, China
- Renjie Tan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
- Qinghua Jiang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
- Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
4
McSkimming DI, Dastgheib S, Baffi TR, Byrne DP, Ferries S, Scott ST, Newton AC, Eyers CE, Kochut KJ, Eyers PA, Kannan N. KinView: a visual comparative sequence analysis tool for integrated kinome research. Molecular BioSystems 2016; 12:3651-3665. [PMID: 27731453 PMCID: PMC5508867 DOI: 10.1039/c6mb00466k] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Indexed: 12/29/2022]
Abstract
Multiple sequence alignments (MSAs) are a fundamental analysis tool used throughout biology to investigate relationships between protein sequence, structure, function, evolutionary history, and patterns of disease-associated variants. However, their widespread application in systems biology research is currently hindered by the lack of user-friendly tools to simultaneously visualize, manipulate and query the information conceptualized in large sequence alignments, and the challenges in integrating MSAs with multiple orthogonal data such as cancer variants and post-translational modifications, which are often stored in heterogeneous data sources and formats. Here, we present the Multiple Sequence Alignment Ontology (MSAOnt), which represents a profile or consensus alignment in an ontological format. Subsets of the alignment are easily selected through the SPARQL Protocol and RDF Query Language for downstream statistical analysis or visualization. We have also created the Kinome Viewer (KinView), an interactive integrative visualization that places eukaryotic protein kinase cancer variants in the context of natural sequence variation and experimentally determined post-translational modifications, which play central roles in the regulation of cellular signaling pathways. Using KinView, we identified differential phosphorylation patterns between tyrosine and serine/threonine kinases in the activation segment, a major kinase regulatory region that is often mutated in proliferative diseases. We discuss cancer variants that disrupt phosphorylation sites in the activation segment, and show how KinView can be used as a comparative tool to identify differences and similarities in natural variation, cancer variants and post-translational modifications between kinase groups, families and subfamilies. Based on KinView comparisons, we identify and experimentally characterize a regulatory tyrosine (Y177 of PLK4) in the PLK4 C-terminal activation segment region termed the P+1 loop. To further demonstrate the application of KinView in hypothesis generation and testing, we formulate and validate a hypothesis explaining a novel predicted loss-of-function variant (D523N in PKCβ) in the regulatory spine of PKCβ, a recently identified tumor suppressor kinase. KinView provides a novel, extensible interface for performing comparative analyses between subsets of kinases and for integrating multiple types of residue-specific annotations in user-friendly formats.
Affiliation(s)
- Shima Dastgheib
- Department of Computer Science, University of Georgia, Athens, GA 30602, USA
- Timothy R Baffi
- Department of Pharmacology, University of California at San Diego, La Jolla, CA 92093, USA
- Dominic P Byrne
- Department of Biochemistry, Institute of Integrative Biology, University of Liverpool, Liverpool, UK
- Samantha Ferries
- Department of Biochemistry, Institute of Integrative Biology, University of Liverpool, Liverpool, UK
- Steven Thomas Scott
- Department of Biochemistry & Molecular Biology, University of Georgia, Athens, GA 30602, USA
- Alexandra C Newton
- Department of Pharmacology, University of California at San Diego, La Jolla, CA 92093, USA
- Claire E Eyers
- Department of Biochemistry, Institute of Integrative Biology, University of Liverpool, Liverpool, UK
- Krzysztof J Kochut
- Department of Computer Science, University of Georgia, Athens, GA 30602, USA
- Patrick A Eyers
- Department of Biochemistry, Institute of Integrative Biology, University of Liverpool, Liverpool, UK
- Natarajan Kannan
- Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA, and Department of Biochemistry & Molecular Biology, University of Georgia, Athens, GA 30602, USA
5
Stoddard CD, Widmann J, Trausch JJ, Marcano-Velázquez JG, Knight R, Batey RT. Nucleotides adjacent to the ligand-binding pocket are linked to activity tuning in the purine riboswitch. J Mol Biol 2013; 425:1596-611. [PMID: 23485418 DOI: 10.1016/j.jmb.2013.02.023] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Received: 11/08/2012] [Revised: 01/31/2013] [Accepted: 02/02/2013] [Indexed: 12/20/2022]
Abstract
Direct sensing of intracellular metabolite concentrations by riboswitch RNAs provides an economical and rapid means to maintain metabolic homeostasis. Since many organisms employ the same class of riboswitch to control different genes or transcription units, it is likely that functional variation exists in riboswitches such that activity is tuned to meet cellular needs. Using a bioinformatic approach, we have identified a region of the purine riboswitch aptamer domain that displays conservation patterns linked to riboswitch activity. Aptamer domain compositions within this region can be divided into nine classes that display a spectrum of activities. Naturally occurring compositions in this region favor rapid association rate constants and slow dissociation rate constants for ligand binding. Using X-ray crystallography and chemical probing, we demonstrate that both the free and bound states are influenced by the composition of this region and that modest sequence alterations have a dramatic impact on activity. The introduction of non-natural compositions results in the inability to regulate gene expression in vivo, suggesting that aptamer domain activity is highly plastic and thus readily tunable to meet cellular needs.
Affiliation(s)
- Colby D Stoddard
- Department of Chemistry and Biochemistry, 596 UCB, University of Colorado, Boulder, CO 80309-0596, USA
6
Widmann J, Stombaugh J, McDonald D, Chocholousova J, Gardner P, Iyer MK, Liu Z, Lozupone CA, Quinn J, Smit S, Wikman S, Zaneveld JR, Knight R. RNASTAR: an RNA STructural Alignment Repository that provides insight into the evolution of natural and artificial RNAs. RNA (New York, N.Y.) 2012; 18:1319-27. [PMID: 22645380 PMCID: PMC3383963 DOI: 10.1261/rna.032052.111] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Indexed: 05/14/2023]
Abstract
Automated RNA alignment algorithms often fail to recapture the essential conserved sites that are critical for function. To assist in the refinement of these algorithms, we manually curated a set of 148 alignments with a total of 9600 unique sequences, in which each alignment was backed by at least one crystal or NMR structure. These alignments included both naturally and artificially selected molecules. We used principles of isostericity to improve the alignments from an average of 83% to 94% isosteric base pairs. We expect that this alignment collection will assist in a wide range of benchmarking efforts and provide new insight into evolutionary principles governing change in RNA structural motifs. The improved alignments have been contributed to the Rfam database.
Affiliation(s)
- Jeremy Widmann
- Department of Chemistry and Biochemistry, University of Colorado at Boulder, Boulder, Colorado 80309, USA
- Jesse Stombaugh
- Department of Chemistry and Biochemistry, University of Colorado at Boulder, Boulder, Colorado 80309, USA
- Daniel McDonald
- Biofrontiers Institute, University of Colorado at Boulder, Boulder, Colorado 80309, USA
- Jana Chocholousova
- Institute of Organic Chemistry and Biochemistry, Academy of Sciences of the Czech Republic, Prague 6, Czech Republic
- Paul Gardner
- School of Biological Sciences, University of Canterbury, Christchurch 8140, New Zealand
- Matthew K. Iyer
- Michigan Center for Translational Pathology, University of Michigan, Ann Arbor, Michigan 48109, USA
- Zongzhi Liu
- Department of Pathology Informatics, School of Medicine, Yale University, New Haven, Connecticut 06510, USA
- Catherine A. Lozupone
- Department of Chemistry and Biochemistry, University of Colorado at Boulder, Boulder, Colorado 80309, USA
- John Quinn
- Thermo Fisher Scientific, Lafayette, Colorado 80026, USA
- Sandra Smit
- Laboratory of Bioinformatics, Wageningen University, 6700 AN Wageningen, The Netherlands
- Jesse R.R. Zaneveld
- Department of Microbiology, Oregon State University, Corvallis, Oregon 97331, USA
- Rob Knight
- Department of Chemistry and Biochemistry, University of Colorado at Boulder, Boulder, Colorado 80309, USA
- Howard Hughes Medical Institute, Boulder, Colorado 80309, USA
7
Thompson JD, Linard B, Lecompte O, Poch O. A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One 2011; 6:e18093. [PMID: 21483869 PMCID: PMC3069049 DOI: 10.1371/journal.pone.0018093] [Citation(s) in RCA: 129] [Impact Index Per Article: 9.9] [Received: 11/15/2010] [Accepted: 02/21/2011] [Indexed: 12/18/2022] Open
Abstract
Multiple comparison or alignment of protein sequences has become a fundamental tool in many different domains in modern molecular biology, from evolutionary studies to prediction of 2D/3D structure, molecular function and inter-molecular interactions etc. By placing the sequence in the framework of the overall family, multiple alignments can be used to identify conserved features and to highlight differences or specificities. In this paper, we describe a comprehensive evaluation of many of the most popular methods for multiple sequence alignment (MSA), based on a new benchmark test set. The benchmark is designed to represent typical problems encountered when aligning the large protein sequence sets that result from today's high throughput biotechnologies. We show that alignment methods have significantly progressed and can now identify most of the shared sequence features that determine the broad molecular function(s) of a protein family, even for divergent sequences. However, we have identified a number of important challenges. First, the locally conserved regions, that reflect functional specificities or that modulate a protein's function in a given cellular context, are less well aligned. Second, motifs in natively disordered regions are often misaligned. Third, the badly predicted or fragmentary protein sequences, which make up a large proportion of today's databases, lead to a significant number of alignment errors. Based on this study, we demonstrate that the existing MSA methods can be exploited in combination to improve alignment accuracy, although novel approaches will still be needed to fully explore the most difficult regions. We then propose knowledge-enabled, dynamic solutions that will hopefully pave the way to enhanced alignment construction and exploitation in future evolutionary systems biology studies.
Affiliation(s)
- Julie D Thompson
- Département de Biologie Structurale et Génomique, IGBMC (Institut de Génétique et de Biologie Moléculaire et Cellulaire), CNRS/INSERM/Université de Strasbourg, Illkirch, France
8
Friedrich A, Garnier N, Gagnière N, Nguyen H, Albou LP, Biancalana V, Bettler E, Deléage G, Lecompte O, Muller J, Moras D, Mandel JL, Toursel T, Moulinier L, Poch O. SM2PH-db: an interactive system for the integrated analysis of phenotypic consequences of missense mutations in proteins involved in human genetic diseases. Hum Mutat 2010; 31:127-35. [PMID: 19921752 DOI: 10.1002/humu.21155] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 01/19/2023]
Abstract
Understanding how genetic alterations affect gene products at the molecular level represents a first step in the elucidation of the complex relationships between genotypic and phenotypic variations, and is thus a major challenge in the postgenomic era. Here, we present SM2PH-db (http://decrypthon.igbmc.fr/sm2ph), a new database designed to investigate structural and functional impacts of missense mutations and their phenotypic effects in the context of human genetic diseases. A wealth of up-to-date interconnected information is provided for each of the 2,249 disease-related entry proteins (August 2009), including data retrieved from biological databases and data generated from a Sequence-Structure-Evolution Inference in Systems-based approach, such as multiple alignments, three-dimensional structural models, and multidimensional (physicochemical, functional, structural, and evolutionary) characterizations of mutations. SM2PH-db provides a robust infrastructure associated with interactive analysis tools supporting in-depth study and interpretation of the molecular consequences of mutations, with the longer-term goal of elucidating the chain of events leading from a molecular defect to its pathology. The entire content of SM2PH-db is regularly and automatically updated thanks to the computational grid data federation facilities provided in the context of the Decrypthon program.
Affiliation(s)
- Anne Friedrich
- Département de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire (UMR7104), Centre National de la Recherche Scientifique/Institut National de la Santé et de la Recherche Médicale/Université de Strasbourg, Illkirch, France
9
Brown JW, Birmingham A, Griffiths PE, Jossinet F, Kachouri-Lafond R, Knight R, Lang BF, Leontis N, Steger G, Stombaugh J, Westhof E. The RNA structure alignment ontology. RNA (New York, N.Y.) 2009; 15:1623-31. [PMID: 19622678 PMCID: PMC2743057 DOI: 10.1261/rna.1601409] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Received: 02/15/2009] [Accepted: 05/26/2009] [Indexed: 05/19/2023]
Abstract
Multiple sequence alignments are powerful tools for understanding the structures, functions, and evolutionary histories of linear biological macromolecules (DNA, RNA, and proteins), and for finding homologs in sequence databases. We address several ontological issues related to RNA sequence alignments that are informed by structure. Multiple sequence alignments are usually shown as two-dimensional (2D) matrices, with rows representing individual sequences, and columns identifying nucleotides from different sequences that correspond structurally, functionally, and/or evolutionarily. However, the requirement that sequences and structures correspond nucleotide-by-nucleotide is unrealistic and hinders representation of important biological relationships. High-throughput sequencing efforts are also rapidly making 2D alignments unmanageable because of vertical and horizontal expansion as more sequences are added. Solving the shortcomings of traditional RNA sequence alignments requires explicit annotation of the meaning of each relationship within the alignment. We introduce the notion of "correspondence," which is an equivalence relation between RNA elements in sets of sequences as the basis of an RNA alignment ontology. The purpose of this ontology is twofold: first, to enable the development of new representations of RNA data and of software tools that resolve the expansion problems with current RNA sequence alignments, and second, to facilitate the integration of sequence data with secondary and three-dimensional structural information, as well as other experimental information, to create simultaneously more accurate and more exploitable RNA alignments.
10
Prosdocimi F, Chisham B, Pontelli E, Thompson JD, Stoltzfus A. Initial implementation of a comparative data analysis ontology. Evol Bioinform Online 2009; 5:47-66. [PMID: 19812726 PMCID: PMC2747124 DOI: 10.4137/ebo.s2320] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Indexed: 11/15/2022] Open
Abstract
Comparative analysis is used throughout biology. When entities under comparison (e.g. proteins, genomes, species) are related by descent, evolutionary theory provides a framework that, in principle, allows N-ary comparisons of entities, while controlling for non-independence due to relatedness. Powerful software tools exist for specialized applications of this approach, yet it remains under-utilized in the absence of a unifying informatics infrastructure. A key step in developing such an infrastructure is the definition of a formal ontology. The analysis of use cases and existing formalisms suggests that a significant component of evolutionary analysis involves a core problem of inferring a character history, relying on key concepts: “Operational Taxonomic Units” (OTUs), representing the entities to be compared; “character-state data” representing the observations compared among OTUs; “phylogenetic tree”, representing the historical path of evolution among the entities; and “transitions”, the inferred evolutionary changes in states of characters that account for observations. Using the Web Ontology Language (OWL), we have defined these and other fundamental concepts in a Comparative Data Analysis Ontology (CDAO). CDAO has been evaluated for its ability to represent token data sets and to support simple forms of reasoning. With further development, CDAO will provide a basis for tools (for semantic transformation, data retrieval, validation, integration, etc.) that make it easier for software developers and biomedical researchers to apply evolutionary methods of inference to diverse types of data, so as to integrate this powerful framework for reasoning into their research.
Affiliation(s)
- Francisco Prosdocimi
- Department of Structural Biology and Genomics, Institut de Génétique et de Biologie Moléculaire et Cellulaire (IGBMC), F-67400 Illkirch, France
11
Antezana E, Kuiper M, Mironov V. Biological knowledge management: the emerging role of the Semantic Web technologies. Brief Bioinform 2009; 10:392-407. [PMID: 19457869 DOI: 10.1093/bib/bbp024] [Citation(s) in RCA: 83] [Impact Index Per Article: 5.5] [Indexed: 11/12/2022] Open
Abstract
New knowledge is produced at a continuously increasing speed, and the list of papers, databases and other knowledge sources that a researcher in the life sciences needs to cope with is actually turning into a problem rather than an asset. The adequate management of knowledge is therefore becoming fundamentally important for life scientists, especially if they work with approaches that thoroughly depend on knowledge integration, such as systems biology. Several initiatives to organize biological knowledge sources into a readily exploitable resourceome are presently being carried out. Ontologies and Semantic Web technologies revolutionize these efforts. Here, we review the benefits, trends, current possibilities, and the potential this holds for the biosciences.
Affiliation(s)
- Erick Antezana
- Department of Biology at the Norwegian University of Science and Technology
12
Levasseur A, Pontarotti P, Poch O, Thompson JD. Strategies for reliable exploitation of evolutionary concepts in high throughput biology. Evol Bioinform Online 2008; 4:121-37. [PMID: 19204813 PMCID: PMC2614184 DOI: 10.4137/ebo.s597] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/27/2022] Open
Abstract
The recent availability of the complete genome sequences of a large number of model organisms, together with the immense amount of data being produced by the new high-throughput technologies, means that we can now begin comparative analyses to understand the mechanisms involved in the evolution of the genome and their consequences in the study of biological systems. Phylogenetic approaches provide a unique conceptual framework for performing comparative analyses of all this data, for propagating information between different systems and for predicting or inferring new knowledge. As a result, phylogeny-based inference systems are now playing an increasingly important role in most areas of high throughput genomics, including studies of promoters (phylogenetic footprinting), interactomes (based on the presence and degree of conservation of interacting proteins), and in comparisons of transcriptomes or proteomes (phylogenetic proximity and co-regulation/co-expression). Here we review the recent developments aimed at making automatic, reliable phylogeny-based inference feasible in large-scale projects. We also discuss how evolutionary concepts and phylogeny-based inference strategies are now being exploited in order to understand the evolution and function of biological systems. Such advances will be fundamental for the success of the emerging disciplines of systems biology and synthetic biology, and will have wide-reaching effects in applied fields such as biotechnology, medicine and pharmacology.
Affiliation(s)
- Anthony Levasseur
- Phylogenomics Laboratory, EA 3781 Evolution Biologique, Université de Provence, 13331 Marseille, France
13
Abstract
Protein sequence alignment is the task of identifying evolutionarily or structurally related positions in a collection of amino acid sequences. Although the protein alignment problem has been studied for several decades, many recent studies have demonstrated considerable progress in improving the accuracy or scalability of multiple and pairwise alignment tools, or in expanding the scope of tasks handled by an alignment program. In this chapter, we review state-of-the-art protein sequence alignment and provide practical advice for users of alignment tools.
Affiliation(s)
- Chuong B Do
- Computer Science Department, Stanford University, Stanford, CA, USA
|
14
|
Thompson JD, Muller A, Waterhouse A, Procter J, Barton GJ, Plewniak F, Poch O. MACSIMS: multiple alignment of complete sequences information management system. BMC Bioinformatics 2006; 7:318. [PMID: 16792820 PMCID: PMC1539025 DOI: 10.1186/1471-2105-7-318] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.9] [Received: 04/18/2006] [Accepted: 06/23/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In the post-genomic era, systems-level studies are being performed that seek to explain complex biological systems by integrating diverse resources from fields such as genomics, proteomics or transcriptomics. New information management systems are now needed for the collection, validation and analysis of the vast amount of heterogeneous data available. Multiple alignments of complete sequences provide an ideal environment for the integration of this information in the context of the protein family. RESULTS MACSIMS is a multiple alignment-based information management program that combines the advantages of both knowledge-based and ab initio sequence analysis methods. Structural and functional information is retrieved automatically from the public databases. In the multiple alignment, homologous regions are identified and the retrieved data is evaluated and propagated from known to unknown sequences with these reliable regions. In a large-scale evaluation, the specificity of the propagated sequence features is estimated to be >99%, i.e. very few false positive predictions are made. MACSIMS is then used to characterise mutations in a test set of 100 proteins that are known to be involved in human genetic diseases. The number of sequence features associated with these proteins was increased by 60%, compared to the features available in the public databases. An XML format output file allows automatic parsing of the MACSIMS results, while a graphical display using the JalView program allows manual analysis. CONCLUSION MACSIMS is a new information management system that incorporates detailed analyses of protein families at the structural, functional and evolutionary levels. MACSIMS thus provides a unique environment that facilitates knowledge extraction and the presentation of the most pertinent information to the biologist. A web server and the source code are available at http://bips.u-strasbg.fr/MACSIMS/.
Affiliation(s)
- Julie D Thompson
- Laboratoire de Biologie et Genomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, France
- Arnaud Muller
- The Laboratory of Molecular Biology, Genetic Analysis & Modelling, Luxembourg
- Andrew Waterhouse
- Post Genomics & Molecular Interactions Centre, School of Life Sciences, University of Dundee, UK
- Jim Procter
- Post Genomics & Molecular Interactions Centre, School of Life Sciences, University of Dundee, UK
- Geoffrey J Barton
- Post Genomics & Molecular Interactions Centre, School of Life Sciences, University of Dundee, UK
- Frédéric Plewniak
- Laboratoire de Biologie et Genomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, France
- Olivier Poch
- Laboratoire de Biologie et Genomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, France
|
15
|
Leontis NB, Altman RB, Berman HM, Brenner SE, Brown JW, Engelke DR, Harvey SC, Holbrook SR, Jossinet F, Lewis SE, Major F, Mathews DH, Richardson JS, Williamson JR, Westhof E. The RNA Ontology Consortium: an open invitation to the RNA community. RNA (New York, N.Y.) 2006; 12:533-41. [PMID: 16484377 PMCID: PMC1421088 DOI: 10.1261/rna.2343206] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.1] [Indexed: 05/05/2023]
Abstract
The aim of the RNA Ontology Consortium (ROC) is to create an integrated conceptual framework, an RNA Ontology (RO), with a common, dynamic, controlled, and structured vocabulary to describe and characterize RNA sequences, secondary structures, three-dimensional structures, and dynamics pertaining to RNA function. The RO should produce tools for clear communication about RNA structure and function for multiple uses, including the integration of RNA electronic resources into the Semantic Web. These tools should allow the accurate description in computer-interpretable form of the coupling between RNA architecture, function, and evolution. The purposes for creating the RO are, therefore, (1) to integrate sequence and structural databases; (2) to allow different computational tools to interoperate; (3) to create powerful software tools that bring advanced computational methods to the bench scientist; and (4) to facilitate precise searches for all relevant information pertaining to RNA. For example, one initial objective of the ROC is to define, identify, and classify RNA structural motifs described in the literature or appearing in databases and to agree on a computer-interpretable definition for each of these motifs. To achieve these aims, the ROC will foster communication and promote collaboration among RNA scientists by coordinating frequent face-to-face workshops to discuss, debate, and resolve difficult conceptual issues. These meeting opportunities will create new directions at various levels of RNA research. The ROC will work closely with the PDB/NDB structural databases and the Gene, Sequence, and Open Biomedical Ontology Consortia to integrate the RO with existing biological ontologies to extend existing content while maintaining interoperability.
|
16
|
Antezana E, Tsiporkova E, Mironov V, Kuiper M. A Cell-Cycle Knowledge Integration Framework. Lecture Notes in Computer Science 2006. [DOI: 10.1007/11799511_4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 01/11/2023]
|