1
Colom MS, Vucinic J, Adolf-Bryfogle J, Bowman JW, Verel S, Moczygemba I, Schiex T, Simoncini D, Bahl CD. Complete Combinatorial Mutational Enumeration of a protein functional site enables sequence-landscape mapping and identifies highly-mutated variants that retain activity. Res Sq 2023:rs.3.rs-2248327. [PMID: 36482980 PMCID: PMC9727770 DOI: 10.21203/rs.3.rs-2248327/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/17/2023]
Abstract
Understanding how proteins evolve under selective pressure is a longstanding challenge. The immensity of the search space has limited efforts to systematically evaluate the impact of multiple simultaneous mutations, so mutations have typically been assessed individually. However, epistasis, or the way in which mutations interact, prevents accurate prediction of combinatorial mutations based on measurements of individual mutations. Here, we use artificial intelligence to define the entire functional sequence landscape of a protein binding site in silico, and we call this approach Complete Combinatorial Mutational Enumeration (CCME). By leveraging CCME, we are able to construct a comprehensive map of the evolutionary connectivity within this functional sequence landscape. As a proof of concept, we applied CCME to the ACE2 binding site of the SARS-CoV-2 spike protein receptor binding domain. We selected representative variants from across the functional sequence landscape for testing in the laboratory. We identified variants that retained functionality to bind ACE2 despite changing over 40% of evaluated residue positions, and the variants now escape binding and neutralization by monoclonal antibodies. This work represents a crucial initial stride towards achieving precise predictions of pathogen evolution, opening avenues for proactive mitigation.
Affiliation(s)
- Mireia Solà Colom
  - Institute for Protein Innovation; Boston, Massachusetts, 02115, USA
  - Division of Hematology/Oncology, Boston Children’s Hospital, Harvard Medical School; Boston, Massachusetts, 02115, USA
  - Current address: AI Proteins; Boston, Massachusetts, 02215, USA
- Jelena Vucinic
  - Université Fédérale de Toulouse; ANITI, IRIT-CNRS UMR 5505, Université Toulouse Capitole, 31000 Toulouse, France
- Jared Adolf-Bryfogle
  - Institute for Protein Innovation; Boston, Massachusetts, 02115, USA
  - Division of Hematology/Oncology, Boston Children’s Hospital, Harvard Medical School; Boston, Massachusetts, 02115, USA
- James W. Bowman
  - Institute for Protein Innovation; Boston, Massachusetts, 02115, USA
  - Division of Hematology/Oncology, Boston Children’s Hospital, Harvard Medical School; Boston, Massachusetts, 02115, USA
  - Current address: AI Proteins; Boston, Massachusetts, 02215, USA
- Sébastien Verel
  - Université Littoral Côte d’Opale; UR 4491, LISIC, F-62100 Calais, France
- Isabelle Moczygemba
  - Institute for Protein Innovation; Boston, Massachusetts, 02115, USA
  - Division of Hematology/Oncology, Boston Children’s Hospital, Harvard Medical School; Boston, Massachusetts, 02115, USA
  - Current address: AI Proteins; Boston, Massachusetts, 02215, USA
- Thomas Schiex
  - Université Fédérale de Toulouse; ANITI, INRAE-UR 875, 31000 Toulouse, France
- David Simoncini
  - Université Fédérale de Toulouse; ANITI, IRIT-CNRS UMR 5505, Université Toulouse Capitole, 31000 Toulouse, France
- Christopher D. Bahl
  - Institute for Protein Innovation; Boston, Massachusetts, 02115, USA
  - Division of Hematology/Oncology, Boston Children’s Hospital, Harvard Medical School; Boston, Massachusetts, 02115, USA
  - Current address: AI Proteins; Boston, Massachusetts, 02215, USA
2
Cohen H, Hoede C, Scharte F, Coluzzi C, Cohen E, Shomer I, Mallet L, Holbert S, Serre RF, Schiex T, Virlogeux-Payant I, Grassl GA, Hensel M, Chiapello H, Gal-Mor O. Intracellular Salmonella Paratyphi A is motile and differs in the expression of flagella-chemotaxis, SPI-1 and carbon utilization pathways in comparison to intracellular S. Typhimurium. PLoS Pathog 2022; 18:e1010425. [PMID: 35381053 PMCID: PMC9012535 DOI: 10.1371/journal.ppat.1010425] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/18/2021] [Revised: 04/15/2022] [Accepted: 03/09/2022] [Indexed: 12/21/2022] Open
Abstract
Although Salmonella Typhimurium (STM) and Salmonella Paratyphi A (SPA) belong to the same phylogenetic species, share large portions of their genome and express many common virulence factors, they differ vastly in their host specificity, the immune response they elicit, and the clinical manifestations they cause. In this work, we compared their intracellular transcriptomic architecture and cellular phenotypes during human epithelial cell infection. While transcription induction of many metal transport systems, purines, biotin, PhoPQ and SPI-2 regulons was similar in both intracellular SPA and STM, we identified 234 differentially expressed genes that showed distinct expression patterns in intracellular SPA vs. STM. Surprisingly, clear expression differences were found in SPI-1, motility and chemotaxis, and carbon (mainly citrate, galactonate and ethanolamine) utilization pathways, indicating that these pathways are regulated differently during their intracellular phase. Concurring, on the cellular level, we show that while the majority of STM are non-motile and reside within Salmonella-Containing Vacuoles (SCV), a significant proportion of intracellular SPA cells are motile and compartmentalized in the cytosol. Moreover, we found that the elevated expression of SPI-1 and motility genes by intracellular SPA results in increased invasiveness of SPA, following exit from host cells. These findings demonstrate unexpected flagellum-dependent intracellular motility of a typhoidal Salmonella serovar and intriguing differences in intracellular localization between typhoidal and non-typhoidal salmonellae. We propose that these differences facilitate new cycles of host cell infection by SPA and may contribute to the ability of SPA to disseminate beyond the intestinal lamina propria of the human host during enteric fever.
Salmonella enterica is a ubiquitous, facultative intracellular animal and human pathogen. Although non-typhoidal Salmonella (NTS) and typhoidal Salmonella serovars belong to the same phylogenetic species and share many virulence factors, the disease they cause in humans is very different. While the underlying mechanisms for these differences are not fully understood, one possible reason expected to contribute to their different pathogenicity is a distinct expression pattern of genes involved in host-pathogen interactions. Here, we compared the global gene expression and intracellular phenotypes, during human epithelial cell infection of S. Paratyphi A (SPA) and S. Typhimurium (STM), as prototypical serovars of typhoidal and NTS, respectively. Interestingly, we identified different expression patterns in key virulence and metabolic pathways, cytosolic motility and increased reinvasion of SPA, following exit from infected cells. We hypothesize that these differences contribute to the invasive and systemic disease developed following SPA infection in humans.
Affiliation(s)
- Helit Cohen
  - The Infectious Diseases Research Laboratory, Sheba Medical Center, Tel-Hashomer, Israel
- Claire Hoede
  - Université Fédérale de Toulouse, INRAE, BioinfOmics, UR MIAT, GenoToul Bioinformatics facility, 31326, Castanet-Tolosan, France
- Felix Scharte
  - Abt. Mikrobiologie, Universität Osnabrück, Osnabrück, Germany
- Charles Coluzzi
  - INRAE, Université Paris-Saclay, MaIAGE, Jouy-en-Josas, France
- Emiliano Cohen
  - The Infectious Diseases Research Laboratory, Sheba Medical Center, Tel-Hashomer, Israel
- Inna Shomer
  - The Infectious Diseases Research Laboratory, Sheba Medical Center, Tel-Hashomer, Israel
- Ludovic Mallet
  - Université Fédérale de Toulouse, INRAE, BioinfOmics, UR MIAT, GenoToul Bioinformatics facility, 31326, Castanet-Tolosan, France
- Thomas Schiex
  - Université Fédérale de Toulouse, ANITI, INRAE, Toulouse, France
- Guntram A. Grassl
  - Institute of Medical Microbiology and Hospital Epidemiology, Hannover Medical School and German Center for Infection Research (DZIF), Hanover, Germany
- Michael Hensel
  - Abt. Mikrobiologie, Universität Osnabrück, Osnabrück, Germany
  - CellNanOs–Center of Cellular Nanoanalytics Osnabrück, Universität Osnabrück, Osnabrück, Germany
  - Corresponding author
- Hélène Chiapello
  - Université Fédérale de Toulouse, INRAE, BioinfOmics, UR MIAT, GenoToul Bioinformatics facility, 31326, Castanet-Tolosan, France
  - INRAE, Université Paris-Saclay, MaIAGE, Jouy-en-Josas, France
  - Corresponding author
- Ohad Gal-Mor
  - The Infectious Diseases Research Laboratory, Sheba Medical Center, Tel-Hashomer, Israel
  - Department of Clinical Microbiology and Immunology, Faculty of Medicine, Tel-Aviv University, Tel-Aviv, Israel
  - Corresponding author
3
Bouchiba Y, Ruffini M, Schiex T, Barbe S. Computational Design of Miniprotein Binders. Methods Mol Biol 2022; 2405:361-382. [PMID: 35298822 DOI: 10.1007/978-1-0716-1855-4_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Miniprotein binders hold a great interest as a class of drugs that bridges the gap between monoclonal antibodies and small molecule drugs. Like monoclonal antibodies, they can be designed to bind to therapeutic targets with high affinity, but they are more stable and easier to produce and to administer. In this chapter, we present a structure-based computational generic approach for miniprotein inhibitor design. Specifically, we describe step-by-step the implementation of the approach for the design of miniprotein binders against the SARS-CoV-2 coronavirus, using available structural data on the SARS-CoV-2 spike receptor binding domain (RBD) in interaction with its native target, the human receptor ACE2. Structural data being increasingly accessible around many protein-protein interaction systems, this method might be applied to the design of miniprotein binders against numerous therapeutic targets. The computational pipeline exploits provable and deterministic artificial intelligence-based protein design methods, with some recent additions in terms of binding energy estimation, multistate design and diverse library generation.
Affiliation(s)
- Younes Bouchiba
  - TBI, Université de Toulouse, CNRS, INRAE, INSA, ANITI, Toulouse, France
- Manon Ruffini
  - TBI, Université de Toulouse, CNRS, INRAE, INSA, ANITI, Toulouse, France
  - Université Fédérale de Toulouse, ANITI, INRAE, UR 875, Toulouse, France
- Thomas Schiex
  - Université Fédérale de Toulouse, ANITI, INRAE, UR 875, Toulouse, France
- Sophie Barbe
  - TBI, Université de Toulouse, CNRS, INRAE, INSA, ANITI, Toulouse, France
4
Abstract
Computational Protein Design (CPD) has produced impressive results for engineering new proteins, resulting in a wide variety of applications. In the past few years, various efforts have aimed at replacing or improving existing design methods using Deep Learning technology to leverage the amount of publicly available protein data. Deep Learning (DL) is a very powerful tool to extract patterns from raw data, provided that data are formatted as mathematical objects and the architecture processing them is well suited to the targeted problem. In the case of protein data, specific representations are needed for both the amino acid sequence and the protein structure in order to capture respectively 1D and 3D information. As no consensus has been reached about the most suitable representations, this review describes the representations used so far, discusses their strengths and weaknesses, and details their associated DL architecture for design and related tasks.
Affiliation(s)
- Marianne Defresne
  - Toulouse Biotechnology Institute, Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31077 Toulouse, France
  - Université Fédérale de Toulouse, ANITI, INRAE, UR 875, 31326 Toulouse, France
- Sophie Barbe
  - Toulouse Biotechnology Institute, Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31077 Toulouse, France
- Thomas Schiex
  - Université Fédérale de Toulouse, ANITI, INRAE, UR 875, 31326 Toulouse, France
5
Yagi S, Padhi AK, Vucinic J, Barbe S, Schiex T, Nakagawa R, Simoncini D, Zhang KYJ, Tagami S. Seven Amino Acid Types Suffice to Create the Core Fold of RNA Polymerase. J Am Chem Soc 2021; 143:15998-16006. [PMID: 34559526 DOI: 10.1021/jacs.1c05367] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/16/2022]
Abstract
The extant complex proteins must have evolved from ancient short and simple ancestors. The double-ψ β-barrel (DPBB) is one of the oldest protein folds and conserved in various fundamental enzymes, such as the core domain of RNA polymerase. Here, by reverse engineering a modern DPBB domain, we reconstructed its plausible evolutionary pathway started by "interlacing homodimerization" of a half-size peptide, followed by gene duplication and fusion. Furthermore, by simplifying the amino acid repertoire of the peptide, we successfully created the DPBB fold with only seven amino acid types (Ala, Asp, Glu, Gly, Lys, Arg, and Val), which can be coded by only GNN and ARR (R = A or G) codons in the modern translation system. Thus, the DPBB fold could have been materialized by the early translation system and genetic code.
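The codon claim in this abstract can be checked against the standard genetic code. The short sketch below is an illustration added here, not code from the paper; the helper name `encoded_amino_acids` is hypothetical. It expands the GNN and ARR patterns (N = any base, R = A or G) and collects the amino acids they encode.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Degenerate symbols used in the abstract: N = any base, R = purine (A or G).
DEGENERATE = {"N": "UCAG", "R": "AG", "A": "A", "C": "C", "G": "G", "U": "U"}


def encoded_amino_acids(pattern):
    """Return the set of one-letter amino acids encoded by a codon pattern.

    Hypothetical helper for illustration only.
    """
    codons = ("".join(c) for c in product(*(DEGENERATE[s] for s in pattern)))
    return {CODON_TABLE[c] for c in codons}


reduced_alphabet = encoded_amino_acids("GNN") | encoded_amino_acids("ARR")
print(sorted(reduced_alphabet))
```

Under the standard code, GNN covers Val, Ala, Asp, Glu and Gly, and ARR covers Lys and Arg, which together give the seven amino acid types listed above.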
Affiliation(s)
- Sota Yagi
  - RIKEN Center for Biosystems Dynamics Research, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
- Aditya K Padhi
  - RIKEN Center for Biosystems Dynamics Research, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
- Jelena Vucinic
  - Université Fédérale de Toulouse, ANITI, INRAE-UR 875, 31000 Toulouse, France
  - TBI, Université Fédérale de Toulouse, CNRS, INRAE, INSA, ANITI, 31000 Toulouse, France
  - Université Fédérale de Toulouse, ANITI, IRIT-UMR 5505, 31000 Toulouse, France
- Sophie Barbe
  - TBI, Université Fédérale de Toulouse, CNRS, INRAE, INSA, ANITI, 31000 Toulouse, France
- Thomas Schiex
  - Université Fédérale de Toulouse, ANITI, INRAE-UR 875, 31000 Toulouse, France
- Reiko Nakagawa
  - RIKEN Center for Biosystems Dynamics Research, 2-2-3 Minatojima-minamimachi, Chuo-ku, Kobe, Hyogo 650-0047, Japan
- David Simoncini
  - Université Fédérale de Toulouse, ANITI, IRIT-UMR 5505, 31000 Toulouse, France
- Kam Y J Zhang
  - RIKEN Center for Biosystems Dynamics Research, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
- Shunsuke Tagami
  - RIKEN Center for Biosystems Dynamics Research, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
6
Beuvin F, de Givry S, Schiex T, Verel S, Simoncini D. Iterated local search with partition crossover for computational protein design. Proteins 2021; 89:1522-1529. [PMID: 34228826 DOI: 10.1002/prot.26174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2020] [Accepted: 05/25/2021] [Indexed: 11/06/2022]
Abstract
Structure-based computational protein design (CPD) refers to the problem of finding a sequence of amino acids which folds into a specific desired protein structure, and possibly fulfills some targeted biochemical properties. Recent studies point out the particularly rugged CPD energy landscape, suggesting that local search optimization methods should be designed and tuned to easily escape local minima attraction basins. In this article, we analyze the performance and search dynamics of an iterated local search (ILS) algorithm enhanced with partition crossover. Our algorithm, PILS, quickly finds local minima and escapes their basins of attraction by solution perturbation. Additionally, the partition crossover operator exploits the structure of the residue interaction graph in order to efficiently mix solutions and find new unexplored basins. Our results on a benchmark of 30 proteins of various topology and size show that PILS consistently finds lower energy solutions compared to Rosetta fixbb and a classic ILS, and that the corresponding sequences are mostly closer to the native.
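To make the search scheme concrete, here is a minimal sketch of a plain iterated local search (descend to a local minimum, perturb, descend again). It is not the PILS algorithm from the paper: the partition crossover operator is omitted, and `toy_energy`, the parameter names and their default values are assumptions chosen only for illustration.

```python
import random


def iterated_local_search(energy, n_positions, n_states, n_iters=200, kick=2, seed=0):
    """Plain ILS over discrete assignments (one state index per position)."""
    rng = random.Random(seed)

    def local_search(x):
        # Greedy descent: keep any single-position change that lowers the energy,
        # and repeat full passes until one pass brings no improvement.
        x = list(x)
        e = energy(x)
        improved = True
        while improved:
            improved = False
            for i in range(n_positions):
                best_state = x[i]
                for s in range(n_states):
                    if s == best_state:
                        continue
                    x[i] = s
                    e_new = energy(x)
                    if e_new < e:
                        e, best_state, improved = e_new, s, True
                x[i] = best_state
        return x, e

    start = [rng.randrange(n_states) for _ in range(n_positions)]
    best, best_e = local_search(start)
    for _ in range(n_iters):
        # Perturbation: randomize a few positions to leave the current basin.
        candidate = list(best)
        for i in rng.sample(range(n_positions), kick):
            candidate[i] = rng.randrange(n_states)
        candidate, cand_e = local_search(candidate)
        if cand_e <= best_e:
            best, best_e = candidate, cand_e
    return best, best_e


def toy_energy(x):
    # Hypothetical stand-in for a CPD energy: per-position terms plus a
    # penalty when neighbouring positions share the same state.
    unary = sum((3 * i + 5 * s) % 7 for i, s in enumerate(x))
    pair = sum(2 for a, b in zip(x, x[1:]) if a == b)
    return unary + pair


best, best_e = iterated_local_search(toy_energy, n_positions=8, n_states=4)
print(best, best_e)
```

In a real design setting the positions would be residues, the states would be amino acid or rotamer choices, and the energy would come from a force field. The acceptance rule used here (keep the new minimum when it is no worse) is one common choice among several.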
Affiliation(s)
- François Beuvin
  - IRIT UMR 5505-CNRS, Université de Toulouse I Capitole, Toulouse, France
  - Artificial and Natural Intelligence Toulouse Institute, ANITI, Toulouse, France
- Simon de Givry
  - Artificial and Natural Intelligence Toulouse Institute, ANITI, Toulouse, France
  - MIAT, Université de Toulouse, INRAE, UR 875, Toulouse, France
- Thomas Schiex
  - Artificial and Natural Intelligence Toulouse Institute, ANITI, Toulouse, France
  - MIAT, Université de Toulouse, INRAE, UR 875, Toulouse, France
- David Simoncini
  - IRIT UMR 5505-CNRS, Université de Toulouse I Capitole, Toulouse, France
  - Artificial and Natural Intelligence Toulouse Institute, ANITI, Toulouse, France
7
Vucinic J, Novikov G, Montanier CY, Dumon C, Schiex T, Barbe S. A Comparative Study to Decipher the Structural and Dynamics Determinants Underlying the Activity and Thermal Stability of GH-11 Xylanases. Int J Mol Sci 2021; 22:ijms22115961. [PMID: 34073139 PMCID: PMC8199483 DOI: 10.3390/ijms22115961] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/13/2021] [Revised: 05/27/2021] [Accepted: 05/28/2021] [Indexed: 11/23/2022] Open
Abstract
With the growing need for renewable sources of energy, interest in enzymes capable of biomass degradation has been increasing. In this paper, we consider two different xylanases from the GH-11 family: the particularly active GH-11 xylanase from Neocallimastix patriciarum, NpXyn11A, and the hyper-thermostable mutant of the environmentally isolated GH-11 xylanase, EvXyn11TS. Our aim is to identify the molecular determinants underlying the enhanced capacities of these two enzymes so that, ultimately, the abilities of one can be grafted onto the other. Molecular dynamics simulations of the respective free enzymes and enzyme–xylohexaose complexes were carried out at temperatures of 300, 340, and 500 K. An in-depth analysis of these MD simulations showed how differences in dynamics influence the activity and stability of these two enzymes and allowed us to study and understand in greater depth the molecular and structural basis of these two systems. In light of the results presented in this paper, the thumb region and the larger substrate binding cleft of NpXyn11A seem to play a major role in the activity of this enzyme. Its lower thermal stability may instead be caused by the higher flexibility of certain regions located further from the active site. Regions such as the N-terminus, the loops located in the fingers region, the palm loop, and the helix loop seem to be less stable than in the hyper-thermostable EvXyn11TS. By identifying molecular regions that are critical for the stability of these enzymes, this study allowed us to identify promising targets for engineering GH-11 xylanases. Finally, we identify NpXyn11A as the ideal host for grafting the thermostabilizing traits of EvXyn11TS.
Affiliation(s)
- Jelena Vucinic
  - Toulouse Biotechnology Institute (TBI), Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31400 Toulouse, France
  - Université Fédérale de Toulouse, ANITI, INRAE, UR 875, 31326 Toulouse, France
- Gleb Novikov
  - Toulouse Biotechnology Institute (TBI), Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31400 Toulouse, France
- Cédric Y. Montanier
  - Toulouse Biotechnology Institute (TBI), Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31400 Toulouse, France
- Claire Dumon
  - Toulouse Biotechnology Institute (TBI), Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31400 Toulouse, France
- Thomas Schiex
  - Université Fédérale de Toulouse, ANITI, INRAE, UR 875, 31326 Toulouse, France
- Sophie Barbe
  - Toulouse Biotechnology Institute (TBI), Université de Toulouse, CNRS, INRAE, INSA, ANITI, 31400 Toulouse, France
  - Corresponding author
8
Bouchiba Y, Cortés J, Schiex T, Barbe S. Molecular flexibility in computational protein design: an algorithmic perspective. Protein Eng Des Sel 2021; 34:6271252. [PMID: 33959778 DOI: 10.1093/protein/gzab011] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/21/2020] [Revised: 03/12/2021] [Accepted: 03/29/2021] [Indexed: 12/19/2022] Open
Abstract
Computational protein design (CPD) is a powerful technique for engineering new proteins, with both great fundamental implications and diverse practical interests. However, the approximations usually made for computational efficiency, using a single fixed backbone and a discrete set of side chain rotamers, tend to produce rigid and hyper-stable folds that may lack functionality. These approximations contrast with the demonstrated importance of molecular flexibility and motions in a wide range of protein functions. The integration of backbone flexibility and multiple conformational states in CPD, in order to relieve the inaccuracies resulting from these simplifications and to improve design reliability, is attracting increased attention. However, the greatly increased search space that needs to be explored in these extensions defines extremely challenging computational problems. In this review, we outline the principles of CPD and discuss recent algorithmic developments for incorporating molecular flexibility in the design process.
Affiliation(s)
- Younes Bouchiba
  - Toulouse Biotechnology Institute, TBI, CNRS, INRAE, INSA, ANITI, Toulouse 31400, France
  - Laboratoire d'Analyse et d'Architecture des Systèmes, LAAS CNRS, Université de Toulouse, CNRS, Toulouse 31400, France
- Juan Cortés
  - Laboratoire d'Analyse et d'Architecture des Systèmes, LAAS CNRS, Université de Toulouse, CNRS, Toulouse 31400, France
- Thomas Schiex
  - Université de Toulouse, ANITI, INRAE, UR MIAT, F-31320, Castanet-Tolosan, France
- Sophie Barbe
  - Toulouse Biotechnology Institute, TBI, CNRS, INRAE, INSA, ANITI, Toulouse 31400, France
9
Vucinic J, Simoncini D, Ruffini M, Barbe S, Schiex T. Positive multistate protein design. Bioinformatics 2019; 36:122-130. [DOI: 10.1093/bioinformatics/btz497] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 02/15/2019] [Revised: 05/20/2019] [Accepted: 06/11/2019] [Indexed: 11/12/2022] Open
Abstract
Motivation
Structure-based computational protein design (CPD) plays a critical role in advancing the field of protein engineering. Using an all-atom energy function, CPD tries to identify amino acid sequences that fold into a target structure and ultimately perform a desired function. The usual approach considers a single rigid backbone as a target, which ignores backbone flexibility. Multistate design (MSD) instead considers several backbone states simultaneously, which defines challenging computational problems.
Results
We introduce efficient reductions of positive MSD problems to Cost Function Networks with two different fitness definitions and implement them in the Pompd (Positive Multistate Protein design) software. Pompd is able to identify guaranteed optimal sequences of positive multistate full protein redesign problems and exhaustively enumerate suboptimal sequences close to the MSD optimum. Applied to nuclear magnetic resonance and back-rubbed X-ray structures, we observe that the average energy fitness provides the best sequence recovery. Our method outperforms state-of-the-art guaranteed computational design approaches by orders of magnitude and can solve MSD problems with sizes previously unreachable with guaranteed algorithms.
Availability and implementation
https://forgemia.inra.fr/thomas.schiex/pompd as documented Open Source.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jelena Vucinic
  - LISBP, Université de Toulouse, CNRS, INRA, INSA, 31400 Toulouse, France
  - MIAT, Université de Toulouse, INRA, 31326 Castanet-Tolosan Cedex, France
- David Simoncini
  - LISBP, Université de Toulouse, CNRS, INRA, INSA, 31400 Toulouse, France
  - IRIT UMR 5505-CNRS, Université de Toulouse, 31042 Cedex 9, France
- Manon Ruffini
  - LISBP, Université de Toulouse, CNRS, INRA, INSA, 31400 Toulouse, France
  - MIAT, Université de Toulouse, INRA, 31326 Castanet-Tolosan Cedex, France
- Sophie Barbe
  - LISBP, Université de Toulouse, CNRS, INRA, INSA, 31400 Toulouse, France
- Thomas Schiex
  - MIAT, Université de Toulouse, INRA, 31326 Castanet-Tolosan Cedex, France
10
Peyrard N, Cros M, Givry S, Franc A, Robin S, Sabbadin R, Schiex T, Vignes M. Exact or approximate inference in graphical models: why the choice is dictated by the treewidth, and how variable elimination can be exploited. Aust N Z J Stat 2019. [DOI: 10.1111/anzs.12257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Affiliation(s)
- N. Peyrard
  - INRA, UR 875 MIAT, Chemin de Borde Rouge, 31326 Castanet‐Tolosan, France
- M.‐J. Cros
  - INRA, UR 875 MIAT, Chemin de Borde Rouge, 31326 Castanet‐Tolosan, France
- S. Givry
  - INRA, UR 875 MIAT, Chemin de Borde Rouge, 31326 Castanet‐Tolosan, France
- A. Franc
  - INRA, UMR 1202 Biodiversité, Gènes et Communautés, 69, route d'Arcachon, Pierroton, 33612 Cestas Cedex, France
- S. Robin
  - AgroParisTech, UMR 518 MIA, 16 rue Claude Bernard, Paris 5e, France
  - INRA, UMR 518 MIA, 16 rue Claude Bernard, Paris 5e, France
- R. Sabbadin
  - INRA, UR 875 MIAT, Chemin de Borde Rouge, 31326 Castanet‐Tolosan, France
- T. Schiex
  - INRA, UR 875 MIAT, Chemin de Borde Rouge, 31326 Castanet‐Tolosan, France
- M. Vignes
  - Institute of Fundamental Sciences, Massey University, Palmerston North, New Zealand
11
Noguchi H, Addy C, Simoncini D, Wouters S, Mylemans B, Van Meervelt L, Schiex T, Zhang KYJ, Tame JRH, Voet ARD. Computational design of symmetrical eight-bladed β-propeller proteins. IUCrJ 2019; 6:46-55. [PMID: 30713702 PMCID: PMC6327176 DOI: 10.1107/s205225251801480x] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 05/23/2018] [Accepted: 10/19/2018] [Indexed: 05/04/2023]
Abstract
β-Propeller proteins form one of the largest families of protein structures, with a pseudo-symmetrical fold made up of subdomains called blades. They are not only abundant but are also involved in a wide variety of cellular processes, often by acting as a platform for the assembly of protein complexes. WD40 proteins are a subfamily of propeller proteins with no intrinsic enzymatic activity, but their stable, modular architecture and versatile surface have allowed evolution to adapt them to many vital roles. By computationally reverse-engineering the duplication, fusion and diversification events in the evolutionary history of a WD40 protein, a perfectly symmetrical homologue called Tako8 was made. If two or four blades of Tako8 are expressed as single polypeptides, they do not self-assemble to complete the eight-bladed architecture, which may be owing to the closely spaced negative charges inside the ring. A different computational approach was employed to redesign Tako8 to create Ika8, a fourfold-symmetrical protein in which neighbouring blades carry compensating charges. Ika2 and Ika4, carrying two or four blades per subunit, respectively, were found to assemble spontaneously into a complete eight-bladed ring in solution. These artificial eight-bladed rings may find applications in bionanotechnology and as models to study the folding and evolution of WD40 proteins.
Affiliation(s)
- Hiroki Noguchi
  - Laboratory of Biomolecular Modelling and Design, Department of Chemistry, KU Leuven, Celestijnenlaan 200G, 3001 Leuven, Belgium
- Christine Addy
  - Graduate School of Medical Life Science, Yokohama City University, 1-7-29 Suehiro, Yokohama, Kanagawa 230-0045, Japan
- David Simoncini
  - MIAT, Université de Toulouse, INRA, Castanet-Tolosan, France
- Staf Wouters
  - Laboratory of Biomolecular Modelling and Design, Department of Chemistry, KU Leuven, Celestijnenlaan 200G, 3001 Leuven, Belgium
- Bram Mylemans
  - Laboratory of Biomolecular Modelling and Design, Department of Chemistry, KU Leuven, Celestijnenlaan 200G, 3001 Leuven, Belgium
- Luc Van Meervelt
  - Laboratory of Biomolecular Architecture, Department of Chemistry, KU Leuven, Celestijnenlaan 200F, 3001 Leuven, Belgium
- Thomas Schiex
  - MIAT, Université de Toulouse, INRA, Castanet-Tolosan, France
- Kam Y. J. Zhang
  - Laboratory for Structural Bioinformatics, Center for Biosystems Dynamics Research, RIKEN, 1-7-22 Suehiro, Yokohama, Kanagawa 230-0045, Japan
- Jeremy R. H. Tame
  - Graduate School of Medical Life Science, Yokohama City University, 1-7-29 Suehiro, Yokohama, Kanagawa 230-0045, Japan
  - Corresponding author
- Arnout R. D. Voet
  - Laboratory of Biomolecular Modelling and Design, Department of Chemistry, KU Leuven, Celestijnenlaan 200G, 3001 Leuven, Belgium
  - Corresponding author
12
Simoncini D, Zhang KYJ, Schiex T, Barbe S. A structural homology approach for computational protein design with flexible backbone. Bioinformatics 2018; 35:2418-2426. [DOI: 10.1093/bioinformatics/bty975] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 04/24/2018] [Revised: 11/01/2018] [Accepted: 11/28/2018] [Indexed: 01/09/2023] Open
Abstract
Motivation
Structure-based computational protein design (CPD) plays a critical role in advancing the field of protein engineering. Using an all-atom energy function, CPD tries to identify amino acid sequences that fold into a target structure and ultimately perform a desired function. Energy functions remain imperfect, however, and injecting relevant information from known structures into the design process should lead to improved designs.
Results
We introduce Shades, a data-driven CPD method that exploits local structural environments in known protein structures together with energy to guide sequence design, while sampling side-chain and backbone conformations to accommodate mutations. Shades (Structural Homology Algorithm for protein DESign) is based on customized libraries of non-contiguous in-contact amino acid residue motifs. We tested Shades on a public benchmark of 40 proteins selected from different protein families. When homologous proteins were excluded, Shades achieved a protein sequence recovery of 30% and a protein sequence similarity of 46% on average, compared with the PFAM protein family of the target protein. When homologous structures were added, the wild-type sequence recovery rate reached 93%.
Availability and implementation
Shades source code is available at https://bitbucket.org/satsumaimo/shades as a patch for Rosetta 3.8 with a curated protein structure database and ITEM library creation software.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- David Simoncini
  - Laboratoire d'Ingénierie des Systèmes Biologiques et des Procédés, LISBP, Université de Toulouse, CNRS, INRA, INSA, F Toulouse cedex 04, France
  - Institut de recherche en informatique de Toulouse, IRIT, UMR 5505-CNRS, Université de Toulouse, Cedex 9, France
- Kam Y J Zhang
  - Laboratory for Structural Bioinformatics, Center for Biosystems Dynamics Research, RIKEN, Yokohama, Kanagawa, Japan
- Thomas Schiex
  - Institut de recherche en informatique de Toulouse, UMR 5505-CNRS, Université de Toulouse, Cedex 9, France
- Sophie Barbe
  - Laboratoire d'Ingénierie des Systèmes Biologiques et des Procédés, LISBP, Université de Toulouse, CNRS, INRA, INSA, F Toulouse cedex 04, France
13
Charpentier A, Mignon D, Barbe S, Cortes J, Schiex T, Simonson T, Allouche D. Variable Neighborhood Search with Cost Function Networks To Solve Large Computational Protein Design Problems. J Chem Inf Model 2018; 59:127-136. [DOI: 10.1021/acs.jcim.8b00510] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/30/2022]
Affiliation(s)
- David Mignon
  - Laboratoire de Biochimie (CNRS UMR 7654), École Polytechnique, 91128 Palaiseau, France
- Sophie Barbe
  - Laboratoire d’Ingénierie des Systèmes Biologiques et Procédés, LISBP, Université de Toulouse, CNRS, INRA, INSA, 31077 Toulouse, France
- Juan Cortes
  - LAAS-CNRS, Université de Toulouse, CNRS, 31400 Toulouse, France
- Thomas Schiex
  - MIAT, Université de Toulouse, INRA, 31326 Castanet-Tolosan, France
- Thomas Simonson
  - Laboratoire de Biochimie (CNRS UMR 7654), École Polytechnique, 91128 Palaiseau, France
- David Allouche
  - MIAT, Université de Toulouse, INRA, 31326 Castanet-Tolosan, France
14
Viricel C, de Givry S, Schiex T, Barbe S. Cost function network-based design of protein–protein interactions: predicting changes in binding affinity. Bioinformatics 2018; 34:2581-2589. [DOI: 10.1093/bioinformatics/bty092] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 08/02/2017] [Accepted: 02/16/2018] [Indexed: 11/14/2022] Open
Affiliation(s)
- Clément Viricel
  - Laboratoire d’Ingénierie des Systèmes Biologiques et des Procédés, Université de Toulouse, CNRS, INRA, INSA, Toulouse, France
  - Unité de Mathématiques et Informatique Appliquées de Toulouse, INRA, Castanet Tolosan cedex, France
- Simon de Givry
  - Unité de Mathématiques et Informatique Appliquées de Toulouse, INRA, Castanet Tolosan cedex, France
- Thomas Schiex
  - Unité de Mathématiques et Informatique Appliquées de Toulouse, INRA, Castanet Tolosan cedex, France
- Sophie Barbe
  - Laboratoire d’Ingénierie des Systèmes Biologiques et des Procédés, Université de Toulouse, CNRS, INRA, INSA, Toulouse, France
15
Simoncini D, Schiex T, Zhang KYJ. Balancing exploration and exploitation in population-based sampling improves fragment-based de novo protein structure prediction. Proteins 2017; 85:852-858. [PMID: 28066917 DOI: 10.1002/prot.25244] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 10/13/2016] [Revised: 11/29/2016] [Accepted: 12/18/2016] [Indexed: 01/17/2023]
Abstract
Conformational search space exploration remains a major bottleneck for protein structure prediction methods. Population-based meta-heuristics typically make it possible to control the search dynamics and to tune the balance between local energy minimization and search space exploration. EdaFold is a fragment-based approach that can guide search by periodically updating the probability distribution over the fragment libraries used during model assembly. We implement the EdaFold algorithm as a Rosetta protocol and provide two different probability update policies: a cluster-based variation (EdaRosec) and an energy-based one (EdaRoseen). We analyze the search dynamics of our new Rosetta protocols and show that EdaRosec is able to provide predictions with lower Cα RMSD to the native structure than EdaRoseen and the Rosetta AbInitio Relax protocol. Our software is freely available as a C++ patch for the Rosetta suite and can be downloaded from http://www.riken.jp/zhangiru/software/. Our protocols can easily be extended to create alternative probability update policies and generate new search dynamics.
Affiliation(s)
- David Simoncini
  - INRA MIAT, UR 875, Castanet-Tolosan Cedex, 31326, France
  - Structural Bioinformatics Team, Division of Structural and Synthetic Biology, Center for Life Science Technologies, RIKEN, 1-7-22 Suehiro, Yokohama, Kanagawa, 230-0045, Japan
- Thomas Schiex
  - INRA MIAT, UR 875, Castanet-Tolosan Cedex, 31326, France
- Kam Y J Zhang
  - Structural Bioinformatics Team, Division of Structural and Synthetic Biology, Center for Life Science Technologies, RIKEN, 1-7-22 Suehiro, Yokohama, Kanagawa, 230-0045, Japan
16
Abstract
One main challenge in Computational Protein Design (CPD) lies in the exploration of the amino-acid sequence space, while considering, to some extent, side chain flexibility. The exorbitant size of the search space calls for the development of efficient exact deterministic search methods enabling identification of low-energy sequence-conformation models, corresponding either to the global minimum energy conformation (GMEC) or an ensemble of guaranteed near-optimal solutions. In contrast to stochastic local search methods, which are not guaranteed to find the GMEC, exact deterministic approaches always identify the GMEC and prove its optimality in finite but exponential worst-case time. After a brief overview of these two classes of methods, we discuss the grounds and merits of four deterministic methods that have been applied to solve CPD problems. These approaches are based on the Dead-End-Elimination theorem combined with the A* algorithm (DEE/A*), on Cost Function Network algorithms (CFN), on Integer Linear Programming solvers (ILP) or on Markov Random Field solvers (MRF). This review details how two of these methods (DEE/A* and CFN) can be used in practice to identify low-energy sequence-conformation models starting from a pairwise decomposed energy matrix.
Affiliation(s)
- Seydou Traoré
  - INSA, UPS, INP, Université de Toulouse, 135 Avenue de Rangueil, 31077, Toulouse, France
  - Laboratoire d'Ingénierie des Systèmes Biologiques et des Procédés - INSA, INRA, UMR792, 31400, Toulouse, France
  - CNRS, UMR5504, 31400, Toulouse, France
- David Allouche
  - Unité de Mathématiques et Informatique de Toulouse, UR 875, INRA, 31320, Castanet Tolosan, France
- Isabelle André
  - INSA, UPS, INP, Université de Toulouse, 135 Avenue de Rangueil, 31077, Toulouse, France
  - Laboratoire d'Ingénierie des Systèmes Biologiques et des Procédés - INSA, INRA, UMR792, 31400, Toulouse, France
  - CNRS, UMR5504, 31400, Toulouse, France
- Thomas Schiex
  - Unité de Mathématiques et Informatique de Toulouse, UR 875, INRA, 31320, Castanet Tolosan, France
- Sophie Barbe
  - INSA, UPS, INP, Université de Toulouse, 135 Avenue de Rangueil, 31077, Toulouse, France
  - Laboratoire d'Ingénierie des Systèmes Biologiques et des Procédés - INSA, INRA, UMR792, 31400, Toulouse, France
  - CNRS, UMR5504, 31400, Toulouse, France
17
Allouche D, Bessiere C, Boizumault P, de Givry S, Gutierrez P, Lee JH, Leung KL, Loudni S, Métivier JP, Schiex T, Wu Y. Tractability-preserving transformations of global cost functions. Artif Intell 2016. [DOI: 10.1016/j.artint.2016.06.005] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
18
Traoré S, Roberts KE, Allouche D, Donald BR, André I, Schiex T, Barbe S. Fast search algorithms for computational protein design. J Comput Chem 2016; 37:1048-58. [PMID: 26833706 PMCID: PMC4828276 DOI: 10.1002/jcc.24290] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 04/14/2015] [Revised: 09/23/2015] [Accepted: 11/27/2015] [Indexed: 12/12/2022]
Abstract
One of the main challenges in computational protein design (CPD) is the huge size of the protein sequence and conformational space that has to be computationally explored. Recently, we showed that state-of-the-art combinatorial optimization technologies based on Cost Function Network (CFN) processing can speed up provable rigid-backbone protein design methods by several orders of magnitude. Building on this, we improved and injected CFN technology into the well-established CPD package Osprey so that all Osprey CPD algorithms benefit from the associated speedups. Because Osprey fundamentally relies on the ability of A* to produce conformations in increasing order of energy, we defined new A* strategies combining CFN lower bounds with a new side-chain positioning-based branching scheme. Beyond the speedups obtained with the new A*-CFN combination, this novel branching scheme enables a much faster enumeration of suboptimal sequences, far beyond what is reachable without it. Together with the immediate and important speedups provided by CFN technology, these developments directly benefit all the algorithms that previously relied on the DEE/A* combination inside Osprey and make it possible to solve larger CPD problems with provable algorithms.
Affiliation(s)
- Seydou Traoré
  - Université de Toulouse; INSA, UPS, INP; LISBP, 135 Avenue de Rangueil, F-31077 Toulouse, France
  - INRA, UMR792, Ingénierie des Systèmes Biologiques et des Procédés, F-31400 Toulouse, France
  - CNRS, UMR5504, F-31400 Toulouse, France
- Kyle E. Roberts
  - Department of Biochemistry, Department of Computer Science, Department of Chemistry, Duke University, Durham, NC, USA
- David Allouche
  - Unité de Mathématiques et Informatique Appliquées de Toulouse, UR 875, INRA, F-31320 Castanet Tolosan, France
- Bruce R. Donald
  - Department of Biochemistry, Department of Computer Science, Department of Chemistry, Duke University, Durham, NC, USA
- Isabelle André
  - Université de Toulouse; INSA, UPS, INP; LISBP, 135 Avenue de Rangueil, F-31077 Toulouse, France
  - INRA, UMR792, Ingénierie des Systèmes Biologiques et des Procédés, F-31400 Toulouse, France
  - CNRS, UMR5504, F-31400 Toulouse, France
- Thomas Schiex
  - Unité de Mathématiques et Informatique Appliquées de Toulouse, UR 875, INRA, F-31320 Castanet Tolosan, France
- Sophie Barbe
  - Université de Toulouse; INSA, UPS, INP; LISBP, 135 Avenue de Rangueil, F-31077 Toulouse, France
  - INRA, UMR792, Ingénierie des Systèmes Biologiques et des Procédés, F-31400 Toulouse, France
  - CNRS, UMR5504, F-31400 Toulouse, France
19
Simoncini D, Allouche D, de Givry S, Delmas C, Barbe S, Schiex T. Guaranteed Discrete Energy Optimization on Large Protein Design Problems. J Chem Theory Comput 2015; 11:5980-9. [DOI: 10.1021/acs.jctc.5b00594] [Citation(s) in RCA: 49] [Impact Index Per Article: 5.4] [Indexed: 11/28/2022]
Affiliation(s)
- David Allouche
  - INRA MIAT, UR 875, Castanet-Tolosan, 31326 Cedex, France
- Simon de Givry
  - INRA MIAT, UR 875, Castanet-Tolosan, 31326 Cedex, France
- Céline Delmas
  - INRA MIAT, UR 875, Castanet-Tolosan, 31326 Cedex, France
- Sophie Barbe
  - Université de Toulouse; INSA, UPS, INP; LISBP, 135 Avenue de Rangueil, F-31077 Toulouse, France
  - CNRS, UMR5504, F-31400 Toulouse, France
  - INRA, UMR792 Ingénierie des Systèmes Biologiques et des Procédés, F-31400 Toulouse, France
- Thomas Schiex
  - INRA MIAT, UR 875, Castanet-Tolosan, 31326 Cedex, France
20
Allouche D, André I, Barbe S, Davies J, de Givry S, Katsirelos G, O'Sullivan B, Prestwich S, Schiex T, Traoré S. Computational protein design as an optimization problem. Artif Intell 2014. [DOI: 10.1016/j.artint.2014.03.005] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 12/23/2022]
21
Abstract
It is now easy and increasingly usual to produce oriented RNA-Seq data as a prokaryotic genome is being sequenced. However, this information is usually used only for expression quantification. EuGene-PP is a fully automated pipeline for structural annotation of prokaryotic genomes. It integrates protein similarities, statistical information and any oriented expression information (RNA-Seq or tiling arrays) through a variety of file formats to produce a qualitatively enriched annotation that includes coding regions as well as (possibly antisense) non-coding genes and transcription start sites. Availability and implementation: EuGene-PP is open-source software based on EuGene-P that integrates a Galaxy configuration. EuGene-PP can be downloaded at eugene.toulouse.inra.fr.
Affiliation(s)
- Erika Sallet
  - Laboratoire Interactions Plantes Micro-organismes (LIPM) UMR441/2594, INRA/CNRS, F-31320 and INRA, Unité de Mathématiques et Informatique Appliquées de Toulouse, UR 875, Castanet-Tolosan F-31326, France
- Jérôme Gouzy
  - Laboratoire Interactions Plantes Micro-organismes (LIPM) UMR441/2594, INRA/CNRS, F-31320 and INRA, Unité de Mathématiques et Informatique Appliquées de Toulouse, UR 875, Castanet-Tolosan F-31326, France
- Thomas Schiex
  - Laboratoire Interactions Plantes Micro-organismes (LIPM) UMR441/2594, INRA/CNRS, F-31320 and INRA, Unité de Mathématiques et Informatique Appliquées de Toulouse, UR 875, Castanet-Tolosan F-31326, France
22
Traoré S, Allouche D, André I, de Givry S, Katsirelos G, Schiex T, Barbe S. A new framework for computational protein design through cost function network optimization. Bioinformatics 2013; 29:2129-36. [DOI: 10.1093/bioinformatics/btt374] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.9] [Indexed: 11/13/2022] Open
23
Sallet E, Roux B, Sauviac L, Jardinaud MF, Carrère S, Faraut T, de Carvalho-Niebel F, Gouzy J, Gamas P, Capela D, Bruand C, Schiex T. Next-generation annotation of prokaryotic genomes with EuGene-P: application to Sinorhizobium meliloti 2011. DNA Res 2013; 20:339-54. [PMID: 23599422 PMCID: PMC3738161 DOI: 10.1093/dnares/dst014] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.1] [Indexed: 01/02/2023] Open
Abstract
The availability of next-generation sequences of transcripts from prokaryotic organisms offers the opportunity to design a new generation of automated genome annotation tools not yet available for prokaryotes. In this work, we designed EuGene-P, the first integrative prokaryotic gene finder tool which combines a variety of high-throughput data, including oriented RNA-Seq data, directly into the prediction process. This enables the automated prediction of coding sequences (CDSs), untranslated regions, transcription start sites (TSSs) and non-coding RNA (ncRNA, sense and antisense) genes. EuGene-P was used to comprehensively and accurately annotate the genome of the nitrogen-fixing bacterium Sinorhizobium meliloti strain 2011, leading to the prediction of 6308 CDSs as well as 1876 ncRNAs. Among them, 1280 appeared as antisense to a CDS, which supports recent findings that antisense transcription activity is widespread in bacteria. Moreover, 4077 TSSs upstream of protein-coding or non-coding genes were precisely mapped providing valuable data for the study of promoter regions. By looking for RpoE2-binding sites upstream of annotated TSSs, we were able to extend the S. meliloti RpoE2 regulon by ∼3-fold. Altogether, these observations demonstrate the power of EuGene-P to produce a reliable and high-resolution automatic annotation of prokaryotic genomes.
Affiliation(s)
- Erika Sallet
  - INRA, Laboratoire des Interactions Plantes-Microorganismes-LIPM, UMR 441, Castanet-Tolosan F-31326, France
24
Abstract
Background: Detecting duplication segments within completely sequenced genomes provides valuable information to address genome evolution and in particular the important question of the emergence of novel functions. The usual approach to gene duplication detection, based on all-pairs protein gene comparisons, provides only a restricted view of duplication. Results: In this paper, we introduce ReD Tandem, software that uses a flow-based chaining algorithm to detect tandem duplication arrays of moderate to longer length regions, with possibly locally weak similarities, directly at the DNA level. On the A. thaliana genome, using a reference set of tandem duplicated genes built using TAIR, we show that ReD Tandem is able to predict a large fraction of recently duplicated genes (dS < 1) and that it is also able to predict tandem duplications involving non-coding elements such as pseudo-genes or RNA genes. Conclusions: ReD Tandem makes it possible to identify large tandem duplications without any annotation, leading to agnostic identification of tandem duplications. This approach nicely complements the usual protein gene-based approach, which ignores duplications involving non-coding regions. It is however inherently restricted to relatively recent duplications. By recovering otherwise ignored events, ReD Tandem gives a more comprehensive view of existing evolutionary processes and may also help improve existing annotations.
Affiliation(s)
- Eric Audemard
  - Unité de Biométrie et Intelligence Artificielle, UR 875, INRA, Toulouse, France.
25
Allouche D, Traoré S, André I, de Givry S, Katsirelos G, Barbe S, Schiex T. Computational Protein Design as a Cost Function Network Optimization Problem. Lecture Notes in Computer Science 2012. [DOI: 10.1007/978-3-642-33558-7_60] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/23/2022]
26
Vignes M, Vandel J, Allouche D, Ramadan-Alban N, Cierco-Ayrolles C, Schiex T, Mangin B, de Givry S. Gene regulatory network reconstruction using Bayesian networks, the Dantzig Selector, the Lasso and their meta-analysis. PLoS One 2011; 6:e29165. [PMID: 22216195 PMCID: PMC3246469 DOI: 10.1371/journal.pone.0029165] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.7] [Received: 07/01/2011] [Accepted: 11/22/2011] [Indexed: 11/18/2022] Open
Abstract
Modern technologies, and especially next-generation sequencing facilities, are giving cheaper access to genotype and genomic data measured on the same sample at once. This creates an ideal situation for multifactorial experiments designed to infer gene regulatory networks. The fifth "Dialogue for Reverse Engineering Assessments and Methods" (DREAM5) challenges are aimed at assessing methods and associated algorithms devoted to the inference of biological networks. Challenge 3 on "Systems Genetics" proposed to infer causal gene regulatory networks from different genetical genomics data sets. We investigated a wide panel of methods ranging from Bayesian networks to penalised linear regressions to analyse such data, and proposed a simple yet very powerful meta-analysis, which combines these inference methods. We present results of the Challenge as well as a more in-depth analysis of predicted networks in terms of structure and reliability. The developed meta-analysis was ranked first among the 16 teams participating in Challenge 3A. It paves the way for future extensions of our inference method and more accurate gene network estimates in the context of genetical genomics.
Affiliation(s)
- Matthieu Vignes
  - SaAB Team/BIA Unit, INRA Toulouse, Castanet-Tolosan, France.
27
Young ND, Debellé F, Oldroyd GED, Geurts R, Cannon SB, Udvardi MK, Benedito VA, Mayer KFX, Gouzy J, Schoof H, Van de Peer Y, Proost S, Cook DR, Meyers BC, Spannagl M, Cheung F, De Mita S, Krishnakumar V, Gundlach H, Zhou S, Mudge J, Bharti AK, Murray JD, Naoumkina MA, Rosen B, Silverstein KAT, Tang H, Rombauts S, Zhao PX, Zhou P, Barbe V, Bardou P, Bechner M, Bellec A, Berger A, Bergès H, Bidwell S, Bisseling T, Choisne N, Couloux A, Denny R, Deshpande S, Dai X, Doyle JJ, Dudez AM, Farmer AD, Fouteau S, Franken C, Gibelin C, Gish J, Goldstein S, González AJ, Green PJ, Hallab A, Hartog M, Hua A, Humphray SJ, Jeong DH, Jing Y, Jöcker A, Kenton SM, Kim DJ, Klee K, Lai H, Lang C, Lin S, Macmil SL, Magdelenat G, Matthews L, McCorrison J, Monaghan EL, Mun JH, Najar FZ, Nicholson C, Noirot C, O'Bleness M, Paule CR, Poulain J, Prion F, Qin B, Qu C, Retzel EF, Riddle C, Sallet E, Samain S, Samson N, Sanders I, Saurat O, Scarpelli C, Schiex T, Segurens B, Severin AJ, Sherrier DJ, Shi R, Sims S, Singer SR, Sinharoy S, Sterck L, Viollet A, Wang BB, Wang K, Wang M, Wang X, Warfsmann J, Weissenbach J, White DD, White JD, Wiley GB, Wincker P, Xing Y, Yang L, Yao Z, Ying F, Zhai J, Zhou L, Zuber A, Dénarié J, Dixon RA, May GD, Schwartz DC, Rogers J, Quétier F, Town CD, Roe BA. The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature 2011; 480:520-4. [PMID: 22089132 PMCID: PMC3272368 DOI: 10.1038/nature10625] [Citation(s) in RCA: 762] [Impact Index Per Article: 58.6] [Received: 06/13/2011] [Accepted: 10/13/2011] [Indexed: 11/09/2022]
Abstract
Legumes (Fabaceae or Leguminosae) are unique among cultivated plants for their ability to carry out endosymbiotic nitrogen fixation with rhizobial bacteria, a process that takes place in a specialized structure known as the nodule. Legumes belong to one of the two main groups of eurosids, the Fabidae, which includes most species capable of endosymbiotic nitrogen fixation. Legumes comprise several evolutionary lineages derived from a common ancestor 60 million years ago (Myr ago). Papilionoids are the largest clade, dating nearly to the origin of legumes and containing most cultivated species. Medicago truncatula is a long-established model for the study of legume biology. Here we describe the draft sequence of the M. truncatula euchromatin based on a recently completed BAC assembly supplemented with Illumina shotgun sequence, together capturing ∼94% of all M. truncatula genes. A whole-genome duplication (WGD) approximately 58 Myr ago had a major role in shaping the M. truncatula genome and thereby contributed to the evolution of endosymbiotic nitrogen fixation. Subsequent to the WGD, the M. truncatula genome experienced higher levels of rearrangement than two other sequenced legumes, Glycine max and Lotus japonicus. M. truncatula is a close relative of alfalfa (Medicago sativa), a widely cultivated crop with limited genomics tools and complex autotetraploid genetics. As such, the M. truncatula genome sequence provides significant opportunities to expand alfalfa's genomic toolbox.
Affiliation(s)
- Nevin D Young
  - Department of Plant Pathology, University of Minnesota, St Paul, Minnesota 55108, USA.
28
29
Abstract
The Weighted Constraint Satisfaction Problem (WCSP) framework allows representing and solving problems involving both hard constraints and cost functions. It has been applied to various problems, including resource allocation, bioinformatics and scheduling. To solve such problems, solvers usually rely on branch-and-bound algorithms equipped with local consistency filtering, mostly soft arc consistency. However, these techniques are not well suited to solve problems with very large domains. Motivated by the resolution of an RNA gene localization problem inside large genomic sequences, and in the spirit of bounds consistency for large domains in crisp CSPs, we introduce soft bounds arc consistency (BAC), a new weighted local consistency specifically designed for WCSP with very large domains. Compared to soft arc consistency, BAC provides significantly improved time and space asymptotic complexity. In this paper, we show how the semantics of cost functions can be exploited to further improve the time complexity of BAC. We also compare, both in theory and in practice, the efficiency of BAC on a WCSP with bounds consistency enforced on a crisp CSP using cost variables. On two different real problems modeled as WCSP, including our RNA gene localization problem, we observe that maintaining bounds arc consistency outperforms arc consistency and also improves over bounds consistency enforced on a constraint model with cost variables.
30
Abstract
Summary: Transcriptome sequencing represents a fundamental source of information for genome-wide studies and transcriptome analysis and will become increasingly important for expression analysis as new sequencing technologies take over from array technology. The identification of the protein-coding region in transcript sequences is a prerequisite for systematic amino acid-level analysis and more specifically for domain identification. In this article, we present FrameDP, a self-training integrative pipeline for predicting CDS in transcripts which can adapt itself to different levels of sequence quality. Availability: FrameDP for Linux (web-server and underlying pipeline) is available at http://iant.toulouse.inra.fr/FrameDP for direct use or a standalone installation. Contact: thomas.schiex@toulouse.inra.fr
Affiliation(s)
- Jérôme Gouzy
  - Laboratoire Interactions Plantes Micro-organismes (LIPM) UMR441/2594, INRA/CNRS, F-31320 Castanet Tolosan, France
31
Faraut T, de Givry S, Hitte C, Lahbib-Mansais Y, Morisson M, Milan D, Schiex T, Servin B, Vignal A, Galibert F, Yerle M. Contribution of Radiation Hybrids to Genome Mapping in Domestic Animals. Cytogenet Genome Res 2009; 126:21-33. [DOI: 10.1159/000245904] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Accepted: 07/26/2009] [Indexed: 11/19/2022] Open
32
Abad P, Gouzy J, Aury JM, Castagnone-Sereno P, Danchin EGJ, Deleury E, Perfus-Barbeoch L, Anthouard V, Artiguenave F, Blok VC, Caillaud MC, Coutinho PM, Dasilva C, De Luca F, Deau F, Esquibet M, Flutre T, Goldstone JV, Hamamouch N, Hewezi T, Jaillon O, Jubin C, Leonetti P, Magliano M, Maier TR, Markov GV, McVeigh P, Pesole G, Poulain J, Robinson-Rechavi M, Sallet E, Ségurens B, Steinbach D, Tytgat T, Ugarte E, van Ghelder C, Veronico P, Baum TJ, Blaxter M, Bleve-Zacheo T, Davis EL, Ewbank JJ, Favery B, Grenier E, Henrissat B, Jones JT, Laudet V, Maule AG, Quesneville H, Rosso MN, Schiex T, Smant G, Weissenbach J, Wincker P. Genome sequence of the metazoan plant-parasitic nematode Meloidogyne incognita. Nat Biotechnol 2008; 26:909-15. [PMID: 18660804 DOI: 10.1038/nbt.1482] [Citation(s) in RCA: 666] [Impact Index Per Article: 41.6] [Received: 04/30/2008] [Accepted: 06/25/2008] [Indexed: 01/15/2023]
Abstract
Plant-parasitic nematodes are major agricultural pests worldwide and novel approaches to control them are sorely needed. We report the draft genome sequence of the root-knot nematode Meloidogyne incognita, a biotrophic parasite of many crops, including tomato, cotton and coffee. Most of the assembled sequence of this asexually reproducing nematode, totaling 86 Mb, exists in pairs of homologous but divergent segments. This suggests that ancient allelic regions in M. incognita are evolving toward effective haploidy, permitting new mechanisms of adaptation. The number and diversity of plant cell wall-degrading enzymes in M. incognita is unprecedented in any animal for which a genome sequence is available, and may derive from multiple horizontal gene transfers from bacterial sources. Our results provide insights into the adaptations required by metazoans to successfully parasitize immunocompetent plants, and open the way for discovering new antiparasitic strategies.
Affiliation(s)
- Pierre Abad
  - INRA, UMR 1301, 400 route des Chappes, F-06903 Sophia-Antipolis, France.
33
Foissac S, Gouzy J, Rombauts S, Mathe C, Amselem J, Sterck L, de Peer Y, Rouze P, Schiex T. Genome Annotation in Plants and Fungi: EuGene as a Model Platform. Curr Bioinform 2008. [DOI: 10.2174/157489308784340702] [Citation(s) in RCA: 91] [Impact Index Per Article: 5.7] [Indexed: 11/22/2022]
|
34
|
Noirot C, Gaspin C, Schiex T, Gouzy J. LeARN: a platform for detecting, clustering and annotating non-coding RNAs. BMC Bioinformatics 2008; 9:21. [PMID: 18194551 PMCID: PMC2241582 DOI: 10.1186/1471-2105-9-21] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 06/12/2007] [Accepted: 01/14/2008] [Indexed: 11/16/2022] Open
Abstract
Background: In the last decade, sequencing projects have led to the development of a number of annotation systems dedicated to the structural and functional annotation of protein-coding genes. These annotation systems manage the annotation of non-protein-coding genes (ncRNAs) in a very crude way, allowing neither the editing of secondary structures nor the clustering of ncRNA genes into families, both of which are crucial for appropriate annotation of these molecules. Results: LeARN is a flexible software package that handles the complete process of ncRNA annotation by integrating the layers of automatic detection and human curation. Conclusion: This software provides the infrastructure to deal properly with ncRNAs in the framework of any annotation project. It fills the gap between existing prediction software, which detects independent ncRNA occurrences, and public ncRNA repositories, which do not offer the flexibility and interactivity required for annotation projects. The software is freely available from the download section of the website
Affiliation(s)
- Céline Noirot
- Laboratoire Interactions Plantes Micro-organismes UMR441/2594, INRA/CNRS, F-31320 Castanet Tolosan, France.
|
35
|
Aubourg S, Martin-Magniette ML, Brunaud V, Taconnat L, Bitton F, Balzergue S, Jullien PE, Ingouff M, Thareau V, Schiex T, Lecharny A, Renou JP. Analysis of CATMA transcriptome data identifies hundreds of novel functional genes and improves gene models in the Arabidopsis genome. BMC Genomics 2007; 8:401. [PMID: 17980019 PMCID: PMC2174955 DOI: 10.1186/1471-2164-8-401] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 07/13/2007] [Accepted: 11/02/2007] [Indexed: 11/10/2022] Open
Abstract
Background: Since the completion of the Arabidopsis thaliana genome sequence, the Arabidopsis community and the annotation centers have been working to improve gene annotation at the structural and functional levels. In this context, we have used the large CATMA resource on the Arabidopsis transcriptome to search for genes missed by different annotation processes. Probes on the CATMA microarrays are specific gene sequence tags (GSTs) based on the CDS models predicted by the Eugene software. Among the 24 576 CATMA v2 GSTs, 677 are in regions considered intergenic by the TAIR annotation. We analyzed the cognate transcriptome data in the CATMA resource and carried out data-mining to characterize novel genes and improve gene models. Results: The statistical analysis of the results of more than 500 hybridized samples distributed among 12 organs provides an experimental validation for 465 novel genes. The hybridization evidence was confirmed by RT-PCR approaches for 88% of the 465 novel genes. Comparisons with the current annotation show that these novel genes often encode small proteins, with an average size of 137 aa. Our approach has also led to the improvement of pre-existing gene models through both the extension of 16 CDS and the identification of 13 gene models erroneously constituted of two merged CDS. Conclusion: This work is a notable step forward in the improvement of the Arabidopsis genome annotation. We increased the number of validated Arabidopsis genes by 465 novel transcribed genes, with which we associated several functional annotations such as expression profiles, sequence conservation in plants, cognate transcripts and protein motifs.
Affiliation(s)
- Sébastien Aubourg
- Unité de Recherche en Génomique Végétale (URGV), UMR INRA 1165-CNRS 8114-UEVE, 2 Rue Gaston Crémieux, 91057 Evry Cedex, France.
|
36
|
Prasad A, Schiex T, McKay S, Murdoch B, Wang Z, Womack JE, Stothard P, Moore SS. High resolution radiation hybrid maps of bovine chromosomes 19 and 29: comparison with the bovine genome sequence assembly. BMC Genomics 2007; 8:310. [PMID: 17784962 PMCID: PMC2064936 DOI: 10.1186/1471-2164-8-310] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 03/01/2007] [Accepted: 09/04/2007] [Indexed: 12/05/2022] Open
Abstract
Background: High resolution radiation hybrid (RH) maps can facilitate genome sequence assembly by correctly ordering genes and genetic markers along chromosomes. The objective of the present study was to generate high resolution RH maps of bovine chromosomes 19 (BTA19) and 29 (BTA29), and compare them with the current 7.1X bovine genome sequence assembly (bovine build 3.1). We chose BTA19 and BTA29 as candidate chromosomes for mapping because many Quantitative Trait Loci (QTL) for the traits of carcass merit and residual feed intake have been identified on these chromosomes. Results: We have constructed high resolution maps of BTA19 and BTA29 consisting of 555 and 253 Single Nucleotide Polymorphism (SNP) markers respectively using a 12,000 rad whole genome RH panel. With these markers, the RH maps of BTA19 and BTA29 extended to 4591.4 cR and 2884.1 cR in length respectively. When aligned with the current bovine build 3.1, the order of markers on the RH maps for BTA19 and BTA29 showed inconsistencies with respect to the genome assembly. Maps of both chromosomes show that there is significant internal rearrangement of the markers involving displacement, inversion and flips within the scaffolds, with some scaffolds being misplaced in the genome assembly. We also constructed cattle-human comparative maps of these chromosomes, which showed overall agreement with the comparative maps published previously. However, minor discrepancies in the orientation of a few homologous synteny blocks were observed. Conclusion: The high resolution maps of BTA19 (average 1 locus/139 kb) and BTA29 (average 1 locus/208 kb) presented in this study suggest that, by incorporating RH mapping information, the current bovine genome sequence assembly can be significantly improved. Furthermore, these maps can serve as a potential resource for fine mapping QTL and identification of causative mutations underlying QTL for economically important traits.
Affiliation(s)
- Aparna Prasad
- Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton T6G2P5, Alberta, Canada
- Stephanie McKay
- Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton T6G2P5, Alberta, Canada
- Brenda Murdoch
- Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton T6G2P5, Alberta, Canada
- Zhiquan Wang
- Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton T6G2P5, Alberta, Canada
- Paul Stothard
- Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton T6G2P5, Alberta, Canada
- Stephen S Moore
- Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton T6G2P5, Alberta, Canada
|
37
|
Pralet C, Verfaillie G, Schiex T. An Algebraic Graphical Model for Decision with Uncertainties, Feasibilities, and Utilities. J Artif Intell Res 2007. [DOI: 10.1613/jair.2151] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/04/2022] Open
Abstract
Numerous formalisms and dedicated algorithms have been designed in the last decades to model and solve decision making problems. Some formalisms, such as constraint networks, can express "simple" decision problems, while others are designed to take into account uncertainties, unfeasible decisions, and utilities. Even in a single formalism, several variants are often proposed to model different types of uncertainty (probability, possibility...) or utility (additive or not). In this article, we introduce an algebraic graphical model that encompasses a large number of such formalisms: (1) we first adapt previous structures from Friedman, Chu and Halpern for representing uncertainty, utility, and expected utility in order to deal with generic forms of sequential decision making; (2) on these structures, we then introduce composite graphical models that express information via variables linked by "local" functions, thanks to conditional independence; (3) on these graphical models, we finally define a simple class of queries which can represent various scenarios in terms of observabilities and controllabilities. A natural decision-tree semantics for such queries is completed by an equivalent operational semantics, which induces generic algorithms. The proposed framework, called the Plausibility-Feasibility-Utility (PFU) framework, not only provides a better understanding of the links between existing formalisms, but it also covers yet unpublished frameworks (such as possibilistic influence diagrams) and unifies formalisms such as quantified boolean formulas and influence diagrams. Our backtrack and variable elimination generic algorithms are a first step towards unified algorithms.
|
38
|
Abstract
Motivation: Genome maps are fundamental to the study of an organism and essential in the process of genome sequencing, which in turn provides the ultimate map of the genome. The increased number of genomes being sequenced offers new opportunities for the mapping of closely related organisms. We propose here an algorithmic formalization of a genome comparison approach to marker ordering. Results: In order to integrate a comparative mapping approach into the algorithmic process of map construction and selection, we propose to extend the usual statistical model describing the experimental data, here radiation hybrid (RH) data, into a statistical framework that additionally models the evolutionary relationships between a proposed map and a reference map: an existing map of the corresponding orthologous genes or markers in a closely related organism. In practice, this exploits, in the process of map selection, the information on marker adjacencies in the related genome when the information provided by the experimental data is not conclusive for the purpose of ordering. To compute the map efficiently, we reduce the maximum likelihood estimation to the Traveling Salesman Problem. Experiments on simulated RH datasets as well as on a real RH dataset from the canine RH project show that maps produced using the likelihood defined by the new model are significantly better than maps built using the traditional RH model. Availability: The comparative mapping approach is available in the latest version of CarthaGene [de Givry,S. et al. (2004) Bioinformatics, 21, 1703-1704, www.inra.fr/mia/T/CarthaGene], a free mapping software in C++ (the LKH part is free for academic use only), including LKH (Helsgaun,K. (2000) Eur. J. Oper. Res., 126, 106-130, www.dat.ruc.dk/keld/research/LKH) for maximum likelihood computation.
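The reduction of marker ordering to a path-finding problem can be illustrated with a toy sketch. It is not the method of the paper: it swaps the RH likelihood for a simpler stand-in (the number of retention differences between adjacent markers), replaces the LKH heuristic with brute-force enumeration, and models the comparative term as a flat penalty for adjacencies absent from the reference map. All names (`hamming`, `order_cost`, `best_marker_order`, `weight`) are hypothetical.

```python
from itertools import permutations


def hamming(a, b):
    """Number of hybrid lines in which two markers differ in retention."""
    return sum(x != y for x, y in zip(a, b))


def order_cost(order, retention, ref_adjacent=frozenset(), weight=0.0):
    """Cost of a marker order: retention differences between neighbours,
    plus an assumed penalty for each adjacency missing from the reference."""
    cost = 0.0
    for a, b in zip(order, order[1:]):
        cost += hamming(retention[a], retention[b])
        if frozenset((a, b)) not in ref_adjacent:
            cost += weight
    return cost


def best_marker_order(retention, reference=None, weight=0.0):
    """Exhaustively search marker orders (feasible for a handful of markers only).

    retention: dict mapping marker name -> tuple of 0/1 retention calls.
    reference: optional marker order taken from a related genome.
    """
    ref_adjacent = frozenset()
    if reference:
        ref_adjacent = frozenset(
            frozenset(pair) for pair in zip(reference, reference[1:])
        )
    best_order, best_cost = None, float("inf")
    for order in permutations(sorted(retention)):
        # A map and its mirror image are equivalent; evaluate only one of them.
        if order[0] > order[-1]:
            continue
        cost = order_cost(order, retention, ref_adjacent, weight)
        if cost < best_cost:
            best_order, best_cost = list(order), cost
    return best_order, best_cost


# Toy retention patterns for four markers across six hybrid lines.
toy = {
    "M1": (1, 1, 0, 0, 1, 0),
    "M2": (1, 1, 0, 0, 0, 0),
    "M3": (1, 0, 0, 1, 0, 0),
    "M4": (0, 0, 1, 1, 0, 1),
}
print(best_marker_order(toy))
```

With `weight=0` the search relies on the experimental data alone; a positive `weight` lets reference adjacencies break ties when the data are inconclusive, which is the intuition the abstract describes.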
Affiliation(s)
- T Faraut
- Laboratoire de génétique cellulaire BP 52627, 31326 Castanet Tolosan, France.
|
39
|
Cannon SB, Sterck L, Rombauts S, Sato S, Cheung F, Gouzy J, Wang X, Mudge J, Vasdewani J, Schiex T, Spannagl M, Monaghan E, Nicholson C, Humphray SJ, Schoof H, Mayer KFX, Rogers J, Quétier F, Oldroyd GE, Debellé F, Cook DR, Retzel EF, Roe BA, Town CD, Tabata S, Van de Peer Y, Young ND. Legume genome evolution viewed through the Medicago truncatula and Lotus japonicus genomes. Proc Natl Acad Sci U S A 2006; 103:14959-64. [PMID: 17003129 PMCID: PMC1578499 DOI: 10.1073/pnas.0603228103] [Citation(s) in RCA: 237] [Impact Index Per Article: 13.2] [Indexed: 01/06/2023] Open
Abstract
Genome sequencing of the model legumes, Medicago truncatula and Lotus japonicus, provides an opportunity for large-scale sequence-based comparison of two genomes in the same plant family. Here we report synteny comparisons between these species, including details about chromosome relationships, large-scale synteny blocks, microsynteny within blocks, and genome regions lacking clear correspondence. The Lotus and Medicago genomes share a minimum of 10 large-scale synteny blocks, each with substantial collinearity and frequently extending the length of whole chromosome arms. The proportion of genes syntenic and collinear within each synteny block is relatively homogeneous. Medicago-Lotus comparisons also indicate similar and largely homogeneous gene densities, although gene-containing regions in Mt occupy 20-30% more space than Lj counterparts, primarily because of larger numbers of Mt retrotransposons. Because the interpretation of genome comparisons is complicated by large-scale genome duplications, we describe synteny, synonymous substitutions and phylogenetic analyses to identify and date a probable whole-genome duplication event. There is no direct evidence for any recent large-scale genome duplication in either Medicago or Lotus but instead a duplication predating speciation. Phylogenetic comparisons place this duplication within the Rosid I clade, clearly after the split between legumes and Salicaceae (poplar).
Affiliation(s)
- Steven B. Cannon
- Department of Plant Pathology, University of Minnesota, St. Paul, MN 55108
- U.S. Department of Agriculture–Agricultural Research Service and Department of Agronomy, Iowa State University, Ames, IA 50010
- Lieven Sterck
- Department of Plant Systems Biology (VIB), Ghent University, B-9052 Ghent, Belgium
- Stephane Rombauts
- Department of Plant Systems Biology (VIB), Ghent University, B-9052 Ghent, Belgium
- Shusei Sato
- Kazusa DNA Research Institute, Kisarazu, Chiba 292-0818, Japan
- Foo Cheung
- Institute for Genomic Research, Rockville, MD 20850
- Jérôme Gouzy
- Laboratoire des Interactions Plantes–Microorganismes, Institut National de la Recherche Agronomique–Centre National de la Recherche Scientifique, 31326 Castanet-Tolosan, France
- Xiaohong Wang
- Department of Plant Pathology, University of Minnesota, St. Paul, MN 55108
- Joann Mudge
- Department of Plant Pathology, University of Minnesota, St. Paul, MN 55108
- Thomas Schiex
- Unité de Biométrie et Intelligence Artificielle, B.P. 52627, Institut National de la Recherche Agronomique, 31326 Castanet-Tolosan, France
- Manuel Spannagl
- Munich Information Center for Protein Sequences Institute for Bioinformatics, Gesellschaft für Strahlung und Umweltforschung, Research Center for Environment and Health, 85764 Neuherberg, Germany
- Christine Nicholson
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
- Sean J. Humphray
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
- Heiko Schoof
- Max Planck Institute for Plant Breeding Research, 50829 Köln, Germany
- Klaus F. X. Mayer
- Munich Information Center for Protein Sequences Institute for Bioinformatics, Gesellschaft für Strahlung und Umweltforschung, Research Center for Environment and Health, 85764 Neuherberg, Germany
- Jane Rogers
- Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom
- Frédéric Debellé
- Laboratoire des Interactions Plantes–Microorganismes, Institut National de la Recherche Agronomique–Centre National de la Recherche Scientifique, 31326 Castanet-Tolosan, France
- Douglas R. Cook
- Department of Plant Pathology, University of California, One Shields Avenue, Davis, CA 95616
- Ernest F. Retzel
- Center for Computational Genomics and Bioinformatics, Minneapolis, MN 55455
- Bruce A. Roe
- Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK 73019
- Satoshi Tabata
- Kazusa DNA Research Institute, Kisarazu, Chiba 292-0818, Japan
- Yves Van de Peer
- Department of Plant Systems Biology (VIB), Ghent University, B-9052 Ghent, Belgium
- Nevin D. Young
- Department of Plant Pathology, University of Minnesota, St. Paul, MN 55108
- To whom correspondence should be addressed. E-mail:
|
40
|
Abstract
Motivation: Searching for RNA gene occurrences in genomic sequences is a task whose importance has been renewed by the recent discovery of numerous functional RNAs, often interacting with other ligands. Although several programs exist for RNA motif search, none can represent and solve the problem of searching for occurrences of RNA motifs in interaction with other molecules. Results: We present a constraint network formulation of this problem. RNAs are represented as structured motifs that can occur on more than one sequence and which are related to one another by possible hybridization. The implemented tool, MilPat, is used to search for several sRNA families in genomic sequences. Results show that MilPat allows efficient searching for interacting motifs in large genomic sequences and offers a simple and extensible framework to solve such problems. New and known sRNAs are identified as H/ACA candidates in Methanocaldococcus jannaschii. Availability: http://carlit.toulouse.inra.fr/MilPaT/MilPat.pl.
Affiliation(s)
- P Thébault
- Unité de Biométrie & Intelligence Artificielle, INRA, Chemin de Borde Rouge Auzeville, BP 52627, 31326 Castanet-Tolosan, France
|
41
|
|
42
|
Abstract
Background: Alternative splicing (AS) is now considered a major contributor to transcriptome/proteome diversity, and it cannot be neglected in the annotation process of a new genome. Despite considerable progress in the accuracy of computational gene prediction, the ability to reliably predict AS variants when there is local experimental evidence for them remains an open challenge for gene finders. Results: We have used a new integrative approach that allows AS detection to be incorporated into ab initio gene prediction. This method relies on the analysis of genomically aligned transcript sequences (ESTs and/or cDNAs), and has been implemented in the dynamic programming algorithm of the graph-based gene finder EuGÈNE. Given a genomic sequence and a set of aligned transcripts, this new version identifies the set of transcripts carrying evidence of alternative splicing events, and provides, in addition to the classical optimal gene prediction, alternative optimal predictions (among those which are consistent with the AS events detected). This allows for multiple annotations of a single gene such that each predicted variant is supported by transcript evidence (but not necessarily with full-length coverage). Conclusions: This automatic combination of experimental data analysis and ab initio gene finding offers an ideal integration of alternatively spliced gene prediction inside a single annotation pipeline.
MESH Headings
- Algorithms
- Alternative Splicing
- Arabidopsis/genetics
- Codon
- Computer Graphics
- DNA, Complementary/metabolism
- Databases, Genetic
- Databases, Nucleic Acid
- Databases, Protein
- Exons
- Expressed Sequence Tags
- Gene Expression Profiling
- Genes, Plant
- Genome
- Genome, Human
- Genomics
- Humans
- Introns
- Models, Genetic
- Proteomics/methods
- RNA Splice Sites
- Sequence Alignment
- Sequence Analysis, Protein
- Sequence Analysis, RNA
- Software
- Transcription, Genetic
- User-Computer Interface
Affiliation(s)
- Sylvain Foissac
- Unité de Biométrie et Intelligence Artificielle, INRA, 31326 Castanet Tolosan, France
- Thomas Schiex
- Unité de Biométrie et Intelligence Artificielle, INRA, 31326 Castanet Tolosan, France
|
43
|
Abstract
CarthaGene is an integrated genetic and radiation hybrid (RH) mapping tool which can deal with multiple populations, including mixtures of genetic and RH data. CarthaGene performs multipoint maximum likelihood estimations with accelerated expectation-maximization algorithms for some pedigrees and has sophisticated algorithms for marker ordering. Dedicated heuristics for framework mapping are also included. CarthaGene can be used as a C++ library, through a shell command and a graphical interface. The XML output for companion tools is integrated. Availability: The program is available free of charge from www.inra.fr/bia/T/CarthaGene for Linux, Windows and Solaris machines (with Open Source). Contact: tschiex@toulouse.inra.fr.
Affiliation(s)
- Simon de Givry
- INRA, Biométrie et Intelligence Artificielle/Génétique Cellulaire, BP 27, 31326 Castanet-Tolosan Cedex, France
|
44
|
|
45
|
Schiex T, Gouzy J, Moisan A, de Oliveira Y. FrameD: A flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences. Nucleic Acids Res 2003; 31:3738-41. [PMID: 12824407 PMCID: PMC169016 DOI: 10.1093/nar/gkg610] [Citation(s) in RCA: 81] [Impact Index Per Article: 3.9] [Received: 02/14/2003] [Revised: 04/14/2003] [Accepted: 04/14/2003] [Indexed: 11/13/2022] Open
Abstract
We describe FrameD, a program that predicts coding regions in prokaryotic and matured eukaryotic sequences. Initially targeted at gene prediction in GC-rich bacterial genomes, the gene model used in FrameD also allows genes to be predicted in the presence of frameshifts and partially undetermined sequences, which makes it well suited to gene prediction and frameshift correction in unfinished sequences such as EST and EST cluster sequences. Like recent eukaryotic gene prediction programs, FrameD can also take protein similarity information into account, both in its prediction and in its graphical output. Its performance is evaluated on different bacterial genomes. The web site (http://genopole.toulouse.inra.fr/bioinfo/FrameD/FD) allows direct prediction, sequence correction and translation, as well as the learning of new models for new organisms.
Affiliation(s)
- Thomas Schiex
- Unité de Biométrie et Intelligence Artificielle, INRA, CNRS-INRA, 31326, Castanet Tolosan Cedex, France.
|
46
|
Foissac S, Bardou P, Moisan A, Cros MJ, Schiex T. EUGENE'HOM: A generic similarity-based gene finder using multiple homologous sequences. Nucleic Acids Res 2003; 31:3742-5. [PMID: 12824408 PMCID: PMC168992 DOI: 10.1093/nar/gkg586] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.5] [Indexed: 11/14/2022] Open
Abstract
EUGENE'HOM is a gene prediction software for eukaryotic organisms based on comparative analysis. EUGENE'HOM is able to take into account multiple homologous sequences from more or less closely related organisms. It integrates the results of TBLASTX analysis, splice site and start codon prediction and a robust coding/non-coding probabilistic model which allows EUGENE'HOM to handle sequences from a variety of organisms. The current target of EUGENE'HOM is plant sequences. The EUGENE'HOM web site is available at http://genopole.toulouse.inra.fr/bioinfo/eugene/EuGeneHom/cgi-bin/EuGeneHom.pl.
Affiliation(s)
- Sylvain Foissac
- Laboratoire de Biométrie et Intelligence Artificielle, INRA, 31326, Castanet Tolosan Cedex, France.
|
47
|
Demeure O, Renard C, Yerle M, Faraut T, Riquet J, Robic A, Schiex T, Rink A, Milan D. Rearranged gene order between pig and human in a QTL region on SSC 7. Mamm Genome 2003; 14:71-80. [PMID: 12532270 DOI: 10.1007/s00335-002-3034-1] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.2] [Received: 05/21/2002] [Accepted: 09/17/2002] [Indexed: 10/27/2022]
Abstract
On porcine Chromosome 7, the region surrounding the MHC region contains QTL influencing many traits including growth, back fat thickness, and carcass composition. Towards the identification of the responsible gene(s), this article describes an increase in the density of the radiation hybrid map of SSC 7 in the q11-q14 region and the comparative analysis of gene order on the porcine RH map and human genome assembly. Adding 24 new genes in this region, we were able to build a framework map that fills in gaps on the previous maps. The new software Carthagene was used to build a robust framework in this region. Comparative analysis of human and porcine maps revealed a global conservation of gene order and of distances between genes. A rearranged fragment of around 3.7 Mb was, however, found in the pig approximately 20 Mb upstream from the expected location on the basis of the human map. This rearrangement, found by RH mapping on the IMpRH 7,000 rads panel, has been confirmed by two-color FISH and by mapping on the high resolution IMNpRH2 12,000 rads panel. The rearranged fragment contains two microsatellites found at the most likely QTL location in the INRA QTL experiment. It also contains the BMP5 gene, which, together with CLPS, could be considered as a possible candidate.
Affiliation(s)
- Olivier Demeure
- Laboratoire de Génétique Cellulaire, INRA, BP 27, 31326 Castanet-Tolosan, France
|
48
|
Journet EP, van Tuinen D, Gouzy J, Crespeau H, Carreau V, Farmer MJ, Niebel A, Schiex T, Jaillon O, Chatagnier O, Godiard L, Micheli F, Kahn D, Gianinazzi-Pearson V, Gamas P. Exploring root symbiotic programs in the model legume Medicago truncatula using EST analysis. Nucleic Acids Res 2002; 30:5579-92. [PMID: 12490726 PMCID: PMC140066 DOI: 10.1093/nar/gkf685] [Citation(s) in RCA: 174] [Impact Index Per Article: 7.9] [Received: 07/15/2002] [Revised: 10/18/2002] [Accepted: 10/18/2002] [Indexed: 11/13/2022] Open
Abstract
We report on a large-scale expressed sequence tag (EST) sequencing and analysis program aimed at characterizing the sets of genes expressed in roots of the model legume Medicago truncatula during interactions with either of two microsymbionts, the nitrogen-fixing bacterium Sinorhizobium meliloti or the arbuscular mycorrhizal fungus Glomus intraradices. We have designed specific tools for in silico analysis of EST data, in relation to chimeric cDNA detection, EST clustering, encoded protein prediction, and detection of differential expression. Our 21 473 5'- and 3'-ESTs could be grouped into 6359 EST clusters, corresponding to distinct virtual genes, along with 52 498 other M.truncatula ESTs available in the dbEST (NCBI) database that were recruited in the process. These clusters were manually annotated, using a specifically developed annotation interface. Analysis of EST cluster distribution in various M.truncatula cDNA libraries, supported by a refined R test to evaluate statistical significance and by 'electronic northern' representation, enabled us to identify a large number of novel genes predicted to be up- or down-regulated during either symbiotic root interaction. These in silico analyses provide a first global view of the genetic programs for root symbioses in M.truncatula. A searchable database has been built and can be accessed through a public interface.
Affiliation(s)
- Etienne-Pascal Journet
- Laboratoire de Biologie Moléculaire des Relations Plantes-Microorganismes, CNRS-INRA, Laboratoire de Biométrie et Intelligence Artificielle, INRA, 31326 Castanet-Tolosan Cedex, France.
|
49
|
Mathé C, Sagot MF, Schiex T, Rouzé P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 2002; 30:4103-17. [PMID: 12364589 PMCID: PMC140543 DOI: 10.1093/nar/gkf543] [Citation(s) in RCA: 209] [Impact Index Per Article: 9.5] [Received: 05/28/2002] [Revised: 08/07/2002] [Accepted: 08/07/2002] [Indexed: 11/14/2022] Open
Abstract
While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed that try to address one part of this problem, which consists of locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described and the resulting software classified according to both the method and the type of evidence used. Finally, the several difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.
Affiliation(s)
- Catherine Mathé
- Institut de Pharmacologie et Biologie Structurale, UMR 5089, 205 route de Narbonne, F-31077 Toulouse Cedex, France.
|
50
|
Salanoubat M, Genin S, Artiguenave F, Gouzy J, Mangenot S, Arlat M, Billault A, Brottier P, Camus JC, Cattolico L, Chandler M, Choisne N, Claudel-Renard C, Cunnac S, Demange N, Gaspin C, Lavie M, Moisan A, Robert C, Saurin W, Schiex T, Siguier P, Thébault P, Whalen M, Wincker P, Levy M, Weissenbach J, Boucher CA. Genome sequence of the plant pathogen Ralstonia solanacearum. Nature 2002; 415:497-502. [PMID: 11823852 DOI: 10.1038/415497a] [Citation(s) in RCA: 608] [Impact Index Per Article: 27.6] [Indexed: 12/12/2022]
Abstract
Ralstonia solanacearum is a devastating, soil-borne plant pathogen with a global distribution and an unusually wide host range. It is a model system for the dissection of molecular determinants governing pathogenicity. We present here the complete genome sequence and its analysis of strain GMI1000. The 5.8-megabase (Mb) genome is organized into two replicons: a 3.7-Mb chromosome and a 2.1-Mb megaplasmid. Both replicons have a mosaic structure providing evidence for the acquisition of genes through horizontal gene transfer. Regions containing genetically mobile elements associated with the percentage of G+C bias may have an important function in genome evolution. The genome encodes many proteins potentially associated with a role in pathogenicity. In particular, many putative attachment factors were identified. The complete repertoire of type III secreted effector proteins can be studied. Over 40 candidates were identified. Comparison with other genomes suggests that bacterial plant pathogens and animal pathogens harbour distinct arrays of specialized type III-dependent effectors.
Affiliation(s)
- M Salanoubat
- Genoscope and CNRS UMR-8030, 2 rue Gaston Crémieux, CP5706, 91057 Evry Cedex, France
|