1
Zhu X, Xie S, Armengaud J, Xie W, Guo Z, Kang S, Wu Q, Wang S, Xia J, He R, Zhang Y. Tissue-specific Proteogenomic Analysis of Plutella xylostella Larval Midgut Using a Multialgorithm Pipeline. Mol Cell Proteomics 2016; 15:1791-807. [PMID: 26902207 PMCID: PMC5083088 DOI: 10.1074/mcp.m115.050989] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 04/25/2015] [Revised: 02/04/2016] [Indexed: 11/06/2022] Open
Abstract
The diamondback moth, Plutella xylostella (L.), is the major cosmopolitan pest of brassica and other cruciferous crops. Its larval midgut is a dynamic tissue that interfaces with a wide variety of toxicological and physiological processes. The draft sequence of the P. xylostella genome was recently released, but its annotation remains challenging because of the low sequence coverage of this branch of life and the poor description of exon/intron splicing rules for these insects. Peptide sequencing by computational assignment of tandem mass spectra to genome sequence information provides an experimentally independent approach for confirming or refuting protein predictions, a concept that has been termed proteogenomics. In this study, we carried out an in-depth proteogenomic analysis to complement genome annotation of P. xylostella larval midgut based on shotgun HPLC-ESI-MS/MS data by means of a multialgorithm pipeline. A total of 876,341 tandem mass spectra were searched against the predicted P. xylostella protein sequences and a whole-genome six-frame translation database. Based on a data set comprising 2694 novel genome search specific peptides, we discovered 439 novel protein-coding genes and corrected 128 existing gene models. To get the most accurate data to seed further insect genome annotation, more than half of the novel protein-coding genes, i.e. 235 of 439, were further validated after RT-PCR amplification and sequencing of the corresponding transcripts. Furthermore, we validated 53 novel alternative splicing events. Finally, a total of 6764 proteins were identified, resulting in one of the most comprehensive proteogenomic studies of a nonmodel animal. As the first tissue-specific proteogenomics analysis of P. xylostella, this study provides the fundamental basis for high-throughput proteomics and functional genomics approaches aimed at deciphering the molecular mechanisms of resistance and controlling this pest.
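Searching spectra against both the predicted protein set and a six-frame genome translation, as this abstract describes, implies a bookkeeping step that separates peptides already explained by existing annotations from those found only in the genome search. The Python sketch below illustrates that step only. It is not the authors' multialgorithm pipeline: the function name `classify_peptides`, the toy sequences, and the plain substring matching are all assumptions made for illustration.

```python
def classify_peptides(peptides, annotated_proteins, six_frame_translations):
    """Sort identified peptides into three bins (hypothetical helper).

    annotated      : peptide occurs in at least one predicted protein
    genome_specific: peptide occurs only in a six-frame translation
    unmatched      : peptide occurs in neither sequence set
    """
    result = {"annotated": [], "genome_specific": [], "unmatched": []}
    for pep in peptides:
        if any(pep in protein for protein in annotated_proteins):
            result["annotated"].append(pep)
        elif any(pep in frame for frame in six_frame_translations):
            # Candidate evidence for a novel gene or a gene-model correction.
            result["genome_specific"].append(pep)
        else:
            result["unmatched"].append(pep)
    return result


# Toy data, invented for illustration only.
predicted = ["AAMKTAYIAK"]
six_frame = ["PPGLSDQRR*MKTAYIAK"]
bins = classify_peptides(["MKTAY", "GLSDQ", "WWWWW"], predicted, six_frame)
print(bins)
```

Exact substring matching is the simplest possible choice here. A real workflow would also need to control the false discovery rate of the genome-specific identifications, which this sketch does not attempt.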
Affiliation(s)
- Xun Zhu
- From the ‡Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Jean Armengaud
- ¶CEA-Marcoule, DSV/IBITEC-S/SPI/Li2D, Laboratory, BP 17171, F-30200, Bagnols-sur-Cèze, F-30207, France
- Wen Xie
- From the ‡Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Zhaojiang Guo
- From the ‡Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Shi Kang
- From the ‡Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Qingjun Wu
- From the ‡Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Shaoli Wang
- From the ‡Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Jixing Xia
- From the ‡Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Rongjun He
- From the ‡Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Youjun Zhang
- From the ‡Department of Plant Protection, Institute of Vegetables and Flowers, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
2
Berry IJ, Steele JR, Padula MP, Djordjevic SP. The application of terminomics for the identification of protein start sites and proteoforms in bacteria. Proteomics 2015; 16:257-72. [DOI: 10.1002/pmic.201500319] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 08/06/2015] [Revised: 09/21/2015] [Accepted: 09/30/2015] [Indexed: 01/11/2023]
Affiliation(s)
- Iain J. Berry
- The ithree Institute; University of Technology Sydney; Broadway NSW Australia
- Proteomics Core Facility; University of Technology Sydney; Broadway NSW Australia
- Joel R. Steele
- Proteomics Core Facility; University of Technology Sydney; Broadway NSW Australia
- Matthew P. Padula
- The ithree Institute; University of Technology Sydney; Broadway NSW Australia
- Proteomics Core Facility; University of Technology Sydney; Broadway NSW Australia
- Steven P. Djordjevic
- The ithree Institute; University of Technology Sydney; Broadway NSW Australia
- Proteomics Core Facility; University of Technology Sydney; Broadway NSW Australia
3
Kumar D, Mondal AK, Kutum R, Dash D. Proteogenomics of rare taxonomic phyla: A prospective treasure trove of protein coding genes. Proteomics 2015; 16:226-40. [PMID: 26773550 DOI: 10.1002/pmic.201500263] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 06/30/2015] [Revised: 09/18/2015] [Accepted: 09/28/2015] [Indexed: 01/04/2023]
Abstract
Sustainable innovations in sequencing technologies have resulted in a torrent of microbial genome sequencing projects. However, the prokaryotic genomes sequenced so far are unequally distributed along their phylogenetic tree; few phyla contain the majority, the rest only a few representatives. Accurate genome annotation lags far behind genome sequencing. While automated computational prediction, aided by comparative genomics, remains a popular choice for genome annotation, a substantial fraction of these annotations are erroneous. Proteogenomics utilizes protein level experimental observations to annotate protein coding genes on a genome wide scale. Benefits of proteogenomics include discovery and correction of gene annotations regardless of their phylogenetic conservation. This not only allows detection of common, conserved proteins but also the discovery of protein products of rare genes that may be horizontally transferred or taxonomy specific. Chances of encountering such genes are higher in rare phyla that comprise a small number of complete genome sequences. We collated all bacterial and archaeal proteogenomic studies carried out to date and reviewed them in the context of genome sequencing projects. Here, we present a comprehensive list of microbial proteogenomic studies and their taxonomic distribution, and urge targeted proteogenomics of underexplored taxa to build an extensive reference of protein coding genes.
Affiliation(s)
- Dhirendra Kumar
- G. N. Ramachandran Knowledge Center of Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Delhi, India
- Anupam Kumar Mondal
- G. N. Ramachandran Knowledge Center of Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Delhi, India
- Rintu Kutum
- G. N. Ramachandran Knowledge Center of Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Delhi, India
- Debasis Dash
- G. N. Ramachandran Knowledge Center of Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Delhi, India
4
Zickmann F, Renard BY. IPred - integrating ab initio and evidence based gene predictions to improve prediction accuracy. BMC Genomics 2015; 16:134. [PMID: 25766582 PMCID: PMC4345001 DOI: 10.1186/s12864-015-1315-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 09/02/2014] [Accepted: 02/03/2015] [Indexed: 11/21/2022] Open
Abstract
Background: Gene prediction is a challenging but crucial part in most genome analysis pipelines. Various methods have evolved that predict genes ab initio on reference sequences or evidence based with the help of additional information, such as RNA-Seq reads or EST libraries. However, none of these strategies is bias-free and one method alone does not necessarily provide a complete set of accurate predictions. Results: We present IPred (Integrative gene Prediction), a method to integrate ab initio and evidence based gene identifications to complement the advantages of different prediction strategies. IPred builds on the output of gene finders and generates a new combined set of gene identifications, representing the integrated evidence of the single method predictions. Conclusion: We evaluate IPred in simulations and real data experiments on Escherichia coli and human data. We show that IPred improves the prediction accuracy in comparison to single method predictions and to existing methods for prediction combination.
Affiliation(s)
- Franziska Zickmann
- Research Group Bioinformatics (NG4), Robert Koch-Institute, Berlin, Germany.
- Bernhard Y Renard
- Research Group Bioinformatics (NG4), Robert Koch-Institute, Berlin, Germany.
5
Kucharova V, Wiker HG. Proteogenomics in microbiology: taking the right turn at the junction of genomics and proteomics. Proteomics 2014; 14:2360-675. [PMID: 25263021 DOI: 10.1002/pmic.201400168] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 04/28/2014] [Revised: 08/18/2014] [Accepted: 09/23/2014] [Indexed: 12/14/2022]
Abstract
High-accuracy and high-throughput proteomic methods have completely changed the way we can identify and characterize proteins. MS-based proteomics can now provide a unique supplement to genomic data and add a new level of information to the interpretation of genomic sequences. Proteomics-driven genome annotation has become especially relevant in microbiology where genomes are sequenced on a daily basis and limitations of an in silico driven annotation process are well recognized. In this review paper, we outline different strategies on how one can design a proteogenomic experiment, for example on genome-sequenced (synonymous proteogenomics) versus unsequenced organisms (ortho-proteogenomics) or with the aid of other "omic" data such as RNA-seq. We touch upon many challenges that are encountered during a typical proteogenomic study, mostly concerning bioinformatics methods and downstream data analysis, but also related to creation and use of sequence databases. A large list of proteogenomic case studies of different microorganisms is provided to illustrate the mapping of MS/MS-derived peptide spectra to genomic DNA sequences. These investigations have led to accurate determination of translational initiation sites, pointed out eventual read-throughs or programmed frameshifts, detected signal peptide processing or other protein maturation events, removed questionable annotation assignments, and provided evidence for predicted hypothetical proteins.
Affiliation(s)
- Veronika Kucharova
- Department of Clinical Science, The Gade Research Group for Infection and Immunity, University of Bergen, Norway
6
Meijer HJG, Mancuso FM, Espadas G, Seidl MF, Chiva C, Govers F, Sabidó E. Profiling the secretome and extracellular proteome of the potato late blight pathogen Phytophthora infestans. Mol Cell Proteomics 2014; 13:2101-13. [PMID: 24872595 PMCID: PMC4125740 DOI: 10.1074/mcp.m113.035873] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.1] [Received: 10/31/2013] [Revised: 05/09/2014] [Indexed: 11/06/2022] Open
Abstract
Oomycetes are filamentous organisms that cause notorious diseases, several of which have a high economic impact. Well known is Phytophthora infestans, the causal agent of potato late blight. Previously, in silico analyses of the genome and transcriptome of P. infestans resulted in the annotation of a large number of genes encoding proteins with an N-terminal signal peptide. This set is collectively referred to as the secretome and comprises proteins involved in, for example, cell wall growth and modification, proteolytic processes, and the promotion of successful invasion of plant cells. So far, proteomic profiling in oomycetes was primarily focused on subcellular, intracellular or cell wall fractions; the extracellular proteome has not been studied systematically. Here we present the first comprehensive characterization of the in vivo secretome and extracellular proteome of P. infestans. We have used mass spectrometry to analyze P. infestans proteins present in seven different growth media with mycelial cultures and this resulted in the consistent identification of over two hundred proteins. Gene ontology classification pinpointed proteins involved in cell wall modifications, pathogenesis, defense responses, and proteolytic processes. Moreover, we found members of the RXLR and CRN effector families as well as several proteins lacking an obvious signal peptide. The latter were confirmed to be bona fide extracellular proteins and this suggests that, similar to other organisms, oomycetes exploit non-conventional secretion mechanisms to transfer certain proteins to the extracellular environment.
Affiliation(s)
- Harold J G Meijer
- From the ‡Laboratory of Phytopathology, Wageningen University, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands
- Francesco M Mancuso
- §Proteomics Unit, Center of Genomics Regulation (CRG), Carrer Dr. Aiguader 88, 08003 Barcelona, Spain; ¶Proteomics Unit, Universitat Pompeu Fabra (UPF), Carrer Dr. Aiguader 88, 08003 Barcelona, Spain
- Guadalupe Espadas
- §Proteomics Unit, Center of Genomics Regulation (CRG), Carrer Dr. Aiguader 88, 08003 Barcelona, Spain; ¶Proteomics Unit, Universitat Pompeu Fabra (UPF), Carrer Dr. Aiguader 88, 08003 Barcelona, Spain
- Michael F Seidl
- From the ‡Laboratory of Phytopathology, Wageningen University, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands; ‖Centre for BioSystems Genomics, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands
- Cristina Chiva
- §Proteomics Unit, Center of Genomics Regulation (CRG), Carrer Dr. Aiguader 88, 08003 Barcelona, Spain; ¶Proteomics Unit, Universitat Pompeu Fabra (UPF), Carrer Dr. Aiguader 88, 08003 Barcelona, Spain
- Francine Govers
- From the ‡Laboratory of Phytopathology, Wageningen University, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands; ‖Centre for BioSystems Genomics, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands
- Eduard Sabidó
- §Proteomics Unit, Center of Genomics Regulation (CRG), Carrer Dr. Aiguader 88, 08003 Barcelona, Spain; ¶Proteomics Unit, Universitat Pompeu Fabra (UPF), Carrer Dr. Aiguader 88, 08003 Barcelona, Spain
7
Bland C, Hartmann EM, Christie-Oleza JA, Fernandez B, Armengaud J. N-Terminal-oriented proteogenomics of the marine bacterium Roseobacter denitrificans OCh114 using (N-Succinimidyloxycarbonylmethyl)tris(2,4,6-trimethoxyphenyl)phosphonium bromide (TMPP) labeling and diagonal chromatography. Mol Cell Proteomics 2014; 13:1369-81. [PMID: 24536027 DOI: 10.1074/mcp.o113.032854] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Indexed: 11/06/2022] Open
Abstract
Given the ease of whole genome sequencing with next-generation sequencers, structural and functional gene annotation is now purely based on automated prediction. However, errors in gene structure are frequent, the correct determination of start codons being one of the main concerns. Here, we combine protein N termini derivatization using (N-Succinimidyloxycarbonylmethyl)tris(2,4,6-trimethoxyphenyl)phosphonium bromide (TMPP Ac-OSu) as a labeling reagent with the COmbined FRActional DIagonal Chromatography (COFRADIC) sorting method to enrich labeled N-terminal peptides for mass spectrometry detection. Protein digestion was performed in parallel with three proteases to obtain a reliable automatic validation of protein N termini. The analysis of these N-terminal enriched fractions by high-resolution tandem mass spectrometry allowed the annotation refinement of 534 proteins of the model marine bacterium Roseobacter denitrificans OCh114. This study is especially efficient regarding mass spectrometry analytical time. From the 534 validated N termini, 480 confirmed existing gene annotations, 41 highlighted erroneous start codon annotations, five revealed totally new mis-annotated genes; the mass spectrometry data also suggested the existence of multiple start sites for eight different genes, a result that challenges the current view of protein translation initiation. Finally, we identified several proteins for which classical genome homology-driven annotation was inconsistent, questioning the validity of automatic annotation pipelines and emphasizing the need for complementary proteomic data. All data have been deposited to the ProteomeXchange with identifier PXD000337.
Affiliation(s)
- Céline Bland
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze, F-30207, France
8
Abstract
MOTIVATION: The reliable identification of genes is a major challenge in genome research, as further analysis depends on the correctness of this initial step. With high-throughput RNA-Seq data reflecting currently expressed genes, a particularly meaningful source of information has become commonly available for gene finding. However, practical application in automated gene identification is still not the standard case. A particular challenge in including RNA-Seq data is the difficult handling of ambiguously mapped reads. RESULTS: We present GIIRA (Gene Identification Incorporating RNA-Seq data and Ambiguous reads), a novel prokaryotic and eukaryotic gene finder that is exclusively based on an RNA-Seq mapping and inherently includes ambiguously mapped reads. GIIRA extracts candidate regions supported by a sufficient number of mappings and reassigns ambiguous reads to their most likely origin using a maximum-flow approach. This avoids the exclusion of genes that are predominantly supported by ambiguous mappings. Evaluation on simulated and real data and comparison with existing methods incorporating RNA-Seq information highlight the accuracy of GIIRA in identifying the expressed genes. AVAILABILITY AND IMPLEMENTATION: GIIRA is implemented in Java and is available from https://sourceforge.net/projects/giira/.
Affiliation(s)
- Franziska Zickmann
- Research Group Bioinformatics (NG4), Robert Koch-Institute, Nordufer 20, 13353 Berlin, Germany
9
Abstract
Metaproteomic studies of whole microbial communities from environmental samples (e.g., soil, sediments, freshwater, seawater, etc.) have rapidly increased in recent years due to many technological advances in mass spectrometry (MS). A single 24-h liquid chromatograph-tandem mass spectrometry (LC-MS/MS) measurement can potentially detect and quantify thousands of proteins from many dominant and subdominant naturally occurring microbial populations. Importantly, amino acid sequences and relative abundance information for detected peptides are determined, which allows for the characterization of expressed protein functions within communities and specific matches to be made to microbial lineages, with potential subspecies resolution. Continued optimization of protein extraction and fractionation protocols, development of quantification methods, and advances in mass spectrometry instrumentation are enabling more accurate and comprehensive peptide detection within samples, leading to wider research applicability, greater ease of use, and overall accessibility. This chapter provides a brief overview of metaproteomics experimental options, including a general protocol for sample handling and LC-MS/MS measurement.
Affiliation(s)
- Ryan S Mueller
- Department of Microbiology, Oregon State University, Corvallis, Oregon, USA.
10
Volkening JD, Bailey DJ, Rose CM, Grimsrud PA, Howes-Podoll M, Venkateshwaran M, Westphall MS, Ané JM, Coon JJ, Sussman MR. A proteogenomic survey of the Medicago truncatula genome. Mol Cell Proteomics 2012; 11:933-44. [PMID: 22774004 DOI: 10.1074/mcp.m112.019471] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Indexed: 11/06/2022] Open
Abstract
Peptide sequencing by computational assignment of tandem mass spectra to a database of putative protein sequences provides an independent approach to confirming or refuting protein predictions based on large-scale DNA and RNA sequencing efforts. This use of mass spectrometrically-derived sequence data for testing and refining predicted gene models has been termed proteogenomics. We report herein the application of proteogenomic methodology to a database of 10.9 million tandem mass spectra collected over a period of two years from proteolytically generated peptides isolated from the model legume Medicago truncatula. These spectra were searched against a database of predicted M. truncatula protein sequences generated from public databases, in silico gene model predictions, and a whole-genome six-frame translation. This search identified 78,647 distinct peptide sequences, and a comparison with the publicly available proteome from the recently published M. truncatula genome supported translation of 9,843 existing gene models and identified 1,568 novel peptides suggesting corrections or additions to the current annotations. Each supporting and novel peptide was independently validated using mRNA-derived deep sequencing coverage and an overall correlation of 93% between the two data types was observed. We have additionally highlighted examples of several aspects of structural annotation for which tandem MS provides unique evidence not easily obtainable through typical DNA or RNA sequencing. Proteogenomic analysis is a valuable and unique source of information for the structural annotation of genomes and should be included in such efforts to ensure that the genome models used by biologists mirror as accurately as possible what is present in the cell.
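A whole-genome six-frame translation, as mentioned in this abstract, translates a DNA sequence in all three reading frames on both strands so that peptides can be matched without relying on gene predictions. The Python sketch below shows the idea on a short sequence. It is an illustration under assumptions, not the authors' database construction: the function names are hypothetical, the standard genetic code is assumed, and codons containing ambiguous bases are written as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3) to translated sequences."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Toy sequence, invented for illustration.
for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```

In a search database, each frame would typically be split at stop codons (`*`) into candidate open reading frames, with a minimum length filter to limit database size.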
Affiliation(s)
- Jeremy D Volkening
- Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin 53706, USA
11
Savidor A, Teper D, Gartemann KH, Eichenlaub R, Chalupowicz L, Manulis-Sasson S, Barash I, Tews H, Mayer K, Giannone RJ, Hettich RL, Sessa G. The Clavibacter michiganensis subsp. michiganensis–Tomato Interactome Reveals the Perception of Pathogen by the Host and Suggests Mechanisms of Infection. J Proteome Res 2011; 11:736-50. [DOI: 10.1021/pr200646a] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.5] [Indexed: 11/28/2022]
Affiliation(s)
- Alon Savidor
- Department of Molecular Biology and Ecology of Plants, Tel Aviv University, Tel Aviv 69978, Israel
- Doron Teper
- Department of Molecular Biology and Ecology of Plants, Tel Aviv University, Tel Aviv 69978, Israel
- Karl-Heinz Gartemann
- Department of Genetechnology/Microbiology, Faculty of Biology, University of Bielefeld, 33501 Bielefeld, Germany
- Rudolf Eichenlaub
- Department of Genetechnology/Microbiology, Faculty of Biology, University of Bielefeld, 33501 Bielefeld, Germany
- Laura Chalupowicz
- Department of Plant Pathology and Weed Research, ARO, The Volcani Center, Bet Dagan 50250, Israel
- Shulamit Manulis-Sasson
- Department of Plant Pathology and Weed Research, ARO, The Volcani Center, Bet Dagan 50250, Israel
- Isaac Barash
- Department of Molecular Biology and Ecology of Plants, Tel Aviv University, Tel Aviv 69978, Israel
- Helena Tews
- Department of Genetechnology/Microbiology, Faculty of Biology, University of Bielefeld, 33501 Bielefeld, Germany
- Kerstin Mayer
- Department of Genetechnology/Microbiology, Faculty of Biology, University of Bielefeld, 33501 Bielefeld, Germany
- Richard J. Giannone
- Chemical Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
- Robert L. Hettich
- Chemical Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
- Guido Sessa
- Department of Molecular Biology and Ecology of Plants, Tel Aviv University, Tel Aviv 69978, Israel
12
Castellana N, Bafna V. Proteogenomics to discover the full coding content of genomes: a computational perspective. J Proteomics 2010; 73:2124-35. [PMID: 20620248 DOI: 10.1016/j.jprot.2010.06.007] [Citation(s) in RCA: 134] [Impact Index Per Article: 9.6] [Received: 03/24/2010] [Revised: 06/04/2010] [Accepted: 06/21/2010] [Indexed: 11/16/2022]
Abstract
Proteogenomics has emerged as a field at the junction of genomics and proteomics. It is a loose collection of technologies that allow the search of tandem mass spectra against genomic databases to identify and characterize protein-coding genes. Proteogenomic peptides provide invaluable information for gene annotation, which is difficult or impossible to ascertain using standard annotation methods. Examples include confirmation of translation, reading-frame determination, identification of gene and exon boundaries, evidence for post-translational processing, identification of splice-forms including alternative splicing, and also, prediction of completely novel genes. For proteogenomics to deliver on its promise, however, it must overcome a number of technological hurdles, including speed and accuracy of peptide identification, construction and search of specialized databases, correction of sampling bias, and others. This article reviews the state of the art of the field, focusing on the current successes, and the role of computation in overcoming these challenges. We describe how technological and algorithmic advances have already enabled large-scale proteogenomic studies in many model organisms, including arabidopsis, yeast, fly, and human. We also provide a preview of the field going forward, describing early efforts in tackling the problems of complex gene structures, searching against genomes of related species, and immunoglobulin gene reconstruction.
Affiliation(s)
- Natalie Castellana
- Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093-0404, USA
13
Abstract
Alternative splicing (AS) and processing of pre-messenger RNAs explains the discrepancy between the number of genes and proteome complexity in multicellular eukaryotic organisms. However, relatively few alternative protein isoforms have been experimentally identified, particularly at the protein level. In this study, we assess the ability of proteomics to inform on differently spliced protein isoforms in human and four other model eukaryotes. The number of Ensembl-annotated genes for which proteomic data exists that informs on AS exceeds 33% of the alternately spliced genes in the human and worm genomes. Examining AS in chicken via proteomics for the first time, we find support for over 600 AS genes. However, although peptide identifications support only a small fraction of alternative protein isoforms that are annotated in Ensembl, many more variants are amenable to proteomic identification. There remains a sizeable gap between these existing identifications (10-52% of AS genes) and those that are theoretically feasible (90-99%). We also compare annotations between Swiss-Prot and Ensembl, recommending use of both to maximize coverage of AS. We propose that targeted proteomic experiments using selected reactions and standards are essential to uncover further alternative isoforms and discuss the issues surrounding these strategies.
Affiliation(s)
- Paul Blakeley
- Faculty of Life Sciences, Michael Smith Building, University of Manchester, Manchester, UK
14
Thompson MR, Chourey K, Froelich JM, Erickson BK, VerBerkmoes NC, Hettich RL. Experimental approach for deep proteome measurements from small-scale microbial biomass samples. Anal Chem 2009; 80:9517-25. [PMID: 19072265 DOI: 10.1021/ac801707s] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Indexed: 11/29/2022]
Abstract
Many methods of microbial proteome characterizations require large quantities of cellular biomass (>1-2 g) for sample preparation and protein identification. Our experimental approach differs from traditional techniques by providing the ability to identify the proteomic state of a microbe from a few milligrams of starting cellular material. The small-scale, guanidine lysis method minimizes sample loss by achieving cellular lysis and protein digestion in a single-tube experiment. For this experimental approach, the freshwater microbe Shewanella oneidensis MR-1 and the purple non-sulfur bacterium Rhodopseudomonas palustris CGA0010 were used as model organisms for technology development and evaluation. A 2-D LC-MS/MS comparison between a standard sonication lysis method and the small-scale guanidine lysis techniques demonstrates that the guanidine lysis method is more efficient with smaller sample amounts of cell pellet (i.e., down to 1 mg). The described methodology enables deeper proteome measurements from a few milliliters of confluent bacterial cultures. We also report a new protocol for efficient lysis from small amounts of natural biofilm samples for deep proteome measurements, which should greatly enhance the emerging field of environmental microbial community proteomics. This straightforward sample boiling protocol is complementary to the small-scale guanidine lysis technique, is amenable for small sample quantities, and requires no special reagents that might complicate the MS measurements.
Affiliation(s)
- Melissa R Thompson
- Graduate School of Genome Science and Technology, Oak Ridge National Laboratory-University of Tennessee, Knoxville, Tennessee 37830, USA
|
15
|
Armengaud J. A perfect genome annotation is within reach with the proteomics and genomics alliance. Curr Opin Microbiol 2009; 12:292-300. [PMID: 19410500 DOI: 10.1016/j.mib.2009.03.005] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.3] [Received: 01/12/2009] [Revised: 03/26/2009] [Accepted: 03/26/2009] [Indexed: 11/17/2022]
Abstract
High-throughput identification of proteins and their accurate partial sequencing by shotgun nanoLC-MS/MS are now feasible for any cellular model at a full genomic scale. Proteogenomics is the integration of these data with the genome. Mining microbial proteomes allows validation of predicted orphan genes and correction of genome annotation errors such as discovery of unannotated genes, reversal of reading frames and identification of translational start sites, stop codon read-throughs or programmed frameshifts. Recent advances have been achieved in database searches, N-terminal oriented proteomics and homology-driven proteogenomics. From now on, proteogenomics on newly sequenced model genomes can be carried out at the earliest stage of the genome project as already exemplified by Mycoplasma mobile and Deinococcus deserti genomes. The proteomics and genomics alliance produces almost complete and accurate gene catalogues for small microbial genomes, a comprehensiveness which is essential for efficient systems biology.
Affiliation(s)
- Jean Armengaud
- CEA, DSV, IBEB, Lab Biochim System Perturb, Bagnols-sur-Cèze, France.
|
16
|
VerBerkmoes NC, Denef VJ, Hettich RL, Banfield JF. Functional analysis of natural microbial consortia using community proteomics. Nat Rev Microbiol 2009; 7:196-205. [DOI: 10.1038/nrmicro2080] [Citation(s) in RCA: 195] [Impact Index Per Article: 13.0] [Indexed: 01/14/2023]
|
17
|
Castellana NE, Payne SH, Shen Z, Stanke M, Bafna V, Briggs SP. Discovery and revision of Arabidopsis genes by proteogenomics. Proc Natl Acad Sci U S A 2008; 105:21034-8. [PMID: 19098097 DOI: 10.1073/pnas.0811066106] [Citation(s) in RCA: 232] [Impact Index Per Article: 14.5] [Indexed: 12/26/2022] Open
Abstract
Gene annotation underpins genome science. Most often protein coding sequence is inferred from the genome based on transcript evidence and computational predictions. While generally correct, gene models suffer from errors in reading frame, exon border definition, and exon identification. To ascertain the error rate of Arabidopsis thaliana gene models, we isolated proteins from a sample of Arabidopsis tissues and determined the amino acid sequences of 144,079 distinct peptides by tandem mass spectrometry. The peptides corresponded to 1 or more of 3 different translations of the genome: a 6-frame translation, an exon splice-graph, and the currently annotated proteome. The majority of the peptides (126,055) resided in existing gene models (12,769 confirmed proteins), comprising 40% of annotated genes. Surprisingly, 18,024 novel peptides were found that do not correspond to annotated genes. Using the gene finding program AUGUSTUS and 5,426 novel peptides that occurred in clusters, we discovered 778 new protein-coding genes and refined the annotation of an additional 695 gene models. The remaining 13,449 novel peptides provide high quality annotation (>99% correct) for thousands of additional genes. Our observation that 18,024 of 144,079 peptides did not match current gene models suggests that 13% of the Arabidopsis proteome was incomplete due to approximately equal numbers of missing and incorrect gene models.
|
18
|
Kim S, Gupta N, Bandeira N, Pevzner PA. Spectral dictionaries: Integrating de novo peptide sequencing with database search of tandem mass spectra. Mol Cell Proteomics 2008; 8:53-69. [PMID: 18703573 DOI: 10.1074/mcp.m800103-mcp200] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.3] [Indexed: 11/06/2022] Open
Abstract
Database search tools identify peptides by matching tandem mass spectra against a protein database. We study an alternative approach when all plausible de novo interpretations of a spectrum (spectral dictionary) are generated and then quickly matched against the database. We present a new MS-Dictionary algorithm for efficiently generating spectral dictionaries and demonstrate that MS-Dictionary can identify spectra that are missed in the database search. We argue that MS-Dictionary enables proteogenomics searches in six-frame translation of genomic sequences that may be prohibitively time-consuming for existing database search approaches. We show that such searches allow one to correct sequencing errors and find programmed frameshifts.
Affiliation(s)
- Sangtae Kim
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, California 92093, USA
|
19
|
Savidor A, Donahoo RS, Hurtado-Gonzales O, Land ML, Shah MB, Lamour KH, McDonald WH. Cross-species global proteomics reveals conserved and unique processes in Phytophthora sojae and Phytophthora ramorum. Mol Cell Proteomics 2008; 7:1501-16. [PMID: 18316789 PMCID: PMC2500229 DOI: 10.1074/mcp.m700431-mcp200] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.3] [Received: 09/11/2007] [Revised: 01/23/2008] [Indexed: 11/06/2022] Open
Abstract
Phytophthora ramorum and Phytophthora sojae are destructive plant pathogens. P. sojae has a narrow host range, whereas P. ramorum has a wide host range. A global proteomics comparison of the vegetative (mycelium) and infective (germinating cyst) life stages of P. sojae and P. ramorum was conducted to identify candidate proteins involved in host range, early infection, and vegetative growth. Sixty-two candidates for early infection, 26 candidates for vegetative growth, and numerous proteins that may be involved in defining host specificity were identified. In addition, common life stage proteomic trends between the organisms were observed. In mycelia, proteins involved in transport and metabolism of amino acids, carbohydrates, and other small molecules were up-regulated. In the germinating cysts, up-regulated proteins associated with lipid transport and metabolism, cytoskeleton, and protein synthesis were observed. It appears that the germinating cyst catabolizes lipid reserves through the beta-oxidation pathway to drive the extensive protein synthesis necessary to produce the germ tube and initiate infection. Once inside the host, the pathogen switches to vegetative growth in which energy is derived from glycolysis and utilized for synthesis of amino acids and other molecules that assist survival in the plant tissue.
Affiliation(s)
- Alon Savidor
- Graduate School of Genome Science and Technology, University of Tennessee-Oak Ridge National Laboratory Oak Ridge, Oak Ridge, Tennessee 37830, USA
|
20
|
Gupta N, Benhamida J, Bhargava V, Goodman D, Kain E, Kerman I, Nguyen N, Ollikainen N, Rodriguez J, Wang J, Lipton MS, Romine M, Bafna V, Smith RD, Pevzner PA. Comparative proteogenomics: combining mass spectrometry and comparative genomics to analyze multiple genomes. Genome Res 2008; 18:1133-42. [PMID: 18426904 DOI: 10.1101/gr.074344.107] [Citation(s) in RCA: 94] [Impact Index Per Article: 5.9] [Indexed: 11/25/2022]
Abstract
Recent proliferation of low-cost DNA sequencing techniques will soon lead to an explosive growth in the number of sequenced genomes and will turn manual annotations into a luxury. Mass spectrometry recently emerged as a valuable technique for proteogenomic annotations that improves on the state-of-the-art in predicting genes and other features. However, previous proteogenomic approaches were limited to a single genome and did not take advantage of analyzing mass spectrometry data from multiple genomes at once. We show that such a comparative proteogenomics approach (like comparative genomics) allows one to address the problems that remained beyond the reach of the traditional "single proteome" approach in mass spectrometry. In particular, we show how comparative proteogenomics addresses the notoriously difficult problem of "one-hit-wonders" in proteomics, improves on the existing gene prediction tools in genomics, and allows identification of rare post-translational modifications. We therefore argue that complementing DNA sequencing projects by comparative proteogenomics projects can be a viable approach to improve both genomic and proteomic annotations.
Affiliation(s)
- Nitin Gupta
- Bioinformatics Program, University of California San Diego, La Jolla, California 92093, USA.
|
21
|
Ferro M, Tardif M, Reguer E, Cahuzac R, Bruley C, Vermat T, Nugues E, Vigouroux M, Vandenbrouck Y, Garin J, Viari A. PepLine: a software pipeline for high-throughput direct mapping of tandem mass spectrometry data on genomic sequences. J Proteome Res 2008; 7:1873-83. [PMID: 18348511 DOI: 10.1021/pr070415k] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Indexed: 11/29/2022]
Abstract
PepLine is a fully automated software pipeline that maps MS/MS fragmentation spectra of tryptic peptides to genomic DNA sequences. The approach is based on Peptide Sequence Tags (PSTs) obtained from partial interpretation of QTOF MS/MS spectra (first module). PSTs are then mapped on the six-frame translations of genomic sequences (second module), giving hits. Hits are then clustered to detect potential coding regions (third module). Our work aimed at optimizing the algorithms of each component to allow the whole pipeline to proceed in a fully automated manner using raw nucleic acid sequences (i.e., genomes that have not been "reduced" to a database of ORFs or putative exon sequences). The whole pipeline was tested on controlled MS/MS spectra sets from standard proteins and from Arabidopsis thaliana chloroplast envelope samples. Our results demonstrate that PepLine competed with protein database search software and was fast enough to potentially tackle large data sets and/or large genomes. We also illustrate the potential of this approach for the detection of the intron/exon structure of genes.
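The six-frame mapping step described in this abstract can be illustrated with a minimal Python sketch. It is not PepLine's code: the function names (`six_frame_translations`, `map_tag`) are hypothetical, the standard genetic code is assumed, the tag is matched as an exact substring, and the clustering module is not covered.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# organism-specific translation table is needed for this illustration).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate both strands at offsets 0, 1 and 2, keyed '+1'..'+3', '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def map_tag(tag, frames):
    """Return (frame, residue index) for every exact occurrence of a peptide tag."""
    hits = []
    for name, protein in frames.items():
        start = protein.find(tag)
        while start != -1:
            hits.append((name, start))
            start = protein.find(tag, start + 1)
    return hits


frames = six_frame_translations("ATGGCCAAATAA")
hits = map_tag("WP", frames)
```

Real sequence tags carry mass gaps and tolerances, so a production mapper would score approximate matches rather than rely on exact string search as this sketch does.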
Affiliation(s)
- Myriam Ferro
- CEA, DSV, iRTSV, Laboratoire d'Etude de la Dynamique des Protéomes, Grenoble, F-38054, France
|
22
|
Lucitt MB, Price TS, Pizarro A, Wu W, Yocum AK, Seiler C, Pack MA, Blair IA, Fitzgerald GA, Grosser T. Analysis of the zebrafish proteome during embryonic development. Mol Cell Proteomics 2008; 7:981-94. [PMID: 18212345 DOI: 10.1074/mcp.m700382-mcp200] [Citation(s) in RCA: 101] [Impact Index Per Article: 6.3] [Indexed: 01/05/2023] Open
Abstract
The model organism zebrafish (Danio rerio) is particularly amenable to studies deciphering regulatory genetic networks in vertebrate development, biology, and pharmacology. Unraveling the functional dynamics of such networks requires precise quantitation of protein expression during organismal growth, which is incrementally challenging with progressive complexity of the systems. In an approach toward such quantitative studies of dynamic network behavior, we applied mass spectrometric methodology and rigorous statistical analysis to create comprehensive, high quality profiles of proteins expressed at two stages of zebrafish development. Proteins of embryos 72 and 120 h postfertilization (hpf) were isolated and analyzed both by two-dimensional (2D) LC followed by ESI-MS/MS and by 2D PAGE followed by MALDI-TOF/TOF protein identification. We detected 1384 proteins from 327,906 peptide sequence identifications at 72 and 120 hpf with false identification rates of less than 1% using 2D LC-ESI-MS/MS. These included only approximately 30% of proteins that were identified by 2D PAGE-MALDI-TOF/TOF. Roughly 10% of all detected proteins were derived from hypothetical or predicted gene models or were entirely unannotated. Comparison of protein expression by 2D DIGE revealed that proteins involved in energy production and transcription/translation were relatively more abundant at 72 hpf, consistent with faster synthesis of cellular proteins during organismal growth at this time compared with 120 hpf. The data are accessible in a database that links protein identifications to existing resources including the Zebrafish Information Network database. This new resource should facilitate the selection of candidate proteins for targeted quantitation and refine systematic genetic network analysis in vertebrate development and biology.
Affiliation(s)
- Margaret B Lucitt
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
|
23
|
Abstract
The oomycetes form a distinct phylogenetic lineage of fungus-like eukaryotic microorganisms that are relatively closely related to photosynthetic algae such as brown algae and diatoms. Plant pathogenic species, notably those of the genus Phytophthora, are the best-studied oomycetes. The genomes of four Phytophthora and one downy mildew species were recently sequenced resulting in novel insights on the evolution and pathogenesis of oomycetes. This review highlights key findings that emerged from these studies and discusses the future challenges for oomycete research.
Affiliation(s)
- Kurt H Lamour
- Department of Entomology and Plant Pathology, The University of Tennessee, Knoxville, TN, USA
|