1
Maia GA, Filho VB, Kawagoe EK, Teixeira Soratto TA, Moreira RS, Grisard EC, Wagner G. AnnotaPipeline: An integrated tool to annotate eukaryotic proteins using multi-omics data. Front Genet 2022; 13:1020100. [DOI: 10.3389/fgene.2022.1020100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Accepted: 11/11/2022] [Indexed: 11/23/2022] Open
Abstract
Assignment of gene function has been a crucial, laborious, and time-consuming step in genomics. Due to a variety of sequencing platforms that generate increasing amounts of data, manual annotation is no longer feasible. Thus, the need for an integrated, automated pipeline allowing the use of experimental data towards validation of in silico prediction of gene function is of utmost relevance. Here, we present a computational workflow named AnnotaPipeline that integrates distinct software and data types in a proteogenomic approach to annotate and validate predicted features in genomic sequences. Starting from FASTA (i) nucleotide or (ii) protein sequences or (iii) structural annotation files (GFF3), users can also supply FASTQ RNA-seq data and MS/MS data in mzXML or similar formats, because the pipeline uses both transcriptomic and proteomic information to corroborate annotations and validate gene prediction, providing transcription and expression evidence for functional annotation. Reannotation of the available Arabidopsis thaliana, Caenorhabditis elegans, Candida albicans, Trypanosoma cruzi, and Trypanosoma rangeli genomes was performed using AnnotaPipeline, resulting in a higher proportion of annotated proteins and a reduced proportion of hypothetical proteins when compared to the annotations publicly available for these organisms. AnnotaPipeline is a Unix-based pipeline developed using Python and is available at: https://github.com/bioinformatics-ufsc/AnnotaPipeline.
2
Vitorino R, Choudhury M, Guedes S, Ferreira R, Thongboonkerd V, Sharma L, Amado F, Srivastava S. Peptidomics and proteogenomics: background, challenges and future needs. Expert Rev Proteomics 2021; 18:643-659. [PMID: 34517741 DOI: 10.1080/14789450.2021.1980388] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/20/2022]
Abstract
INTRODUCTION With available genomic data and related information, it is becoming possible to better highlight mutations or genomic alterations associated with a particular disease or disorder. The advent of high-throughput sequencing technologies has greatly advanced diagnostics, prognostics, and drug development. AREAS COVERED Peptidomics and proteogenomics are the two post-genomic technologies that enable the simultaneous study of peptides and proteins/transcripts/genes. Both technologies add a remarkably large amount of data to the pool of information on various peptides associated with gene mutations or genome remodeling. Literature search was performed in the PubMed database and is up to date. EXPERT OPINION This article lists various techniques used for peptidomic and proteogenomic analyses. It also explains various bioinformatics workflows developed to understand differentially expressed peptides/proteins and their role in disease pathogenesis. Their role in deciphering disease pathways, cancer research, and biomarker discovery using biofluids is highlighted. Finally, the challenges and future requirements to overcome the current limitations for their effective clinical use are also discussed.
Affiliation(s)
- Rui Vitorino
  - Faculdade de Medicina da Universidade do Porto, Porto, Portugal
  - iBiMED, Department of Medical Sciences, University of Aveiro, Aveiro, Portugal
  - LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro, Portugal
- Manisha Choudhury
  - Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Mumbai, Powai, India
- Sofia Guedes
  - LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro, Portugal
- Rita Ferreira
  - LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro, Portugal
- Visith Thongboonkerd
  - Medical Proteomics Unit, Office for Research and Development, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
- Francisco Amado
  - LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro, Portugal
- Sanjeeva Srivastava
  - Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Mumbai, Powai, India
3
Tariq MU, Haseeb M, Aledhari M, Razzak R, Parizi RM, Saeed F. Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey. IEEE Access 2020; 9:5497-5516. [PMID: 33537181 PMCID: PMC7853650 DOI: 10.1109/access.2020.3047588] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 05/17/2023]
Abstract
Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic and transcriptomic data. Due to the biological significance of integrated analysis, the recent past has seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to the genomic data, searching experimental MS spectra against a six-frame translation genome database, and automating the process of annotating genome sequences. To date, most such tools have not focused on scalability issues that are inherent in proteogenomic data analysis, where the size of the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a genome of 3 GB. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, providing a critical analysis of the techniques' relative merits and potential pitfalls. We also point out potential bottlenecks and recommendations that can be incorporated in the future design of these workflows to ensure scalability with the increasing size of proteogenomic data. Lastly, we make a case for how high-performance computing (HPC) solutions may be the best bet to ensure the scalability of future big data proteogenomic data analysis.
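The abstract above mentions searching MS spectra against a six-frame translation of the genome, which is one reason the search database grows well beyond a typical protein database. As a rough illustration only, the following Python sketch translates a DNA string in all six reading frames using the standard genetic code. The function names and the frame labels (`+1` to `-3`) are assumptions made for this example and are not taken from any tool reviewed in the survey.

```python
from itertools import product

# Standard genetic code, codons enumerated with base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```

Each genomic position contributes to six candidate protein sequences, so the search space scales with genome size rather than with the number of annotated genes.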
Affiliation(s)
- Muhammad Usman Tariq
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Muhammad Haseeb
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Mohammed Aledhari
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Rehma Razzak
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Reza M Parizi
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Fahad Saeed
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
4
Caminati G, Procacci P. Mounting evidence of FKBP12 implication in neurodegeneration. Neural Regen Res 2020; 15:2195-2202. [PMID: 32594030 PMCID: PMC7749462 DOI: 10.4103/1673-5374.284980] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 02/03/2020] [Revised: 02/18/2020] [Accepted: 03/24/2020] [Indexed: 12/25/2022] Open
Abstract
Intrinsically disordered proteins, such as tau or α-synuclein, have long been associated with a dysfunctional role in neurodegenerative diseases. In Alzheimer's and Parkinson's diseases, these proteins, sharing a common chemical-physical pattern with alternating hydrophobic and hydrophilic domains rich in prolines, abnormally aggregate in tangles in the brain, leading to progressive loss of neurons. In this review, we present an overview linking the studies on the implication of the peptidyl-prolyl isomerase domain of immunophilins, and notably FKBP12, to a variety of neurodegenerative diseases, focusing on the molecular origin of such a role. The involvement of FKBP12 dysregulation in the aberrant aggregation of disordered proteins pinpoints this protein as a possible therapeutic target and, at the same time, as a predictive biomarker for early diagnosis in neurodegeneration, calling for the development of reliable, fast and cost-effective detection methods in body fluids for community-based screening campaigns.
Affiliation(s)
- Gabriella Caminati
  - Department of Chemistry “Ugo Schiff”, University of Florence, Sesto Fiorentino, Italy
  - Center for Colloid and Surface Science (CSGI), University of Florence, Sesto Fiorentino, Italy
- Piero Procacci
  - Department of Chemistry “Ugo Schiff”, University of Florence, Sesto Fiorentino, Italy
5
Shukla N, Siva N, Malik B, Suravajhala P. Current Challenges and Implications of Proteogenomic Approaches in Prostate Cancer. Curr Top Med Chem 2020; 20:1968-1980. [PMID: 32703135 DOI: 10.2174/1568026620666200722112450] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/17/2020] [Revised: 05/30/2020] [Accepted: 06/29/2020] [Indexed: 12/16/2022]
Abstract
In the recent past, next-generation sequencing (NGS) approaches have heralded the omics era. With NGS data burgeoning, there arose a need to disseminate the omic data better. Proteogenomics has been widely used for characterising the functions of candidate genes and is applied in ascertaining various diseased phenotypes, including cancers. However, not much is known about the role and application of proteogenomics, especially in prostate cancer (PCa). In this review, we outline the need for proteogenomic approaches, their applications and their role in PCa.
Affiliation(s)
- Nidhi Shukla
  - Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, Jaipur 302001, RJ, India
  - Department of Chemistry, School of Basic Sciences, Manipal University Jaipur, Jaipur, India
- Narmadhaa Siva
  - Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, Jaipur 302001, RJ, India
- Babita Malik
  - Department of Chemistry, School of Basic Sciences, Manipal University Jaipur, Jaipur, India
- Prashanth Suravajhala
  - Department of Biotechnology and Bioinformatics, Birla Institute of Scientific Research, Statue Circle, Jaipur 302001, RJ, India
6
Goodhead I, Blow F, Brownridge P, Hughes M, Kenny J, Krishna R, McLean L, Pongchaikul P, Beynon R, Darby AC. Large-scale and significant expression from pseudogenes in Sodalis glossinidius - a facultative bacterial endosymbiont. Microb Genom 2020; 6:e000285. [PMID: 31922467 PMCID: PMC7067036 DOI: 10.1099/mgen.0.000285] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 10/16/2017] [Accepted: 07/10/2019] [Indexed: 01/30/2023] Open
Abstract
The majority of bacterial genomes have high coding efficiencies, but there are some genomes of intracellular bacteria that have low gene density. The genome of the endosymbiont Sodalis glossinidius contains almost 50 % pseudogenes containing mutations that putatively silence them at the genomic level. We have applied multiple 'omic' strategies, combining Illumina and Pacific Biosciences Single-Molecule Real-Time DNA sequencing and annotation, stranded RNA sequencing and proteome analysis to better understand the transcriptional and translational landscape of Sodalis pseudogenes, and potential mechanisms for their control. Between 53 and 74 % of the Sodalis transcriptome remains active in cell-free culture. The mean sense transcription from coding domain sequences (CDSs) is four times greater than that from pseudogenes. Comparative genomic analysis of six Illumina-sequenced Sodalis isolates from different host Glossina species shows pseudogenes make up ~40 % of the 2729 genes in the core genome, suggesting that they are stable and/or that Sodalis is a recent introduction across the genus Glossina as a facultative symbiont. These data shed further light on the importance of transcriptional and translational control in deciphering host-microbe interactions. The combination of genomics, transcriptomics and proteomics gives a multidimensional perspective for studying prokaryotic genomes with a view to elucidating evolutionary adaptation to novel environmental niches.
Affiliation(s)
- Ian Goodhead
  - Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
  - School of Science, Engineering and Environment, Peel Building, University of Salford, M5 4WT, UK
- Frances Blow
  - Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
  - Department of Entomology, Cornell University, Ithaca 14853, NY, USA
- Philip Brownridge
  - Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
- Margaret Hughes
  - Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
  - Centre for Genomic Research, Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
- John Kenny
  - Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
  - Centre for Genomic Research, Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
- Ritesh Krishna
  - Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
  - IBM Research UK, STFC Daresbury Laboratory, Warrington, WA4 4AD, UK
- Lynn McLean
  - Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
- Pisut Pongchaikul
  - Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
- Rob Beynon
  - Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
- Alistair C. Darby
  - Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
  - Centre for Genomic Research, Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
7
Schlaffner CN, Pirklbauer GJ, Bender A, Choudhary JS. Fast, Quantitative and Variant Enabled Mapping of Peptides to Genomes. Cell Syst 2017; 5:152-156.e4. [PMID: 28837811 PMCID: PMC5571441 DOI: 10.1016/j.cels.2017.07.007] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 12/05/2016] [Revised: 03/24/2017] [Accepted: 07/26/2017] [Indexed: 12/24/2022]
Abstract
Current tools for visualization and integration of proteomics with other omics datasets are inadequate for large-scale studies and capture only basic sequence identity information. Furthermore, the frequent reformatting of annotations for reference genomes required by these tools is known to be highly error prone. We developed PoGo for mapping peptides identified through mass spectrometry to overcome these limitations. PoGo reduced runtime and memory usage by 85% and 20%, respectively, and exhibited overall superior performance over other tools on benchmarking with large-scale human tissue and cancer phosphoproteome datasets comprising ∼3 million peptides. In addition, extended functionality enables representation of single-nucleotide variants, post-translational modifications, and quantitative features. PoGo has been integrated in established frameworks such as the PRIDE tool suite and OpenMS, as well as a standalone tool with user-friendly graphical interface. With the rapid increase of quantitative high-resolution datasets capturing proteomes and global modifications to complement orthogonal genomics platforms, PoGo provides a central utility enabling large-scale visualization and interpretation of transomics datasets.
Affiliation(s)
- Christoph N Schlaffner
  - Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1SA, UK
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, Cambridgeshire CB2 1EW, UK
- Georg J Pirklbauer
  - Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1SA, UK
- Andreas Bender
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, Cambridgeshire CB2 1EW, UK
- Jyoti S Choudhary
  - Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1SA, UK
8
Guillot L, Delage L, Viari A, Vandenbrouck Y, Com E, Ritter A, Lavigne R, Marie D, Peterlongo P, Potin P, Pineau C. Peptimapper: proteogenomics workflow for the expert annotation of eukaryotic genomes. BMC Genomics 2019; 20:56. [PMID: 30654742 PMCID: PMC6337836 DOI: 10.1186/s12864-019-5431-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 02/26/2018] [Accepted: 01/03/2019] [Indexed: 01/02/2023] Open
Abstract
Background: Accurate structural annotation of genomes is still a challenge, despite the progress made over the past decade. The prediction of gene structure remains difficult, especially for eukaryotic species, and is often erroneous and incomplete. We used a proteogenomics strategy, taking advantage of the combination of proteomics datasets and bioinformatics tools, to identify novel protein-coding genes and splice isoforms, assign correct start sites, and validate predicted exons and genes. Results: Our proteogenomics workflow, Peptimapper, was applied to the genome annotation of Ectocarpus sp., a key reference genome for both the brown algal lineage and stramenopiles. We generated proteomics data from various life cycle stages of Ectocarpus sp. strains and sub-cellular fractions using a shotgun approach. First, we directly generated peptide sequence tags (PSTs) from the proteomics data. Second, we mapped PSTs onto the translated genomic sequence. Closely located hits (i.e., PST locations on the genome) were then clustered to detect potential coding regions based on parameters optimized for the organism. Third, we evaluated each cluster and compared it to gene predictions from existing conventional genome annotation approaches. Finally, we integrated cluster locations into GFF files to use a genome viewer. We identified two potential novel genes, a ribosomal protein L22 and an aryl sulfotransferase, and corrected the gene structure of a dihydrolipoamide acetyltransferase. We experimentally validated the results by RT-PCR and using transcriptomics data. Conclusions: Peptimapper is a complementary tool for the expert annotation of genomes. It is suitable for any organism and is distributed through a Docker image available on two public bioinformatics docker repositories: Docker Hub and BioShaDock. This workflow is also accessible through the Galaxy framework and for use by non-computer scientists at https://galaxy.protim.eu.
Data are available via ProteomeXchange under identifier PXD010618. Electronic supplementary material The online version of this article (10.1186/s12864-019-5431-9) contains supplementary material, which is available to authorized users.
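The clustering and GFF export steps described in the Peptimapper abstract can be pictured with a small sketch. The Python code below is not Peptimapper's implementation: the `Hit` record, the `max_gap` and `min_hits` parameters, their default values, and the `pst_cluster` source label are all hypothetical choices made for illustration. It groups hits that lie on the same sequence and strand within a fixed distance of each other and writes each group as one GFF3 line.

```python
from collections import namedtuple

# Hypothetical record for one PST hit on the genome (1-based, inclusive).
Hit = namedtuple("Hit", "seqid strand start end")


def cluster_hits(hits, max_gap=3000, min_hits=2):
    """Group hits on the same sequence and strand that lie within max_gap.

    max_gap and min_hits are illustrative defaults, not published values.
    """
    clusters, current, current_end = [], [], None
    for hit in sorted(hits, key=lambda h: (h.seqid, h.strand, h.start)):
        joins = (
            bool(current)
            and hit.seqid == current[0].seqid
            and hit.strand == current[0].strand
            and hit.start - current_end <= max_gap
        )
        if joins:
            current.append(hit)
            current_end = max(current_end, hit.end)
        else:
            # Close the previous group and keep it only if it has enough hits.
            if len(current) >= min_hits:
                clusters.append(current)
            current, current_end = [hit], hit.end
    if len(current) >= min_hits:
        clusters.append(current)
    return clusters


def to_gff3(cluster, cluster_id):
    """Format one cluster as a nine-column, tab-separated GFF3 line."""
    start = min(h.start for h in cluster)
    end = max(h.end for h in cluster)
    attributes = f"ID={cluster_id};hits={len(cluster)}"
    columns = [cluster[0].seqid, "pst_cluster", "region", str(start), str(end),
               ".", cluster[0].strand, ".", attributes]
    return "\t".join(columns)


if __name__ == "__main__":
    example = [
        Hit("chr1", "+", 100, 130),
        Hit("chr1", "+", 500, 530),
        Hit("chr1", "+", 90000, 90030),
        Hit("chr1", "-", 200, 230),
    ]
    for number, cluster in enumerate(cluster_hits(example), start=1):
        print(to_gff3(cluster, f"c{number}"))
```

In this toy input only the first two hits form a cluster; the isolated hits are discarded because they fall below the assumed `min_hits` threshold. Writing clusters as GFF3 lines is what allows them to be loaded next to existing gene predictions in a genome viewer.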
Affiliation(s)
- Laetitia Guillot
  - Univ Rennes, Inserm, EHESP, Irset (Institut de recherche en santé, environnement et travail) - UMR_S 1085, F-35042, Rennes cedex, France
  - Protim, Univ Rennes, F-35042, Rennes cedex, France
- Ludovic Delage
  - Sorbonne Université, UPMC, CNRS, UMR 8227, Integrative Biology of Marine Models, Biological Station, CS 90074, F-29688, Roscoff, France
- Alain Viari
  - INRIA Grenoble-Rhône-Alpes, F-38330, Montbonnot-Saint-Martin, France
- Yves Vandenbrouck
  - University Grenoble Alpes, CEA, Inserm, BIG-BGE, 38000, Grenoble, France
- Emmanuelle Com
  - Univ Rennes, Inserm, EHESP, Irset (Institut de recherche en santé, environnement et travail) - UMR_S 1085, F-35042, Rennes cedex, France
  - Protim, Univ Rennes, F-35042, Rennes cedex, France
- Andrés Ritter
  - Sorbonne Université, UPMC, CNRS, UMR 8227, Integrative Biology of Marine Models, Biological Station, CS 90074, F-29688, Roscoff, France
  - Present address: Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratory of Computational and Quantitative Biology, F-75005, Paris, France
- Régis Lavigne
  - Univ Rennes, Inserm, EHESP, Irset (Institut de recherche en santé, environnement et travail) - UMR_S 1085, F-35042, Rennes cedex, France
  - Protim, Univ Rennes, F-35042, Rennes cedex, France
- Dominique Marie
  - Sorbonne Université, UPMC, CNRS, UMR 8227, Integrative Biology of Marine Models, Biological Station, CS 90074, F-29688, Roscoff, France
- Philippe Potin
  - Sorbonne Université, UPMC, CNRS, UMR 8227, Integrative Biology of Marine Models, Biological Station, CS 90074, F-29688, Roscoff, France
- Charles Pineau
  - Univ Rennes, Inserm, EHESP, Irset (Institut de recherche en santé, environnement et travail) - UMR_S 1085, F-35042, Rennes cedex, France
  - Protim, Univ Rennes, F-35042, Rennes cedex, France
9
Ren Z, Qi D, Pugh N, Li K, Wen B, Zhou R, Xu S, Liu S, Jones AR. Improvements to the Rice Genome Annotation Through Large-Scale Analysis of RNA-Seq and Proteomics Data Sets. Mol Cell Proteomics 2019; 18:86-98. [PMID: 30293062 PMCID: PMC6317475 DOI: 10.1074/mcp.ra118.000832] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 05/03/2018] [Revised: 08/31/2018] [Indexed: 01/22/2023] Open
Abstract
Rice (Oryza sativa) is one of the most important worldwide crops. The genome has been available for over 10 years and has undergone several rounds of annotation. We created a comprehensive database of transcripts from 29 public RNA sequencing data sets, officially predicted genes from Ensembl plants, and common contaminants in which to search for protein-level evidence. We re-analyzed nine publicly accessible rice proteomics data sets. In total, we identified 420K peptide spectrum matches from 47K peptides and 8,187 protein groups. 4168 peptides were initially classed as putative novel peptides (not matching official genes). Following a strict filtration scheme to rule out other possible explanations, we discovered 1,584 high confidence novel peptides. The novel peptides were clustered into 692 genomic loci where our results suggest annotation improvements. 80% of the novel peptides had an ortholog match in the curated protein sequence set from at least one other plant species. For the peptides clustering in intergenic regions (and thus potentially new genes), 101 loci were identified, for which 43 had a high-confidence hit for a protein domain. Our results can be displayed as tracks on the Ensembl genome or other browsers supporting Track Hubs, to support re-annotation of the rice genome.
Affiliation(s)
- Zhe Ren
  - BGI-Shenzhen, Shenzhen 518083, China
- Da Qi
  - BGI-Shenzhen, Shenzhen 518083, China
- Nina Pugh
  - Institute of Integrative Biology, University of Liverpool, Liverpool, L69 7ZB, UK
- Kai Li
  - BGI-Shenzhen, Shenzhen 518083, China
- Bo Wen
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030
- Ruo Zhou
  - BGI-Shenzhen, Shenzhen 518083, China
- Shaohang Xu
  - BGI-Shenzhen, Shenzhen 518083, China
- Siqi Liu
  - BGI-Shenzhen, Shenzhen 518083, China
- Andrew R Jones
  - Institute of Integrative Biology, University of Liverpool, Liverpool, L69 7ZB, UK
10
Low TY, Mohtar MA, Ang MY, Jamal R. Connecting Proteomics to Next‐Generation Sequencing: Proteogenomics and Its Current Applications in Biology. Proteomics 2018; 19:e1800235. [DOI: 10.1002/pmic.201800235] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 06/11/2018] [Revised: 10/09/2018] [Indexed: 12/17/2022]
Affiliation(s)
- Teck Yew Low
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia
- M. Aiman Mohtar
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia
- Mia Yang Ang
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia
- Rahman Jamal
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia
11
Lee PY, Chin SF, Low TY, Jamal R. Probing the colorectal cancer proteome for biomarkers: Current status and perspectives. J Proteomics 2018; 187:93-105. [PMID: 29953962 DOI: 10.1016/j.jprot.2018.06.014] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 04/09/2018] [Revised: 06/13/2018] [Accepted: 06/23/2018] [Indexed: 02/07/2023]
Abstract
Colorectal cancer (CRC) is one of the most prevalent malignancies worldwide. Biomarkers that can facilitate better clinical management of CRC are in high demand to improve patient outcome and to reduce mortality. In this regard, proteomic analysis holds a promising prospect in the hunt of novel biomarkers for CRC and in understanding the mechanisms underlying tumorigenesis. This review aims to provide an overview of the current progress of proteomic research, focusing on discovery and validation of diagnostic biomarkers for CRC. We will summarize the contributions of proteomic strategies to recent discoveries of protein biomarkers for CRC and also briefly discuss the potential and challenges of different proteomic approaches in biomarker discovery and translational applications.
Affiliation(s)
- Pey Yee Lee
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia
- Siok-Fong Chin
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia
- Teck Yew Low
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia
- Rahman Jamal
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia
12
Schlaffner CN, Pirklbauer GJ, Bender A, Steen JAJ, Choudhary JS. A Fast and Quantitative Method for Post-translational Modification and Variant Enabled Mapping of Peptides to Genomes. J Vis Exp 2018. [PMID: 29889196 PMCID: PMC6101353 DOI: 10.3791/57633] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 01/13/2023] Open
Abstract
Cross-talk between genes, transcripts, and proteins is the key to cellular responses; hence, analysis of molecular levels as distinct entities is slowly being extended to integrative studies to enhance the understanding of molecular dynamics within cells. Current tools for the visualization and integration of proteomics with other omics datasets are inadequate for large-scale studies. Furthermore, they only capture basic sequence identity, discarding post-translational modifications and quantitation. To address these issues, we developed PoGo to map peptides with associated post-translational modifications and quantification to reference genome annotation. In addition, the tool was developed to enable the mapping of peptides identified from customized sequence databases incorporating single amino acid variants. While PoGo is a command line tool, the graphical interface PoGoGUI enables non-bioinformatics researchers to easily map peptides to 25 species supported by Ensembl genome annotation. The generated output borrows file formats from the genomics field and, therefore, visualization is supported in most genome browsers. For large-scale studies, PoGo is supported by TrackHubGenerator to create web-accessible repositories of data mapped to genomes that also enable an easy sharing of proteogenomics data. With little effort, this tool can map millions of peptides to reference genomes within only a few minutes, outperforming other available sequence-identity based tools. This protocol demonstrates the best approaches for proteogenomics mapping through PoGo with publicly available datasets of quantitative and phosphoproteomics, as well as large-scale studies.
Affiliation(s)
- Christoph N Schlaffner
  - Department of Neurobiology, F. M. Kirby Neurobiology Center, Boston Children's Hospital, Harvard Medical School
  - Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Wellcome Genome Campus
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge
- Georg J Pirklbauer
  - Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Wellcome Genome Campus
- Andreas Bender
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge
- Judith A J Steen
  - Department of Neurobiology, F. M. Kirby Neurobiology Center, Boston Children's Hospital, Harvard Medical School
- Jyoti S Choudhary
  - Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Wellcome Genome Campus
  - Functional Proteomics Group, Chester Beatty Laboratories, Institute of Cancer Research
13
Menschaert G, Wang X, Jones AR, Ghali F, Fenyö D, Olexiouk V, Zhang B, Deutsch EW, Ternent T, Vizcaíno JA. The proBAM and proBed standard formats: enabling a seamless integration of genomics and proteomics data. Genome Biol 2018; 19:12. [PMID: 29386051 PMCID: PMC5793360 DOI: 10.1186/s13059-017-1377-x] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 06/09/2017] [Accepted: 12/07/2017] [Indexed: 01/23/2023] Open
Abstract
On behalf of The Human Proteome Organization (HUPO) Proteomics Standards Initiative, we introduce here two novel standard data formats, proBAM and proBed, that have been developed to address the current challenges of integrating mass spectrometry-based proteomics data with genomics and transcriptomics information in proteogenomics studies. proBAM and proBed are adaptations of the well-defined, widely used file formats SAM/BAM and BED, respectively, and both have been extended to meet the specific requirements entailed by proteomics data. Therefore, existing popular genomics tools such as SAMtools and Bedtools, and several widely used genome browsers, can already be used to manipulate and visualize these formats "out-of-the-box." We also highlight that a number of specific additional software tools, properly supporting the proteomics information available in these formats, are now available providing functionalities such as file generation, file conversion, and data analysis. All the related documentation, including the detailed file format specifications and example files, are accessible at http://www.psidev.info/probam and at http://www.psidev.info/probed .
Affiliation(s)
- Gerben Menschaert
- Department of Mathematical Modeling, Statistics and Bioinformatics, Ghent University, Coupure links 653, 9000, Gent, Belgium.
| | - Xiaojing Wang
- Greehey Children's Cancer Research Institute, The University of Texas Health Science Center at San Antonio, San Antonio, TX, USA.,Department of Epidemiology and Biostatistics, The University of Texas Health Science Center at San Antonio, San Antonio, TX, USA.
| | - Andrew R Jones
- Institute of Integrative Biology, University of Liverpool, Liverpool, UK
| | - Fawaz Ghali
- Institute of Integrative Biology, University of Liverpool, Liverpool, UK.,School of Computing, Mathematics and Digital Technology, Manchester Metropolitan University, Manchester, M1 5GD, UK
| | - David Fenyö
- Department of Biochemistry and Molecular Pharmacology, New York University School of Medicine, New York, NY, USA.,Institute for Systems Genetics, New York University School of Medicine, New York, NY, USA
| | - Volodimir Olexiouk
- Department of Mathematical Modeling, Statistics and Bioinformatics, Ghent University, Coupure links 653, 9000, Gent, Belgium
| | - Bing Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX, USA.,Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | | | - Tobias Ternent
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Juan Antonio Vizcaíno
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
| |
|
14
|
Menschaert G, Fenyö D. Proteogenomics from a bioinformatics angle: A growing field. Mass Spectrom Rev 2017; 36:584-599. [PMID: 26670565 PMCID: PMC6101030 DOI: 10.1002/mas.21483] [Citation(s) in RCA: 58] [Impact Index Per Article: 7.3] [Received: 04/20/2015] [Accepted: 09/01/2015] [Indexed: 05/16/2023]
Abstract
Proteogenomics is a research area that combines proteomics and genomics in a multi-omics setup using both mass spectrometry and high-throughput sequencing technologies. Currently, the main goals of the field are to aid genome annotation or to unravel the proteome complexity. Mass spectrometry-based identifications of matching or homologous peptides can further refine gene models. The identification of novel proteoforms is also made possible based on detection of novel translation initiation sites (cognate or near-cognate), novel transcript isoforms, sequence variation or novel (small) open reading frames in intergenic or un-translated genic regions by analyzing high-throughput sequencing data from RNAseq or ribosome profiling experiments. Other proteogenomics studies using a combination of proteomics and genomics techniques focus on antibody sequencing, the identification of immunogenic peptides or venom peptides. Over the years, a growing number of bioinformatics tools and databases became available to help streamline these cross-omics studies. Some of these solutions only help in specific steps of the proteogenomics studies, e.g. building custom sequence databases (based on next generation sequencing output) for mass spectrometry fragmentation spectrum matching. Over the last few years a handful of integrative tools also became available that can execute complete proteogenomics analyses. Some of these are presented as stand-alone solutions, whereas others are implemented in a web-based framework such as Galaxy. In this review we aimed at sketching a comprehensive overview of all the bioinformatics solutions that are available for this growing research area. © 2015 Wiley Periodicals, Inc. Mass Spec Rev 36:584-599, 2017.
Affiliation(s)
- Gerben Menschaert
- Lab of Bioinformatics and Computational Genomics, Department of Mathematical Modeling, Statistics and Bioinformatics, Faculty of Bioscience Engineering, Ghent University, Ghent, Belgium
- To whom correspondence should be addressed. Tel: +32 9 264 99 22; Fax: +32 9 264 6220
| | - David Fenyö
- Center for Health Informatics and Bioinformatics and Department of Biochemistry and Molecular Pharmacology, New York University School of Medicine, New York, New York, USA
| |
|
15
|
Komor MA, Pham TV, Hiemstra AC, Piersma SR, Bolijn AS, Schelfhorst T, Delis-van Diemen PM, Tijssen M, Sebra RP, Ashby M, Meijer GA, Jimenez CR, Fijneman RJA. Identification of Differentially Expressed Splice Variants by the Proteogenomic Pipeline Splicify. Mol Cell Proteomics 2017; 16:1850-1863. [PMID: 28747380 DOI: 10.1074/mcp.tir117.000056] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 05/11/2017] [Indexed: 12/20/2022] Open
Abstract
Proteogenomics, i.e. comprehensive integration of genomics and proteomics data, is a powerful approach for identifying novel protein biomarkers. This is especially the case for proteins that differ structurally between disease and control conditions. As tumor development is associated with aberrant splicing, we focus on this rich source of cancer-specific biomarkers. To this end, we developed a proteogenomic pipeline, Splicify, which can detect differentially expressed protein isoforms. Splicify is based on integrating massively parallel RNA sequencing data and tandem mass spectrometry proteomics data to identify protein isoforms resulting from differential splicing between two conditions. Proof of concept was obtained by applying Splicify to RNA sequencing and mass spectrometry data obtained from colorectal cancer cell line SW480, before and after siRNA-mediated downmodulation of the splicing factors SF3B1 and SRSF1. These analyses revealed 2172 and 149 differentially expressed isoforms, respectively, with peptide confirmation upon knock-down of SF3B1 and SRSF1 compared with their controls. Splice variants identified included RAC1, OSBPL3, MKI67, and SYK. One additional sample was analyzed by PacBio Iso-Seq full-length transcript sequencing after SF3B1 downmodulation. This analysis verified the alternative splicing identified by Splicify and in addition identified novel splicing events that were not represented in the human reference genome annotation. Therefore, Splicify offers a validated proteogenomic data analysis pipeline for identification of disease-specific protein biomarkers resulting from mRNA alternative splicing. Splicify is publicly available on GitHub (https://github.com/NKI-TGO/SPLICIFY) and suitable to address basic research questions using pre-clinical model systems as well as translational research questions using patient-derived samples, e.g. to identify clinically relevant biomarkers.
Affiliation(s)
- Malgorzata A Komor
- From the ‡Translational Gastrointestinal Oncology, Department of Pathology, Netherlands Cancer Institute, Amsterdam, the Netherlands.,§Oncoproteomics Laboratory, Department of Medical Oncology, VU University Medical Center, Amsterdam, the Netherlands
| | - Thang V Pham
- §Oncoproteomics Laboratory, Department of Medical Oncology, VU University Medical Center, Amsterdam, the Netherlands
| | - Annemieke C Hiemstra
- From the ‡Translational Gastrointestinal Oncology, Department of Pathology, Netherlands Cancer Institute, Amsterdam, the Netherlands
| | - Sander R Piersma
- §Oncoproteomics Laboratory, Department of Medical Oncology, VU University Medical Center, Amsterdam, the Netherlands
| | - Anne S Bolijn
- From the ‡Translational Gastrointestinal Oncology, Department of Pathology, Netherlands Cancer Institute, Amsterdam, the Netherlands
| | - Tim Schelfhorst
- §Oncoproteomics Laboratory, Department of Medical Oncology, VU University Medical Center, Amsterdam, the Netherlands
| | - Pien M Delis-van Diemen
- From the ‡Translational Gastrointestinal Oncology, Department of Pathology, Netherlands Cancer Institute, Amsterdam, the Netherlands
| | - Marianne Tijssen
- From the ‡Translational Gastrointestinal Oncology, Department of Pathology, Netherlands Cancer Institute, Amsterdam, the Netherlands
| | - Robert P Sebra
- ¶School of Medicine at Mount Sinai, Institute for Genomics and Multiscale Biology, New York, New York
| | | | - Gerrit A Meijer
- From the ‡Translational Gastrointestinal Oncology, Department of Pathology, Netherlands Cancer Institute, Amsterdam, the Netherlands
| | - Connie R Jimenez
- §Oncoproteomics Laboratory, Department of Medical Oncology, VU University Medical Center, Amsterdam, the Netherlands
| | - Remond J A Fijneman
- From the ‡Translational Gastrointestinal Oncology, Department of Pathology, Netherlands Cancer Institute, Amsterdam, the Netherlands;
| |
|
16
|
Vizcaíno JA, Mayer G, Perkins S, Barsnes H, Vaudel M, Perez-Riverol Y, Ternent T, Uszkoreit J, Eisenacher M, Fischer L, Rappsilber J, Netz E, Walzer M, Kohlbacher O, Leitner A, Chalkley RJ, Ghali F, Martínez-Bartolomé S, Deutsch EW, Jones AR. The mzIdentML Data Standard Version 1.2, Supporting Advances in Proteome Informatics. Mol Cell Proteomics 2017; 16:1275-1285. [PMID: 28515314 PMCID: PMC5500760 DOI: 10.1074/mcp.m117.068429] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.8] [Received: 03/17/2017] [Revised: 05/15/2017] [Indexed: 12/31/2022] Open
Abstract
The first stable version of the Proteomics Standards Initiative mzIdentML open data standard (version 1.1) was published in 2012, capturing the outputs of peptide and protein identification software. In the intervening years, the standard has become well-supported in both commercial and open software, as well as a submission and download format for public repositories. Here we report a new release of mzIdentML (version 1.2) that is required to keep pace with emerging practice in proteome informatics. New features have been added to support: (1) scores associated with localization of modifications on peptides; (2) statistics performed at the level of peptides; (3) identification of cross-linked peptides; and (4) support for proteogenomics approaches. In addition, there is now improved support for the encoding of de novo sequencing of peptides, spectral library searches, and protein inference. As a key point, the underlying XML schema has only undergone very minor modifications to simplify as much as possible the transition from version 1.1 to version 1.2 for implementers, but there have been several notable updates to the format specification, implementation guidelines, controlled vocabularies and validation software. mzIdentML 1.2 can be described as backwards compatible, in that reading software designed for mzIdentML 1.1 should function in most cases without adaptation. We anticipate that these developments will provide a continued stable base for software teams working to implement the standard. All the related documentation is accessible at http://www.psidev.info/mzidentml.
Affiliation(s)
- Juan Antonio Vizcaíno
- From the ‡European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
| | - Gerhard Mayer
- §Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, D-44801 Bochum, Germany
| | - Simon Perkins
- ¶Institute of Integrative Biology, University of Liverpool, Liverpool, L69 7ZB, UK
| | - Harald Barsnes
- ‖Proteomics Unit, Department of Biomedicine, University of Bergen, Norway
- **Computational Biology Unit, Department of Informatics, University of Bergen, Norway
- ‡‡KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway
| | - Marc Vaudel
- ‖Proteomics Unit, Department of Biomedicine, University of Bergen, Norway
- ‡‡KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway
- §§Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
| | - Yasset Perez-Riverol
- From the ‡European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
| | - Tobias Ternent
- From the ‡European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
| | - Julian Uszkoreit
- §Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, D-44801 Bochum, Germany
| | - Martin Eisenacher
- §Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, D-44801 Bochum, Germany
| | - Lutz Fischer
- ¶¶Wellcome Trust Centre for Cell Biology, University of Edinburgh, Edinburgh EH9 3BF, United Kingdom
| | - Juri Rappsilber
- ¶¶Wellcome Trust Centre for Cell Biology, University of Edinburgh, Edinburgh EH9 3BF, United Kingdom
- ‖‖Chair of Bioanalytics, Institute of Biotechnology Technische Universität Berlin, 13355 Berlin, Germany
| | - Eugen Netz
- Biomolecular Interactions group, Max Planck Institute for Developmental Biology, Tübingen D-72076, Germany
| | - Mathias Walzer
- Center for Bioinformatics, University of Tübingen, 72076 Tübingen, Germany
| | - Oliver Kohlbacher
- Biomolecular Interactions group, Max Planck Institute for Developmental Biology, Tübingen D-72076, Germany
- Center for Bioinformatics, University of Tübingen, 72076 Tübingen, Germany
- Dept. of Computer Science, University of Tübingen, Germany
- Quantitative Biology Center, University of Tübingen, Germany
| | - Alexander Leitner
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Auguste-Piccard-Hof 1, 8093 Zurich, Switzerland
| | - Robert J Chalkley
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, California, 94143
| | - Fawaz Ghali
- ¶Institute of Integrative Biology, University of Liverpool, Liverpool, L69 7ZB, UK
| | - Salvador Martínez-Bartolomé
- Department of Chemical Physiology, The Scripps Research Institute, 10550, N. Torrey Pines Rd., La Jolla, California, 92037
| | | | - Andrew R Jones
- ¶Institute of Integrative Biology, University of Liverpool, Liverpool, L69 7ZB, UK;
| |
|
17
|
Ruggles KV, Krug K, Wang X, Clauser KR, Wang J, Payne SH, Fenyö D, Zhang B, Mani DR. Methods, Tools and Current Perspectives in Proteogenomics. Mol Cell Proteomics 2017; 16:959-981. [PMID: 28456751 DOI: 10.1074/mcp.mr117.000024] [Citation(s) in RCA: 95] [Impact Index Per Article: 11.9] [Received: 04/13/2017] [Indexed: 12/20/2022] Open
Abstract
With combined technological advancements in high-throughput next-generation sequencing and deep mass spectrometry-based proteomics, proteogenomics, i.e. the integrative analysis of proteomic and genomic data, has emerged as a new research field. Early efforts in the field were focused on improving protein identification using sample-specific genomic and transcriptomic sequencing data. More recently, integrative analysis of quantitative measurements from genomic and proteomic studies has yielded novel insights into gene expression regulation, cell signaling, and disease. Many methods and tools have been developed or adapted to enable an array of integrative proteogenomic approaches, and in this article we systematically classify published methods and tools into four major categories: (1) Sequence-centric proteogenomics; (2) Analysis of proteogenomic relationships; (3) Integrative modeling of proteogenomic data; and (4) Data sharing and visualization. We provide a comprehensive review of methods and available tools in each category and highlight their typical applications.
Affiliation(s)
- Kelly V Ruggles
- From the ‡Department of Medicine, New York University School of Medicine, New York, New York 10016
| | - Karsten Krug
- §The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142
| | - Xiaojing Wang
- ¶Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030.,‖Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030
| | - Karl R Clauser
- §The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142
| | - Jing Wang
- ¶Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030.,‖Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030
| | - Samuel H Payne
- **Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99354
| | - David Fenyö
- ‡‡Department of Biochemistry and Molecular Pharmacology, New York University School of Medicine, New York, New York 10016.,§§Institute for Systems Genetics, New York University School of Medicine, New York, New York 10016
| | - Bing Zhang
- ¶Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, Texas 77030.,‖Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030
| | - D R Mani
- §The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142;
| |
|
18
|
Fu S, Liu X, Luo M, Xie K, Nice EC, Zhang H, Huang C. Proteogenomic studies on cancer drug resistance: towards biomarker discovery and target identification. Expert Rev Proteomics 2017; 14:351-362. [PMID: 28276747 DOI: 10.1080/14789450.2017.1299006] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 02/05/2023]
Abstract
Introduction: Chemoresistance is a major obstacle for current cancer treatment. Proteogenomics is a powerful multi-omics research field that uses customized protein sequence databases generated by genomic and transcriptomic information to identify novel genes (e.g. noncoding, mutation and fusion genes) from mass spectrometry-based proteomic data. By identifying aberrations that are differentially expressed between tumor and normal pairs, this approach can also be applied to validate protein variants in cancer, which may reveal the response to drug treatment. Areas covered: In this review, we will present recent advances in proteogenomic investigations of cancer drug resistance with an emphasis on integrative proteogenomic pipelines and the biomarker discovery which contributes to achieving the goal of using precision/personalized medicine for cancer treatment. Expert commentary: The discovery and comprehensive understanding of potential biomarkers help identify the cohort of patients who may benefit from particular treatments, and will assist real-time clinical decision-making to maximize therapeutic efficacy and minimize adverse effects. With the development of MS-based proteomics and NGS-based sequencing, a growing number of proteogenomic tools are being developed specifically to investigate cancer drug resistance.
Affiliation(s)
- Shuyue Fu
- a State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, and Collaborative Innovation Center for Biotherapy, Chengdu, P.R. China
| | - Xiang Liu
- b Department of Pathology, Sichuan Academy of Medical Sciences, Sichuan Provincial People's Hospital, Chengdu, P.R. China
| | - Maochao Luo
- c West China School of Public Health, Sichuan University, Chengdu, P.R. China
| | - Ke Xie
- d Department of Oncology, Sichuan Academy of Medical Sciences, Sichuan Provincial People's Hospital, Chengdu, P.R. China
| | - Edouard C Nice
- e Department of Biochemistry and Molecular Biology, Monash University, Clayton, Australia
| | - Haiyuan Zhang
- f School of Medicine, Yangtze University, P.R. China
| | - Canhua Huang
- a State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, and Collaborative Innovation Center for Biotherapy, Chengdu, P.R. China
| |
|
19
|
Kumar D, Yadav AK, Dash D. Choosing an Optimal Database for Protein Identification from Tandem Mass Spectrometry Data. Methods Mol Biol 2017; 1549:17-29. [PMID: 27975281 DOI: 10.1007/978-1-4939-6740-7_3] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Indexed: 06/06/2023]
Abstract
Database searching is the preferred method for protein identification from digital spectra of mass to charge ratios (m/z) detected for protein samples through mass spectrometers. The search database is one of the major influencing factors in discovering proteins present in the sample and thus in deriving biological conclusions. In most cases the choice of search database is arbitrary. Here we describe common search databases used in proteomic studies and their impact on the final list of identified proteins. We also elaborate upon factors like composition and size of the search database that can influence the protein identification process. In conclusion, we suggest that the choice of database depends on the type of inferences to be derived from proteomics data. However, making additional efforts to build a compact and concise database for a targeted question should generally be rewarding in achieving confident protein identifications.
Affiliation(s)
- Dhirendra Kumar
- G.N. Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Mathura Road, Delhi, 110025, India
| | - Amit Kumar Yadav
- G.N. Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Mathura Road, Delhi, 110025, India
| | - Debasis Dash
- G.N. Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, Mathura Road, Delhi, 110025, India.
| |
|
20
|
|
21
|
Weisser H, Wright JC, Mudge JM, Gutenbrunner P, Choudhary JS. Flexible Data Analysis Pipeline for High-Confidence Proteogenomics. J Proteome Res 2016; 15:4686-4695. [PMID: 27786492 PMCID: PMC5703597 DOI: 10.1021/acs.jproteome.6b00765] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
Proteogenomics leverages information derived from proteomic data to improve genome annotations. Of particular interest are “novel” peptides that provide direct evidence of protein expression for genomic regions not previously annotated as protein-coding. We present a modular, automated data analysis pipeline aimed at detecting such “novel” peptides in proteomic data sets. This pipeline implements criteria developed by proteomics and genome annotation experts for high-stringency peptide identification and filtering. Our pipeline is based on the OpenMS computational framework; it incorporates multiple database search engines for peptide identification and applies a machine-learning approach (Percolator) to post-process search results. We describe several new and improved software tools that we developed to facilitate proteogenomic analyses that enhance the wealth of tools provided by OpenMS. We demonstrate the application of our pipeline to a human testis tissue data set previously acquired for the Chromosome-Centric Human Proteome Project, which led to the addition of five new gene annotations on the human reference genome.
Affiliation(s)
| | | | | | - Petra Gutenbrunner
- School of Informatics, Communications, and Media, University of Applied Sciences Upper Austria, Hagenberg 4232, Austria
| | | |
|
22
|
Abstract
Omics approaches have become popular in biology as powerful discovery tools, and currently gain in interest for diagnostic applications. Establishing the accurate genome sequence of any organism is easy, but the outcome of its annotation by means of automatic pipelines remains imprecise. Some protein-encoding genes may be missed as soon as they are specific and poorly conserved in a given taxon, while important to explain the specific traits of the organism. Translational starts are also poorly predicted in a relatively important number of cases, thus impacting the protein sequence database used in proteomics, comparative genomics, and systems biology. The use of high-throughput proteomics data to improve genome annotation is an attractive option to obtain a more comprehensive molecular picture of a given organism. Here, protocols for reannotating prokaryote genomes are described based on shotgun proteomics and derivatization of protein N-termini with a positively charged reagent coupled to high-resolution tandem mass spectrometry.
|
23
|
Armstrong SD, Xia D, Bah GS, Krishna R, Ngangyung HF, LaCourse EJ, McSorley HJ, Kengne-Ouafo JA, Chounna-Ndongmo PW, Wanji S, Enyong PA, Taylor DW, Blaxter ML, Wastling JM, Tanya VN, Makepeace BL. Stage-specific Proteomes from Onchocerca ochengi, Sister Species of the Human River Blindness Parasite, Uncover Adaptations to a Nodular Lifestyle. Mol Cell Proteomics 2016; 15:2554-75. [PMID: 27226403 PMCID: PMC4974336 DOI: 10.1074/mcp.m115.055640] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 09/17/2015] [Revised: 04/30/2016] [Indexed: 12/13/2022] Open
Abstract
Despite 40 years of control efforts, onchocerciasis (river blindness) remains one of the most important neglected tropical diseases, with 17 million people affected. The etiological agent, Onchocerca volvulus, is a filarial nematode with a complex lifecycle involving several distinct stages in the definitive host and blackfly vector. The challenges of obtaining sufficient material have prevented high-throughput studies and the development of novel strategies for disease control and diagnosis. Here, we utilize the closest relative of O. volvulus, the bovine parasite Onchocerca ochengi, to compare stage-specific proteomes and host-parasite interactions within the secretome. We identified a total of 4260 unique O. ochengi proteins from adult males and females, infective larvae, intrauterine microfilariae, and fluid from intradermal nodules. In addition, 135 proteins were detected from the obligate Wolbachia symbiont. Observed protein families that were enriched in all whole body extracts relative to the complete search database included immunoglobulin-domain proteins, whereas redox and detoxification enzymes and proteins involved in intracellular transport displayed stage-specific overrepresentation. Unexpectedly, the larval stages exhibited enrichment for several mitochondrial-related protein families, including members of peptidase family M16 and proteins which mediate mitochondrial fission and fusion. Quantification of proteins across the lifecycle using the Hi-3 approach supported these qualitative analyses. In nodule fluid, we identified 94 O. ochengi secreted proteins, including homologs of transforming growth factor-β and a second member of a novel 6-ShK toxin domain family, which was originally described from a model filarial nematode (Litomosoides sigmodontis). Strikingly, the 498 bovine proteins identified in nodule fluid were strongly dominated by antimicrobial proteins, especially cathelicidins. This first high-throughput analysis of an Onchocerca spp. proteome across the lifecycle highlights its profound complexity and emphasizes the extremely close relationship between O. ochengi and O. volvulus. The insights presented here provide new candidates for vaccine development, drug targeting and diagnostic biomarkers.
Affiliation(s)
- Stuart D Armstrong
- From the ‡Institute of Infection & Global Health, University of Liverpool, Liverpool L3 5RF, UK
| | - Dong Xia
- From the ‡Institute of Infection & Global Health, University of Liverpool, Liverpool L3 5RF, UK
| | - Germanus S Bah
- §Institut de Recherche Agricole pour le Développement, Regional Centre of Wakwa, BP65 Ngaoundéré, Cameroon
| | - Ritesh Krishna
- ¶Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
| | - Henrietta F Ngangyung
- §Institut de Recherche Agricole pour le Développement, Regional Centre of Wakwa, BP65 Ngaoundéré, Cameroon
| | - E James LaCourse
- ‖Department of Parasitology, Liverpool School of Tropical Medicine, Liverpool, L3 5QA, UK
| | - Henry J McSorley
- **The Queens Medical Research Institute, University of Edinburgh, Edinburgh, EH16 4JT
| | - Jonas A Kengne-Ouafo
- ‡‡Research Foundation for Tropical Diseases and Environment, PO Box 474 Buea, Cameroon
| | | | - Samuel Wanji
- ‡‡Research Foundation for Tropical Diseases and Environment, PO Box 474 Buea, Cameroon
| | - Peter A Enyong
- ‡‡Research Foundation for Tropical Diseases and Environment, PO Box 474 Buea, Cameroon; §§Tropical Medicine Research Station, Kumba, Cameroon
| | - David W Taylor
- From the ‡Institute of Infection & Global Health, University of Liverpool, Liverpool L3 5RF, UK; ¶¶Division of Pathway Medicine, University of Edinburgh, Edinburgh EH9 3JT, UK
| | - Mark L Blaxter
- ‖‖Institute of Evolutionary Biology, University of Edinburgh, Edinburgh EH9 3JT, UK
| | - Jonathan M Wastling
- From the ‡Institute of Infection & Global Health, University of Liverpool, Liverpool L3 5RF, UK; ‡‡‡The National Institute for Health Research, Health Protection Research Unit in Emerging and Zoonotic Infections, University of Liverpool, Liverpool L3 5RF, UK
| | - Vincent N Tanya
- §Institut de Recherche Agricole pour le Développement, Regional Centre of Wakwa, BP65 Ngaoundéré, Cameroon
| | - Benjamin L Makepeace
- From the ‡Institute of Infection & Global Health, University of Liverpool, Liverpool L3 5RF, UK;
| |
|
24
|
Sheynkman GM, Shortreed MR, Cesnik AJ, Smith LM. Proteogenomics: Integrating Next-Generation Sequencing and Mass Spectrometry to Characterize Human Proteomic Variation. Annu Rev Anal Chem (Palo Alto Calif) 2016; 9:521-45. [PMID: 27049631 PMCID: PMC4991544 DOI: 10.1146/annurev-anchem-071015-041722] [Citation(s) in RCA: 73] [Impact Index Per Article: 8.1] [Indexed: 05/09/2023]
Abstract
Mass spectrometry-based proteomics has emerged as the leading method for detection, quantification, and characterization of proteins. Nearly all proteomic workflows rely on proteomic databases to identify peptides and proteins, but these databases typically contain a generic set of proteins that lack variations unique to a given sample, precluding their detection. Fortunately, proteogenomics enables the detection of such proteomic variations and can be defined, broadly, as the use of nucleotide sequences to generate candidate protein sequences for mass spectrometry database searching. Proteogenomics is experiencing heightened significance due to two developments: (a) advances in DNA sequencing technologies that have made complete sequencing of human genomes and transcriptomes routine, and (b) the unveiling of the tremendous complexity of the human proteome as expressed at the levels of genes, cells, tissues, individuals, and populations. We review here the field of human proteogenomics, with an emphasis on its history, current implementations, the types of proteomic variations it reveals, and several important applications.
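The broad definition given above, using nucleotide sequences to generate candidate protein sequences for database searching, can be illustrated with a small sketch. It translates a nucleotide sequence in the three forward reading frames with the standard genetic code and splits each frame at stop codons. This is an illustration only, not a method taken from the review: the function names are hypothetical, the reverse strand is ignored, and real workflows typically start from sample-specific transcripts or variant calls rather than raw frame translation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def forward_frames(seq):
    """Return the translations of the three forward reading frames."""
    return [translate(seq[offset:]) for offset in range(3)]


def candidate_sequences(seq, min_len=2):
    """Split each frame at stop codons and keep stretches of at least min_len."""
    candidates = []
    for frame in forward_frames(seq):
        for piece in frame.split("*"):
            if len(piece) >= min_len:
                candidates.append(piece)
    return candidates


frames = forward_frames("ATGGCCTAA")
candidates = candidate_sequences("ATGGCCTAA")
```

In practice the `min_len` threshold would be much larger, since very short stretches inflate the search database without adding identifiable peptides.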
Affiliation(s)
- Gloria M Sheynkman
- Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215;
- Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
| | - Michael R Shortreed
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
| | - Anthony J Cesnik
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
| | - Lloyd M Smith
- Department of Chemistry, University of Wisconsin, Madison, Wisconsin 53706
- Genome Center of Wisconsin, University of Wisconsin, Madison, Wisconsin 53706
| |
|
25
|
Proteogenomic Tools and Approaches to Explore Protein Coding Landscapes of Eukaryotic Genomes. Adv Exp Med Biol 2016; 926:1-10. [DOI: 10.1007/978-3-319-42316-6_1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 02/06/2023]
|
26
|
Abstract
Every molecular player in the cast of biology’s central dogma is being sequenced and quantified with increasing ease and coverage. To bring the resulting genomic, transcriptomic, and proteomic data sets into coherence, tools must be developed that do not constrain data acquisition and analytics in any way but rather provide simple links across previously acquired data sets with minimal preprocessing and hassle. Here we present such a tool: PGx, which supports proteogenomic integration of mass spectrometry proteomics data with next-generation sequencing by mapping identified peptides onto their putative genomic coordinates.
Affiliation(s)
- Manor Askenazi
- Biomedical Hosting LLC, 33 Lewis Avenue, Arlington, Massachusetts 02474, United States
| | - Kelly V Ruggles
- NYU Langone Medical Center , 227 East 30th Street, New York, New York 10016, United States
| | - David Fenyö
- NYU Langone Medical Center , 227 East 30th Street, New York, New York 10016, United States
| |
|
27
|
Wang X, Slebos RJC, Chambers MC, Tabb DL, Liebler DC, Zhang B. proBAMsuite, a Bioinformatics Framework for Genome-Based Representation and Analysis of Proteomics Data. Mol Cell Proteomics 2015; 15:1164-75. [PMID: 26657539 PMCID: PMC4813696 DOI: 10.1074/mcp.m115.052860] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 06/16/2015] [Indexed: 01/13/2023]
Abstract
To facilitate genome-based representation and analysis of proteomics data, we developed a new bioinformatics framework, proBAMsuite, in which a central component is the protein BAM (proBAM) file format for organizing peptide spectrum matches (PSMs) within the context of the genome. proBAMsuite also includes two R packages, proBAMr and proBAMtools, for generating and analyzing proBAM files, respectively. Applying proBAMsuite to three recently published proteomics datasets, we demonstrated its utility in facilitating efficient genome-based sharing, interpretation, and integration of proteomics data. First, the interpretation of proteomics data is significantly enhanced with the rich genomic annotation information. Second, PSMs can be easily reannotated using user-specified gene annotation schemes and assembled into both protein and gene identifications. Third, using the genome as a common reference, proBAMsuite facilitates seamless proteomics and proteogenomics data integration. Finally, proBAM files can be readily visualized in genome browsers and thus bring proteomics data analysis to a general audience beyond the proteomics community. Results from this study establish proBAMsuite as a useful bioinformatics framework for proteomics and proteogenomics research.
Affiliation(s)
| | - Robbert J C Slebos
- §Department of Biochemistry, ¶Jim Ayers Institute for Precancer Detection and Diagnosis, Vanderbilt-Ingram Cancer Center, Nashville, TN 37232
| | | | - David L Tabb
- From the ‡Department of Biomedical Informatics, §Department of Biochemistry
| | - Daniel C Liebler
- From the ‡Department of Biomedical Informatics, §Department of Biochemistry, ¶Jim Ayers Institute for Precancer Detection and Diagnosis, Vanderbilt-Ingram Cancer Center, Nashville, TN 37232
| | - Bing Zhang
- From the ‡Department of Biomedical Informatics, ‖Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN 37232;
| |
|
28
|
Zhou T, Sha J, Guo X. The need to revisit published data: A concept and framework for complementary proteomics. Proteomics 2015; 16:6-11. [PMID: 26552962 DOI: 10.1002/pmic.201500170] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/06/2015] [Revised: 08/26/2015] [Accepted: 11/04/2015] [Indexed: 12/14/2022]
Abstract
Tandem proteomic strategies based on large-scale and high-resolution mass spectrometry have been widely applied in various biomedical studies. However, protein sequence databases and proteomic software are continuously updated. Proteomic studies should not be ended with a stable list of proteins. It is necessary and beneficial to regularly revise the results. Besides, the original proteomic studies usually focused on a limited aspect of protein information and valuable information may remain undiscovered in the raw spectra. Several studies have reported novel findings by reanalyzing previously published raw data. However, there are still no standard guidelines for comprehensive reanalysis. In the present study, we proposed the concept and draft framework for complementary proteomics, which are aimed to revise protein list or mine new discoveries by revisiting published data.
Affiliation(s)
- Tao Zhou
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, P. R. China
| | - Jiahao Sha
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, P. R. China
| | - Xuejiang Guo
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, P. R. China
| |
|
29
|
Mayne J, Ning Z, Zhang X, Starr AE, Chen R, Deeke S, Chiang CK, Xu B, Wen M, Cheng K, Seebun D, Star A, Moore JI, Figeys D. Bottom-Up Proteomics (2013-2015): Keeping up in the Era of Systems Biology. Anal Chem 2015; 88:95-121. [PMID: 26558748 DOI: 10.1021/acs.analchem.5b04230] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.0] [Indexed: 02/07/2023]
Affiliation(s)
- Janice Mayne
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| | - Zhibin Ning
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| | - Xu Zhang
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| | - Amanda E Starr
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| | - Rui Chen
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| | - Shelley Deeke
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| | - Cheng-Kang Chiang
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| | - Bo Xu
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| | - Ming Wen
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| | - Kai Cheng
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| | - Deeptee Seebun
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| | - Alexandra Star
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| | - Jasmine I Moore
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| | - Daniel Figeys
- Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa , 451 Smyth Rd., Ottawa, Ontario, Canada , K1H8M5
| |
|
30
|
Kumar D, Yadav AK, Jia X, Mulvenna J, Dash D. Integrated Transcriptomic-Proteomic Analysis Using a Proteogenomic Workflow Refines Rat Genome Annotation. Mol Cell Proteomics 2015; 15:329-39. [PMID: 26560066 DOI: 10.1074/mcp.m114.047126] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Received: 12/23/2014] [Indexed: 11/06/2022]
Abstract
Proteogenomic re-annotation and mRNA splicing information can lead to the discovery of various protein forms for eukaryotic model organisms like rat. However, detection of novel proteoforms using mass spectrometry proteomics data remains a formidable challenge. We developed EuGenoSuite, an open source multiple algorithmic proteomic search tool and utilized it in our in-house integrated transcriptomic-proteomic pipeline to facilitate automated proteogenomic analysis. Using four proteogenomic pipelines (integrated transcriptomic-proteomic, Peppy, Enosi, and ProteoAnnotator) on publicly available RNA-sequence and MS proteomics data, we discovered 363 novel peptides in rat brain microglia representing novel proteoforms for 249 gene loci in the rat genome. These novel peptides aided in the discovery of novel exons, translation of annotated untranslated regions, pseudogenes, and splice variants for various loci; many of which have known disease associations, including neurological disorders like schizophrenia, amyotrophic lateral sclerosis, etc. Novel isoforms were also discovered for genes implicated in cardiovascular diseases and breast cancer for which rats are considered model organisms. Our integrative multi-omics data analysis not only enables the discovery of new proteoforms but also generates an improved reference for human disease studies in the rat model.
Affiliation(s)
- Dhirendra Kumar
- From the ‡G. N. Ramachandran Knowledge Centre for Genome Informatics, Council of Scientific and Industrial Research-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, India
| | - Amit Kumar Yadav
- From the ‡G. N. Ramachandran Knowledge Centre for Genome Informatics, Council of Scientific and Industrial Research-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, India
| | - Xinying Jia
- ¶Infectious Diseases Program, QIMR Berghofer Medical Research Institute, Brisbane, Australia
| | - Jason Mulvenna
- ¶Infectious Diseases Program, QIMR Berghofer Medical Research Institute, Brisbane, Australia
| | - Debasis Dash
- From the ‡G. N. Ramachandran Knowledge Centre for Genome Informatics, Council of Scientific and Industrial Research-Institute of Genomics and Integrative Biology, South Campus, Sukhdev Vihar, New Delhi, India;
| |
|
31
|
Krishna R, Xia D, Sanderson S, Shanmugasundram A, Vermont S, Bernal A, Daniel-Naguib G, Ghali F, Brunk BP, Roos DS, Wastling JM, Jones AR. A large-scale proteogenomics study of apicomplexan pathogens-Toxoplasma gondii and Neospora caninum. Proteomics 2015; 15:2618-28. [PMID: 25867681 PMCID: PMC4692086 DOI: 10.1002/pmic.201400553] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 11/26/2014] [Revised: 02/09/2015] [Accepted: 04/09/2015] [Indexed: 01/08/2023]
Abstract
Proteomics data can supplement genome annotation efforts, for example being used to confirm gene models or correct gene annotation errors. Here, we present a large-scale proteogenomics study of two important apicomplexan pathogens: Toxoplasma gondii and Neospora caninum. We queried proteomics data against a panel of official and alternate gene models generated directly from RNA-Seq data, using several newly generated and some previously published MS datasets for this meta-analysis. We identified a total of 201 996 and 39 953 peptide-spectrum matches for T. gondii and N. caninum, respectively, at a 1% peptide FDR threshold. This equated to the identification of 30 494 distinct peptide sequences and 2921 proteins (matches to official gene models) for T. gondii, and 8911 peptides/1273 proteins for N. caninum following stringent protein-level thresholding. We have also identified 289 and 140 loci for T. gondii and N. caninum, respectively, which mapped to RNA-Seq-derived gene models used in our analysis and apparently absent from the official annotation (release 10 from EuPathDB) of these species. We present several examples in our study where the RNA-Seq evidence can help in correction of the current gene model and can help in discovery of potential new genes. The findings of this study have been integrated into the EuPathDB. The data have been deposited to the ProteomeXchange with identifiers PXD000297 and PXD000298.
Affiliation(s)
- Ritesh Krishna
- Institute of Integrative Biology, University of Liverpool, Liverpool, Merseyside, UK.,Institute of Infection and Global Health, University of Liverpool, Liverpool, Merseyside, UK
| | - Dong Xia
- Institute of Infection and Global Health, University of Liverpool, Liverpool, Merseyside, UK
| | - Sanya Sanderson
- Institute of Infection and Global Health, University of Liverpool, Liverpool, Merseyside, UK
| | - Achchuthan Shanmugasundram
- Institute of Integrative Biology, University of Liverpool, Liverpool, Merseyside, UK.,Institute of Infection and Global Health, University of Liverpool, Liverpool, Merseyside, UK
| | - Sarah Vermont
- Institute of Infection and Global Health, University of Liverpool, Liverpool, Merseyside, UK
| | - Axel Bernal
- Department of Biology, University of Pennsylvania, Philadelphia, PA, USA
| | | | - Fawaz Ghali
- Institute of Integrative Biology, University of Liverpool, Liverpool, Merseyside, UK
| | - Brian P Brunk
- Department of Biology, University of Pennsylvania, Philadelphia, PA, USA
| | - David S Roos
- Department of Biology, University of Pennsylvania, Philadelphia, PA, USA
| | - Jonathan M Wastling
- Institute of Infection and Global Health, University of Liverpool, Liverpool, Merseyside, UK
| | - Andrew R Jones
- Institute of Integrative Biology, University of Liverpool, Liverpool, Merseyside, UK
| |
|