1
Yang H, Zhang H, Gu H, Wang J, Zhang J, Zen K, Li D. Comparative Analyses of Human Exosome Proteomes. Protein J 2023:10.1007/s10930-023-10100-0. [PMID: 36892742 DOI: 10.1007/s10930-023-10100-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/27/2023] [Indexed: 03/10/2023]
Abstract
Exosomes are responsible for cell-to-cell communication and serve as valuable drug delivery vehicles. However, exosome heterogeneity and non-standardized isolation methods and proteomics/bioinformatics approaches limit their clinical applications. To better understand exosome heterogeneity, biological function and the molecular mechanisms of exosome biogenesis, secretion and uptake, proteomics and bioinformatics techniques were applied to investigate the human embryonic kidney cell (293T cell line)-derived exosome proteome and to enable an integrative comparison of exosomal proteins and protein-protein interaction (PPI) networks across eleven exosome proteomes extracted from diverse human samples, including 293T (two datasets), dermal fibroblast, mesenchymal stem cell, thymic epithelial primary cell, breast cancer cell line (MDA-MB-231), patient neuroblastoma cell, plasma, saliva, serum and urine. Mapping of exosome biogenesis/secretion/uptake-related proteins onto exosome proteomes highlights exosomal origin-specific routes of exosome biogenesis/secretion/uptake and exosome-dependent intercellular communication. These findings provide insight into comparative exosome proteomes and exosome biogenesis, secretion and uptake, and potentially contribute to clinical applications.
Affiliation(s)
- Hao Yang
- State Key Laboratory of Pharmaceutical Biotechnology, Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, School of Life Sciences, Nanjing University, 210023, Jiangsu, P.R. China
- Haiyang Zhang
- State Key Laboratory of Pharmaceutical Biotechnology, Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, School of Life Sciences, Nanjing University, 210023, Jiangsu, P.R. China
- Hongwei Gu
- State Key Laboratory of Pharmaceutical Biotechnology, Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, School of Life Sciences, Nanjing University, 210023, Jiangsu, P.R. China
- Jin Wang
- State Key Laboratory of Pharmaceutical Biotechnology, Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, School of Life Sciences, Nanjing University, 210023, Jiangsu, P.R. China
- Junfeng Zhang
- State Key Laboratory of Pharmaceutical Biotechnology, Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, School of Life Sciences, Nanjing University, 210023, Jiangsu, P.R. China
- Ke Zen
- State Key Laboratory of Pharmaceutical Biotechnology, Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, School of Life Sciences, Nanjing University, 210023, Jiangsu, P.R. China
- Donghai Li
- State Key Laboratory of Pharmaceutical Biotechnology, Jiangsu Engineering Research Center for MicroRNA Biology and Biotechnology, School of Life Sciences, Nanjing University, 210023, Jiangsu, P.R. China.
2
Liu M, Inoue K, Leng T, Zhou A, Guo S, Xiong ZG. ASIC1 promotes differentiation of neuroblastoma by negatively regulating Notch signaling pathway. Oncotarget 2018; 8:8283-8293. [PMID: 28030818 PMCID: PMC5352400 DOI: 10.18632/oncotarget.14164] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 10/05/2016] [Accepted: 11/23/2016] [Indexed: 12/25/2022] Open
Abstract
In neurons, up-regulation of Notch activity either inhibits neurite extension or causes retraction of neurites. Conversely, inhibition of Notch1 facilitates neurite extension. Acid-sensing ion channels (ASICs) are a family of proton-gated cation channels that play critical roles in synaptic plasticity, learning and memory, and spine morphogenesis. Our pilot proteomics data from ASIC1a knockout mice suggested that ASIC1a may play a role in regulating Notch signaling; we therefore explored whether ASIC1a regulates neurite growth during neuronal development through Notch signaling. In this study, we determined the effects of ASIC1a on neurite growth in a mouse neuroblastoma cell line, NS20Y cells, by modulating ASIC1a expression. We also determined the relationship between ASIC1a and Notch signaling in neuronal differentiation. Our results showed that down-regulation of ASIC1a in NS20Y cells inhibits CPT-cAMP-induced neurite growth, while overexpression of ASIC1a promotes it. In addition, down-regulation of ASIC1a increased the expression of Notch1 and its target gene Survivin, while an inhibitor of Notch significantly prevented the neurite extension induced by ASIC1a in NS20Y cells. These data indicate that Notch1 signaling may be required for ASIC1a-mediated neurite growth and neuronal differentiation.
Affiliation(s)
- Mingli Liu
- Department of Microbiology, Biochemistry & Immunology, Atlanta, GA 30310, USA
- Koichi Inoue
- Neuroscience Institute, Morehouse School of Medicine, Atlanta, GA 30310, USA
- Tiandong Leng
- Neuroscience Institute, Morehouse School of Medicine, Atlanta, GA 30310, USA
- An Zhou
- Neuroscience Institute, Morehouse School of Medicine, Atlanta, GA 30310, USA
- Shanchun Guo
- Department of Chemistry, RCMI Cancer Research Center, Xavier University of Louisiana, New Orleans, LA 70125, USA
- Zhi-Gang Xiong
- Neuroscience Institute, Morehouse School of Medicine, Atlanta, GA 30310, USA
3
Goto R, Nakamura Y, Takami T, Sanke T, Tozuka Z. Quantitative LC-MS/MS Analysis of Proteins Involved in Metastasis of Breast Cancer. PLoS One 2015; 10:e0130760. [PMID: 26176947 PMCID: PMC4503764 DOI: 10.1371/journal.pone.0130760] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 03/02/2015] [Accepted: 05/22/2015] [Indexed: 12/29/2022] Open
Abstract
The purpose of this study was to develop quantitative liquid chromatography-tandem mass spectrometry (LC-MS/MS) methods for the analysis of proteins involved in metastasis of breast cancer for diagnosis and determining disease prognosis, as well as to further our understanding of metastatic mechanisms. We have previously demonstrated by two-dimensional LC-MS/MS that the protein type XIV collagen may be specifically expressed in metastatic tissues. In this study, we developed quantitative LC-MS/MS methods for type XIV collagen. Type XIV collagen was quantified by analyzing 2 peptides generated by digesting type XIV collagen, using stable isotope-labeled peptides. The individual concentrations were equivalent between 2 different peptides of type XIV collagen by evaluation of imprecise transitions and using the best transition for the peptide concentration. The results indicated that type XIV collagen is highly expressed in metastatic tissues of patients with massive lymph node involvement compared to non-metastatic tissues. These findings were validated by quantitative real-time RT-PCR. Further studies on type XIV collagen are desired to verify its role as a prognostic factor and diagnostic marker for metastasis.
Affiliation(s)
- Rieko Goto
- Department of Clinical Laboratory Medicine, Wakayama Medical University, Wakayama, Japan
- JCL Bioassay Corporation, Nishiwaki, Hyogo, Japan
- Yasushi Nakamura
- Department of Clinical Laboratory Medicine, Wakayama Medical University, Wakayama, Japan
- Tokio Sanke
- Department of Clinical Laboratory Medicine, Wakayama Medical University, Wakayama, Japan
- Zenzaburo Tozuka
- Graduate School of Pharmaceutical Science, Osaka University, Suita, Osaka, Japan
4
Xiong W, Abraham PE, Li Z, Pan C, Hettich RL. Microbial metaproteomics for characterizing the range of metabolic functions and activities of human gut microbiota. Proteomics 2015; 15:3424-38. [PMID: 25914197 DOI: 10.1002/pmic.201400571] [Citation(s) in RCA: 68] [Impact Index Per Article: 7.6] [Received: 12/05/2014] [Revised: 03/08/2015] [Accepted: 04/21/2015] [Indexed: 01/12/2023]
Abstract
The human gastrointestinal tract is a complex, dynamic ecosystem that consists of a carefully tuned balance of human host and microbiota membership. The microbiome is not merely a collection of opportunistic parasites, but rather provides important functions to the host that are absolutely critical to many aspects of health, including nutrient transformation and absorption, drug metabolism, pathogen defense, and immune system development. Microbial metaproteomics provides the ability to characterize the human gut microbiota functions and metabolic activities at a remarkably deep level, revealing information about microbiome development and stability as well as their interactions with their human host. Generally, microbial and human proteins can be extracted and then measured by high performance MS-based proteomics technology. Here, we review the field of human gut microbiome metaproteomics, with a focus on the experimental and informatics considerations involved in characterizing systems ranging from low-complexity model gut microbiota in gnotobiotic mice, to the emerging gut microbiome in the GI tract of newborn human infants, and finally to an established gut microbiota in human adults.
Affiliation(s)
- Weili Xiong
- Chemical Science Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA; Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, Tennessee, USA
- Paul E Abraham
- Chemical Science Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
- Zhou Li
- Chemical Science Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
- Chongle Pan
- Chemical Science Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
- Robert L Hettich
- Chemical Science Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA; Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, Tennessee, USA
5
Xiong W, Giannone RJ, Morowitz MJ, Banfield JF, Hettich RL. Development of an enhanced metaproteomic approach for deepening the microbiome characterization of the human infant gut. J Proteome Res 2014; 14:133-41. [PMID: 25350865 PMCID: PMC4286196 DOI: 10.1021/pr500936p] [Citation(s) in RCA: 65] [Impact Index Per Article: 6.5] [Indexed: 02/07/2023]
Abstract
The establishment of early life microbiota in the human infant gut is highly variable and plays a crucial role in host nutrient availability/uptake and maturation of immunity. Although high-performance mass spectrometry (MS)-based metaproteomics is a powerful method for the functional characterization of complex microbial communities, the acquisition of comprehensive metaproteomic information in human fecal samples is inhibited by the presence of abundant human proteins. To alleviate this restriction, we have designed a novel metaproteomic strategy based on double filtering (DF) the raw samples, a method that fractionates microbial from human cells to enhance microbial protein identification and characterization in complex fecal samples from healthy premature infants. This method dramatically improved the overall depth of infant gut proteome measurement, with an increase in the number of identified low-abundance proteins and a greater than 2-fold improvement in microbial protein identification and quantification. This enhancement of proteome measurement depth enabled a more extensive microbiome comparison between infants by not only increasing the confidence of identified microbial functional categories but also revealing previously undetected categories.
Affiliation(s)
- Weili Xiong
- Chemical Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, United States
6
Kelchtermans P, Bittremieux W, De Grave K, Degroeve S, Ramon J, Laukens K, Valkenborg D, Barsnes H, Martens L. Machine learning applications in proteomics research: how the past can boost the future. Proteomics 2014; 14:353-66. [PMID: 24323524 DOI: 10.1002/pmic.201300289] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Received: 07/12/2013] [Revised: 09/24/2013] [Accepted: 10/14/2013] [Indexed: 01/22/2023]
Abstract
Machine learning is a subdiscipline within artificial intelligence that focuses on algorithms that allow computers to learn to solve a (complex) problem from existing data. This ability can be used to generate a solution to a particularly intractable problem, provided that enough data are available to train and subsequently evaluate an algorithm. Since MS-based proteomics has no shortage of complex problems, and since public data are becoming available in ever-growing amounts, machine learning is fast becoming a very popular tool in the field. We therefore present an overview of the different applications of machine learning in proteomics that together cover nearly the entire wet- and dry-lab workflow, and that address key bottlenecks in experiment planning and design, as well as in data processing and analysis.
Affiliation(s)
- Pieter Kelchtermans
- Department of Medical Protein Research, VIB, Ghent, Belgium; Faculty of Medicine and Health Sciences, Department of Biochemistry, Ghent University, Ghent, Belgium; Flemish Institute for Technological Research (VITO), Boeretang, Mol, Belgium
7
Dykstra AB, St Brice L, Rodriguez M, Raman B, Izquierdo J, Cook KD, Lynd LR, Hettich RL. Development of a multipoint quantitation method to simultaneously measure enzymatic and structural components of the Clostridium thermocellum cellulosome protein complex. J Proteome Res 2013; 13:692-701. [PMID: 24274857 DOI: 10.1021/pr400788e] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Abstract
Clostridium thermocellum has emerged as a leading bioenergy-relevant microbe due to its ability to solubilize cellulose into carbohydrates, mediated by multicomponent membrane-attached complexes termed cellulosomes. To probe microbial cellulose utilization rates, it is desirable to be able to measure the concentrations of saccharolytic enzymes and estimate the total amount of cellulosome present on a mass basis. Current cellulase determination methodologies involve labor-intensive purification procedures and only allow for indirect determination of abundance. We have developed a method using multiple reaction monitoring (MRM-MS) to simultaneously quantitate both enzymatic and structural components of the cellulosome protein complex in samples ranging in complexity from purified cellulosomes to whole cell lysates, as an alternative to a previously developed enzyme-linked immunosorbent assay (ELISA) method of cellulosome quantitation. The precision of the cellulosome mass concentration in technical replicates is better than 5% relative standard deviation for all samples, indicating high precision for determination of the mass concentration of cellulosome components.
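The precision figure quoted above is a relative standard deviation (RSD) across technical replicates. A minimal sketch of that calculation follows; the replicate values are hypothetical placeholders, not data from the study, and the function name is an assumption made for illustration.

```python
from statistics import mean, stdev


def relative_standard_deviation(values):
    """Return the percent RSD: sample standard deviation / |mean| * 100."""
    if len(values) < 2:
        raise ValueError("need at least two replicates")
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; RSD is undefined")
    return stdev(values) / abs(m) * 100.0


# Hypothetical technical replicates (arbitrary mass-concentration units).
replicates = [10.0, 10.2, 9.8]
rsd = relative_standard_deviation(replicates)
print(f"RSD = {rsd:.2f}%")
```

The sample (n - 1) standard deviation is used here because technical replicates are usually few; a study may define its RSD differently.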
Affiliation(s)
- Andrew B Dykstra
- Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831-6131, United States
8
Day RS, McDade KK. A decision theory paradigm for evaluating identifier mapping and filtering methods using data integration. BMC Bioinformatics 2013; 14:223. [PMID: 23855655 PMCID: PMC3734162 DOI: 10.1186/1471-2105-14-223] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 12/30/2012] [Accepted: 07/09/2013] [Indexed: 01/21/2023] Open
Abstract
Background: In bioinformatics, we pre-process raw data into a format ready for answering medical and biological questions. A key step in processing is labeling the measured features with the identities of the molecules purportedly assayed: “molecular identification” (MI). Biological meaning comes from identifying these molecular measurements correctly with actual molecular species. But MI can be incorrect. Identifier filtering (IDF) selects features with more trusted MI, leaving a smaller, but more correct dataset. Identifier mapping (IDM) is needed when an analyst is combining two high-throughput (HT) measurement platforms on the same samples. IDM produces ID pairs, one ID from each platform, where the mapping declares that the two analytes are associated through a causal path, direct or indirect (example: pairing an ID for an mRNA species with an ID for a protein species that is its putative translation). Many competing solutions for IDF and IDM exist. Analysts need a rigorous method for evaluating and comparing all these choices. Results: We describe a paradigm for critically evaluating and comparing IDF and IDM methods, guided by data on biological samples. The requirements are: a large set of biological samples, measurements on those samples from at least two high-throughput platforms, a model family connecting features from the platforms, and an association measure. From these ingredients, one fits a mixture model coupled to a decision framework. We demonstrate this evaluation paradigm in three settings: comparing performance of several bioinformatics resources for IDM between transcripts and proteins, comparing several published microarray probeset IDF methods and their combinations, and selecting optimal quality thresholds for tandem mass spectrometry spectral events. Conclusions: The paradigm outlined here provides a data-grounded approach for evaluating the quality not just of IDM and IDF, but of any pre-processing step or pipeline. The results will help researchers to semantically integrate or filter data optimally, and help bioinformatics database curators to track changes in quality over time and even to troubleshoot causes of MI errors.
Affiliation(s)
- Roger S Day
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA.
9
Hettich RL, Pan C, Chourey K, Giannone RJ. Metaproteomics: harnessing the power of high performance mass spectrometry to identify the suite of proteins that control metabolic activities in microbial communities. Anal Chem 2013; 85:4203-14. [PMID: 23469896 PMCID: PMC3696428 DOI: 10.1021/ac303053e] [Citation(s) in RCA: 140] [Impact Index Per Article: 12.7] [Indexed: 02/07/2023]
Abstract
The availability of extensive genome information for many different microbes, including unculturable species in mixed communities from environmental samples, has enabled systems-biology interrogation by providing a means to access genomic, transcriptomic, and proteomic information. To this end, metaproteomics exploits the power of high-performance mass spectrometry for extensive characterization of the complete suite of proteins expressed by a microbial community in an environmental sample.
10
Dereplicating nonribosomal peptides using an informatic search algorithm for natural products (iSNAP) discovery. Proc Natl Acad Sci U S A 2012; 109:19196-201. [PMID: 23132949 DOI: 10.1073/pnas.1206376109] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.8] [Indexed: 11/18/2022] Open
Abstract
Nonribosomal peptides are highly sought after for their therapeutic applications. As with other natural products, dereplication of known compounds and focused discovery of new agents within this class are central concerns of modern natural product-based drug discovery. A chemoinformatic library-based informatic search strategy for natural products (iSNAP) has been constructed and applied to nonribosomal peptides, and it proved useful for true nontargeted dereplication across a spectrum of nonribosomal peptides and within natural product extracts.
11
Lin W, Wang J, Zhang WJ, Wu FX. An unsupervised machine learning method for assessing quality of tandem mass spectra. Proteome Sci 2012; 10 Suppl 1:S12. [PMID: 22759570 PMCID: PMC3380733 DOI: 10.1186/1477-5956-10-s1-s12] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 12/18/2022] Open
Abstract
Background: In a single proteomic project, tandem mass spectrometers can produce hundreds of millions of tandem mass spectra. However, the majority of tandem mass spectra are of poor quality, and searching them for peptides wastes time. Therefore, quality assessment (before database search) is very useful in the pipeline of protein identification via tandem mass spectra, especially for reducing search time and decreasing false identifications. Most existing methods for quality assessment are supervised machine learning methods based on a number of features that describe the quality of tandem mass spectra. These methods need training datasets in which the quality of all spectra is known, which are usually unavailable for new datasets. Results: This study proposes an unsupervised machine learning method for quality assessment of tandem mass spectra without any training dataset. The proposed method estimates the conditional probabilities of spectra being high quality from the quality assessments based on individual features. The probabilities are estimated through a constrained optimization problem. An efficient algorithm is developed to solve the constrained optimization problem and is proved to be convergent. Experimental results on two datasets illustrate that if we search only tandem spectra of high quality as determined by the proposed method, we can save about 56% and 62% of database searching time while losing only a small amount of high-quality spectra. Conclusions: Results indicate that the proposed method performs well for the quality assessment of tandem mass spectra and that the way we estimate the conditional probabilities is effective.
Affiliation(s)
- Wenjun Lin
- Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Dr, Saskatoon, S7N 5A9, Canada.
12
Johnston C, Ibrahim A, Magarvey N. Informatic strategies for the discovery of polyketides and nonribosomal peptides. MedChemComm 2012. [DOI: 10.1039/c2md20120h] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/19/2022]
Abstract
A modern challenge and opportunity exists in the ability to link genomic and metabolomic data, using novel informatic methods to find new bioactive natural products.
Affiliation(s)
- Chad Johnston
- Department of Biochemistry and Biomedical Sciences
- Department of Chemistry and Chemical Biology
- M. G. DeGroote Institute for Infectious Disease Research
- McMaster University
- Hamilton
- Ashraf Ibrahim
- Department of Biochemistry and Biomedical Sciences
- Department of Chemistry and Chemical Biology
- M. G. DeGroote Institute for Infectious Disease Research
- McMaster University
- Hamilton
- Nathan Magarvey
- Department of Biochemistry and Biomedical Sciences
- Department of Chemistry and Chemical Biology
- M. G. DeGroote Institute for Infectious Disease Research
- McMaster University
- Hamilton
13
Källberg M, Lu H. An improved machine learning protocol for the identification of correct Sequest search results. BMC Bioinformatics 2010; 11:591. [PMID: 21138573 PMCID: PMC3013103 DOI: 10.1186/1471-2105-11-591] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 04/16/2010] [Accepted: 12/07/2010] [Indexed: 11/18/2022] Open
Abstract
Background: Mass spectrometry has become a standard method by which the proteomic profile of cell or tissue samples is characterized. To take full advantage of tandem mass spectrometry (MS/MS) techniques in large-scale protein characterization studies, robust and consistent data analysis procedures are crucial. In this work we present a machine learning-based protocol for the identification of correct peptide-spectrum matches from Sequest database search results, improving on previously published protocols. Results: The developed model improves on published machine learning classification procedures by 6% as measured by the area under the ROC curve. Further, we show how the developed model can be presented as an interpretable tree of additive rules, thereby effectively removing the 'black-box' notion often associated with machine learning classifiers and allowing for comparison with expert rules of thumb. Finally, a method for extending the developed peptide identification protocol to give probabilistic estimates of the presence of a given protein is proposed and tested. Conclusions: We demonstrate the construction of a high-accuracy classification model for Sequest search results from MS/MS spectra obtained using MALDI ionization. The developed model performs well in identifying correct peptide-spectrum matches and is easily extendable to the protein identification problem. The relative ease with which additional experimental parameters can be incorporated into the classification framework, to give additional discriminatory power, allows for future tailoring of the model to take advantage of information from specific instrument set-ups.
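The comparison metric named in this abstract, the area under the ROC curve (AUC), can be computed directly from classifier scores and known labels. The sketch below uses the pairwise (Mann-Whitney) formulation; the scores and labels are made-up illustrative values, and `roc_auc` is a hypothetical helper, not code from the cited work.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a randomly chosen correct match
    scores higher than a randomly chosen incorrect one (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one correct and one incorrect match")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for five peptide-spectrum matches;
# label 1 marks a correct match, 0 an incorrect one.
scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of matches, which is fine for illustration; a rank-based implementation would be preferable for large search results.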
Affiliation(s)
- Morten Källberg
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
14
Nesvizhskii AI. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics 2010; 73:2092-123. [PMID: 20816881 DOI: 10.1016/j.jprot.2010.08.009] [Citation(s) in RCA: 358] [Impact Index Per Article: 25.6] [Received: 07/02/2010] [Revised: 08/25/2010] [Accepted: 08/25/2010] [Indexed: 12/18/2022]
Abstract
This manuscript provides a comprehensive review of the peptide and protein identification process using tandem mass spectrometry (MS/MS) data generated in shotgun proteomic experiments. The commonly used methods for assigning peptide sequences to MS/MS spectra are critically discussed and compared, from basic strategies to advanced multi-stage approaches. Particular attention is paid to the problem of false-positive identifications. Existing statistical approaches for assessing the significance of peptide-to-spectrum matches are surveyed, ranging from single-spectrum approaches such as expectation values to global error rate estimation procedures such as false discovery rates and posterior probabilities. The importance of using auxiliary discriminant information (mass accuracy, peptide separation coordinates, digestion properties, etc.) is discussed, and advanced computational approaches for joint modeling of multiple sources of information are presented. This review also includes a detailed analysis of the issues affecting the interpretation of data at the protein level, including the amplification of error rates when going from the peptide to the protein level, and the ambiguities in inferring the identities of sample proteins in the presence of shared peptides. Commonly used methods for computing protein-level confidence scores are discussed in detail. The review concludes with a discussion of several outstanding computational issues.
15
Yu W, Taylor JA, Davis MT, Bonilla LE, Lee KA, Auger PL, Farnsworth CC, Welcher AA, Patterson SD. Maximizing the sensitivity and reliability of peptide identification in large-scale proteomic experiments by harnessing multiple search engines. Proteomics 2010; 10:1172-89. [PMID: 20101609 DOI: 10.1002/pmic.200900074] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.5] [Indexed: 11/08/2022]
Abstract
Despite recent advances in qualitative proteomics, the automatic identification of peptides with optimal sensitivity and accuracy remains a difficult goal. To address this deficiency, a novel algorithm, Multiple Search Engines, Normalization and Consensus, is described. The method employs six search engines and a re-scoring engine to search MS/MS spectra against protein and decoy sequences. After the peptide hits from each engine are normalized to error rates estimated from the decoy hits, peptide assignments are deduced using a minimum consensus model. These assignments are produced in a series of progressively relaxed false-discovery rates, thus enabling a comprehensive interpretation of the data set. Additionally, the estimated false-discovery rate was found to have good concordance with the observed false-positive rate calculated from known identities. Benchmarking against standard protein data sets (ISBv1, sPRG2006) and their published analyses demonstrated that the Multiple Search Engines, Normalization and Consensus algorithm consistently achieved significantly higher sensitivity in peptide identifications, which led to increased or more robust protein identifications in all data sets compared with prior methods. The sensitivity and the false-positive rate of peptide identification exhibit an inverse-proportional and linear relationship with the number of participating search engines.
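To illustrate the general idea of a minimum consensus across search engines, the sketch below accepts a peptide for a spectrum only when at least `min_votes` engines report the same sequence. This is a simplified voting scheme written for illustration, not the published algorithm: it omits the error-rate normalization step, and the engine names, spectrum IDs, peptide sequences and the `consensus_assignments` function are all hypothetical.

```python
from collections import Counter


def consensus_assignments(engine_hits, min_votes=2):
    """Map spectrum ID -> peptide for spectra where a single peptide is
    reported by at least `min_votes` engines (ties are left unassigned)."""
    votes = {}
    for hits in engine_hits.values():
        for spectrum_id, peptide in hits.items():
            votes.setdefault(spectrum_id, Counter())[peptide] += 1

    accepted = {}
    for spectrum_id, counter in votes.items():
        ranked = counter.most_common(2)
        top_peptide, top_count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        if top_count >= min_votes and not tied:
            accepted[spectrum_id] = top_peptide
    return accepted


# Hypothetical top-ranked hits from three engines.
engine_hits = {
    "engine_a": {"s1": "PEPTIDEK", "s2": "AAAK"},
    "engine_b": {"s1": "PEPTIDEK", "s2": "CCCK"},
    "engine_c": {"s1": "PEPTLDEK", "s3": "GGGK"},
}
result = consensus_assignments(engine_hits, min_votes=2)
print(result)
```

Raising `min_votes` trades sensitivity for reliability, which is the kind of trade-off the abstract describes when relating performance to the number of participating engines.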
Affiliation(s)
- Wen Yu
- Computational Biology, Amgen Inc., Seattle, WA 98119-3105, USA.
16
Yu K, Sabelli A, DeKeukelaere L, Park R, Sindi S, Gatsonis CA, Salomon A. Integrated platform for manual and high-throughput statistical validation of tandem mass spectra. Proteomics 2009; 9:3115-25. [PMID: 19526561 DOI: 10.1002/pmic.200800899] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.2] [Indexed: 11/11/2022]
Abstract
As proteomic data sets increase in size and complexity, the need grows for database-centric software systems able to organize, compare, and visualize all the proteomic experiments in a lab. We recently developed an integrated platform called high-throughput autonomous proteomic pipeline (HTAPP) for the automated acquisition and processing of quantitative proteomic data, and integration of proteomic results with existing external protein information resources within a lab-based relational database called PeptideDepot. Here, we introduce the peptide validation software component of this system, which combines relational database-integrated electronic manual spectral annotation in Java with a new software tool in the R programming language for the generation of logistic regression spectral models from user-supplied validated data sets and flexible application of these user-generated models in automated proteomic workflows. This logistic regression spectral model uses both variables computed directly from SEQUEST output and deterministic variables based on expert manual validation criteria of spectral quality. In the case of linear quadrupole ion trap (LTQ) or LTQ-FTICR LC/MS data, our logistic spectral model outperformed both XCorr (242% more peptides identified on average) and the X!Tandem E-value (87% more peptides identified on average) at a 1% false discovery rate estimated by a decoy database approach.
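The decoy database approach mentioned above estimates the false discovery rate (FDR) from the ratio of decoy to target hits above a score cutoff. A minimal sketch of that estimate follows, assuming the simple decoys/targets estimator and ignoring tied scores; the scores are invented for illustration and `accepted_at_fdr` is a hypothetical helper, not part of HTAPP or PeptideDepot.

```python
def accepted_at_fdr(psms, max_fdr=0.01):
    """Return how many target matches can be accepted while the estimated
    FDR (decoys / targets above the cutoff) stays at or below max_fdr.

    psms: iterable of (score, is_decoy) pairs; higher scores are better.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    best = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep the deepest cutoff that still satisfies the FDR limit.
        if targets and decoys / targets <= max_fdr:
            best = targets
    return best


# Hypothetical peptide-spectrum matches: (score, is_decoy).
psms = [(9.0, False), (8.5, False), (8.0, False),
        (7.5, True), (7.0, False), (6.0, True)]
print(accepted_at_fdr(psms, max_fdr=0.01))
print(accepted_at_fdr(psms, max_fdr=0.25))
```

Comparing two scoring functions at a fixed FDR, as the abstract does, amounts to running this count once per score and comparing the number of accepted target matches.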
Affiliation(s)
- Kebing Yu
- Department of Chemistry, Brown University, Providence, RI 02903, USA
|
17
|
Salmi J, Nyman TA, Nevalainen OS, Aittokallio T. Filtering strategies for improving protein identification in high-throughput MS/MS studies. Proteomics 2009; 9:848-60. [PMID: 19160393 DOI: 10.1002/pmic.200800517] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
Despite the recent advances in streamlining high-throughput proteomic pipelines using tandem mass spectrometry (MS/MS), reliable identification of peptides and proteins on a larger scale has remained a challenging task, still involving a considerable degree of user interaction. Recently, a number of papers have proposed computational strategies both for distinguishing poor MS/MS spectra prior to database search (pre-filtering) and for verifying the peptide identifications made by the search programs (post-filtering). Both of these filtering approaches can be very beneficial to the overall protein identification pipeline, since they can remove a substantial part of the time-consuming manual validation work and convert large sets of MS/MS spectra into more reliable and interpretable proteome information. The choice of the filtering method depends both on the properties of the data and on the goals of the experiment. This review discusses the different pre- and post-filtering strategies available to researchers, together with their relative merits and potential pitfalls. We also highlight some additional research topics, such as spectral denoising and statistical assessment of the identification results, which aim at further improving the coverage and accuracy of high-throughput protein identification studies.
Affiliation(s)
- Jussi Salmi
- Department of Information Technology, University of Turku, Turku, Finland.
|
18
|
Zou AM, Wu FX, Ding JR, Poirier GG. Quality assessment of tandem mass spectra using support vector machine (SVM). BMC Bioinformatics 2009; 10 Suppl 1:S49. [PMID: 19208151 PMCID: PMC2648784 DOI: 10.1186/1471-2105-10-s1-s49] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
Background Tandem mass spectrometry has become particularly useful for the rapid identification and characterization of protein components of complex biological mixtures. Powerful database search methods, such as SEQUEST and MASCOT, have been developed for peptide identification; they compare the mass spectra obtained from unknown proteins or peptides with theoretically predicted spectra derived from protein databases. However, the majority of spectra generated in a mass spectrometry experiment are of too poor quality to be interpreted, while some high-quality spectra cannot be interpreted by one method but perhaps can be by others. Hence a filtering algorithm that removes poor-quality spectra prior to the database search is appealing. Results This paper proposes a support vector machine (SVM) based approach to assess the quality of tandem mass spectra. Each mass spectrum is mapped onto 16 proposed features that describe its quality. Based on the results from SEQUEST, four SVM classifiers taking the 16 features as input are trained and tested on ISB data and TOV data, respectively. The superior performance of the proposed SVM classifiers is illustrated both by comparison with existing classifiers and by validation in terms of MASCOT search results. Conclusion The proposed method can be employed to effectively remove poor-quality spectra before the spectral search, and also to find more peptides or post-translationally modified peptides from high-quality spectra using different search engines or de novo methods.
Affiliation(s)
- An-Min Zou
- Department of Mechanical Engineering, University of Saskatchewan, 57 Campus Dr, Saskatoon, SK, S7N 5A9, Canada.
|
19
|
Shao C, Sun W, Li F, Yang R, Zhang L, Gao Y. Oscore: a combined score to reduce false negative rates for peptide identification in tandem mass spectrometry analysis. JOURNAL OF MASS SPECTROMETRY : JMS 2009; 44:25-31. [PMID: 18698557 DOI: 10.1002/jms.1466] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/26/2023]
Abstract
Tandem mass spectrometry (MS/MS) has been widely used in proteomics studies. Multiple algorithms have been developed for assessing matches between MS/MS spectra and peptide sequences in databases. However, it is still a challenge to reduce false negative rates without compromising the high confidence of peptide identification. In this study, we developed a score, Oscore, by logistic regression using SEQUEST and AMASS variables to identify fully tryptic peptides. Since these variables showed complicated associations with each other, combining them rather than applying them in a threshold model improved the classification of correct and incorrect peptide identifications. Oscore achieved both a lower false negative rate and a lower false positive rate than PeptideProphet on datasets from 18 known protein mixtures and several proteome-scale samples of different complexity, database size and separation methods. By a three-way comparison among Oscore, PeptideProphet and another logistic regression model which made use of PeptideProphet's variables, the main contributor to the improvement made by Oscore is discussed.
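As an illustration of the combined-score idea in this abstract, the following minimal Python sketch fits a logistic regression that merges two search-engine variables into one probability-like score. The feature names (xcorr, delta_cn), the training values, and the plain gradient-descent fit are assumptions made for this example; the abstract does not list the actual Oscore variables or its fitting procedure.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, lr=0.1, epochs=5000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            err = p - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def combined_score(w, x):
    """Probability-like score that a peptide-spectrum match is correct."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


# Hypothetical training set: (xcorr, delta_cn) per match; label 1 = correct.
train_x = [(3.5, 0.40), (3.1, 0.35), (2.8, 0.30), (3.0, 0.28),
           (2.9, 0.05), (1.5, 0.10), (1.2, 0.02), (1.8, 0.04)]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]

weights = fit_logistic(train_x, train_y)
print(combined_score(weights, (3.4, 0.38)))  # strong match, score above 0.5
print(combined_score(weights, (1.3, 0.03)))  # weak match, score below 0.5
```

Ranking matches by a single combined score, rather than applying a separate cutoff to each variable, is the design choice the abstract argues for.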
Affiliation(s)
- Chen Shao
- Department of Physiology and Pathophysiology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing, China
|
20
|
Ding Y, Choi H, Nesvizhskii AI. Adaptive discriminant function analysis and reranking of MS/MS database search results for improved peptide identification in shotgun proteomics. J Proteome Res 2008; 7:4878-89. [PMID: 18788775 DOI: 10.1021/pr800484x] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Robust statistical validation of peptide identifications obtained by tandem mass spectrometry and sequence database searching is an important task in shotgun proteomics. PeptideProphet is a commonly used computational tool that computes confidence measures for peptide identifications. In this paper, we investigate several limitations of the PeptideProphet modeling approach, including the use of fixed coefficients in computing the discriminant search score and selection of the top scoring peptide assignment per spectrum only. To address these limitations, we describe an adaptive method in which a new discriminant function is learned from the data in an iterative fashion. We extend the modeling framework to go beyond the top scoring peptide assignment per spectrum. We also investigate the effect of clustering the spectra according to their spectrum quality score followed by cluster-specific mixture modeling. The analysis is carried out using data acquired from a mixture of purified proteins on four different types of mass spectrometers, as well as using a complex human serum data set. A special emphasis is placed on the analysis of data generated on high mass accuracy instruments.
Affiliation(s)
- Ying Ding
- Department of Pathology, Department of Biostatistics, and Center for Computational Biology and Medicine, University of Michigan, Ann Arbor, Michigan 48109, USA
|
21
|
Koenig T, Menze BH, Kirchner M, Monigatti F, Parker KC, Patterson T, Steen JJ, Hamprecht FA, Steen H. Robust prediction of the MASCOT score for an improved quality assessment in mass spectrometric proteomics. J Proteome Res 2008; 7:3708-17. [PMID: 18707158 DOI: 10.1021/pr700859x] [Citation(s) in RCA: 136] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Protein identification by tandem mass spectrometry is based on the reliable processing of the acquired data. Unfortunately, the generation of a large number of poor quality spectra is commonly observed in LC-MS/MS, and the processing of these mostly noninformative spectra, with its associated costs, should be avoided. We present a continuous quality score that can be computed very quickly and that can be considered an approximation of the MASCOT score in the case of a correct identification. This score can be used to reject low quality spectra prior to database identification, or to draw attention to those spectra that exhibit a (supposedly) high information content, but could not be identified. The proposed quality score can be calibrated automatically on site without the need for a manually generated training set. When this score is turned into a classifier and when features are used that are independent of the instrument, the proposed approach performs equally to previously published classifiers and feature sets and also gives insights into the behavior of the MASCOT score.
Affiliation(s)
- Thomas Koenig
- Interdisciplinary Center for Scientific Computing, University of Heidelberg, 69120 Heidelberg, Germany
|
22
|
Fang J, Dong Y, Williams TD, Lushington GH. Feature selection in validating mass spectrometry database search results. J Bioinform Comput Biol 2008; 6:223-40. [PMID: 18324754 DOI: 10.1142/s0219720008003345] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2007] [Revised: 10/11/2007] [Accepted: 10/26/2007] [Indexed: 11/18/2022]
Abstract
Tandem mass spectrometry (MS/MS) combined with protein database searching has been widely used in protein identification. A validation procedure is generally required to reduce the number of false positives. Advanced tools using statistical and machine learning approaches may provide faster and more accurate validation than manual inspection and empirical filtering criteria. In this study, we use two feature selection algorithms based on random forest and support vector machine to identify peptide properties that can be used to improve validation models. We demonstrate that an improved model based on an optimized set of features reduces the number of false positives by 58% relative to the model which used only search engine scores, at the same sensitivity score of 0.8. In addition, we develop classification models based on the physicochemical properties and protein sequence environment of these peptides without using search engine scores. The performance of the best model based on the support vector machine algorithm is at 0.8 AUC, 0.78 accuracy, and 0.7 specificity, suggesting a reasonably accurate classification. The identified properties important to fragmentation and ionization can be either used in independent validation tools or incorporated into peptide sequencing and database search algorithms to improve existing software programs.
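The performance figures quoted in this abstract (AUC, accuracy, specificity, and results at a fixed sensitivity) can be computed as in the sketch below. The labels, scores, and the 0.45 cutoff are invented for illustration and are not data from the study.

```python
def confusion_metrics(labels, predictions):
    """Accuracy, sensitivity and specificity for binary labels (1 = correct match)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # fraction of correct matches kept
        "specificity": tn / (tn + fp),   # fraction of incorrect matches rejected
    }


def auc(labels, scores):
    """Area under the ROC curve as the fraction of (positive, negative) pairs
    in which the positive example receives the higher score (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical validation-model outputs for six peptide identifications.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
predictions = [1 if s >= 0.45 else 0 for s in scores]

metrics = confusion_metrics(labels, predictions)
print(metrics)
print(auc(labels, scores))
```

Comparing two models "at the same sensitivity", as the abstract does, amounts to choosing each model's cutoff so that the sensitivity values match and then comparing the remaining false positives.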
Affiliation(s)
- Jianwen Fang
- Bioinformatics Core Facility & Information and Telecommunication Technology Center, University of Kansas, 2099 Constant Dr., Lawrence, Kansas 66047, USA.
|
23
|
Allmer J, Kuhlgert S, Hippler M. 2DB: a Proteomics database for storage, analysis, presentation, and retrieval of information from mass spectrometric experiments. BMC Bioinformatics 2008; 9:302. [PMID: 18605993 PMCID: PMC2475538 DOI: 10.1186/1471-2105-9-302] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2008] [Accepted: 07/07/2008] [Indexed: 11/26/2022] Open
Abstract
Background The amount of information stemming from proteomics experiments involving (multidimensional) separation techniques, mass spectrometric analysis, and computational analysis is ever-increasing. Data from such an experimental workflow needs to be captured, related and analyzed. Biological experiments within this scope produce heterogeneous data ranging from pictures of one- or two-dimensional protein maps and spectra recorded by tandem mass spectrometry to text-based identifications made by algorithms which analyze these spectra. Additionally, peptide and corresponding protein information needs to be displayed. Results In order to handle the large amount of data from computational processing of mass spectrometric experiments, automatic import scripts are available and the necessity for manual input to the database has been minimized. Information is in a generic format which abstracts from specific software tools typically used in such an experimental workflow. The software is therefore capable of storing and cross-analysing results from many algorithms. A novel feature and a focus of this database is to facilitate protein identification by using peptides identified from mass spectrometry and linking this information directly to respective protein maps. Additionally, our application employs spectral counting for quantitative presentation of the data. All information can be linked to hot spots on images to place the results into an experimental context. A summary of identified proteins, containing all relevant information per hot spot, is automatically generated, usually upon either a change in the underlying protein models or the import of new identifications. The supporting information for this report can be accessed in multiple ways using the user interface provided by the application.
Conclusion We present a proteomics database which aims to greatly reduce evaluation time of results from mass spectrometric experiments and enhance result quality by allowing consistent data handling. Import functionality, automatic protein detection, and summary creation act together to facilitate data analysis. In addition, supporting information for these findings is readily accessible via the graphical user interface provided. The database schema and the implementation, which can easily be installed on virtually any server, can be downloaded in the form of a compressed file from our project webpage.
Affiliation(s)
- Jens Allmer
- Institute for Plant Biochemistry and Biotechnology, University of Münster, Hindenburgplatz 55, Münster, Germany.
|
24
|
Toward high-throughput and reliable peptide identification via MS/MS spectra. Methods Mol Biol 2008. [PMID: 18592190 DOI: 10.1007/978-1-59745-398-1_22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register]
Abstract
One fundamental problem in proteomics study is to identify proteins and determine their expression levels in cells. Coupled with advanced liquid chromatography, tandem mass spectrometry has become the standard tool for peptide sequencing. In the past decade, many different algorithms and software packages have been developed to support high-throughput proteomics studies. This chapter reviews and compares the computational methods and software for the interpretation of tandem mass spectra. We also present techniques to assess the reliability of peptide identification. Finally, future directions and new research paradigms in tandem mass spectrometry are discussed.
|
25
|
Martínez-Bartolomé S, Navarro P, Martín-Maroto F, López-Ferrer D, Ramos-Fernández A, Villar M, García-Ruiz JP, Vázquez J. Properties of average score distributions of SEQUEST: the probability ratio method. Mol Cell Proteomics 2008; 7:1135-45. [PMID: 18303013 DOI: 10.1074/mcp.m700239-mcp200] [Citation(s) in RCA: 117] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
High throughput identification of peptides in databases from tandem mass spectrometry data is a key technique in modern proteomics. Common approaches to interpret large scale peptide identification results are based on the statistical analysis of average score distributions, which are constructed from the set of best scores produced by large collections of MS/MS spectra by using search engines such as SEQUEST. Other approaches calculate individual peptide identification probabilities on the basis of theoretical models or from single-spectrum score distributions constructed from the set of scores produced by each MS/MS spectrum. In this work, we study the mathematical properties of average SEQUEST score distributions by introducing the concept of spectrum quality and expressing these average distributions as compositions of single-spectrum distributions. We predict and demonstrate in practice that average score distributions are dominated by the quality distribution in the spectra collection, except in the low probability region, where it is possible to predict the dependence of average probability on database size. Our analysis leads to a novel indicator, the probability ratio, which takes optimally into account the statistical information provided by the first and second best scores. The probability ratio is a non-parametric and robust indicator that makes spectra classification according to parameters such as charge state unnecessary and allows a peptide identification performance, on the basis of false discovery rates, that is better than that obtained by other empirical statistical approaches. The probability ratio also compares favorably with statistical probability indicators obtained by the construction of single-spectrum SEQUEST score distributions.
These results make the robustness, conceptual simplicity, and ease of automation of the probability ratio algorithm a very attractive alternative to determine peptide identification confidences and error rates in high throughput experiments.
Affiliation(s)
- Salvador Martínez-Bartolomé
- Protein Chemistry and Proteomics Laboratory, Centro de Biología Molecular "Severo Ochoa"-Consejo Superior de Investigaciones Científicas, 28049 Cantoblanco, Madrid, Spain
|
26
|
Abstract
Protein identification using mass spectrometry is an indispensable computational tool in the life sciences. A dramatic increase in the use of proteomic strategies to understand the biology of living systems generates an ongoing need for more effective, efficient, and accurate computational methods for protein identification. A wide range of computational methods, each with various implementations, are available to complement different proteomic approaches. A solid knowledge of the range of algorithms available and, more critically, the accuracy and effectiveness of these techniques is essential to ensure as many of the proteins as possible, within any particular experiment, are correctly identified. Here, we undertake a systematic review of the currently available methods and algorithms for interpreting, managing, and analyzing biological data associated with protein identification. We summarize the advances in computational solutions as they have responded to corresponding advances in mass spectrometry hardware. The evolution of scoring algorithms and metrics for automated protein identification is also discussed, with a focus on the relative performance of different techniques. We also consider the relative advantages and limitations of different techniques in particular biological contexts. Finally, we present our perspective on future developments in the area of computational protein identification by considering the most recent literature on new and promising approaches to the problem as well as identifying areas yet to be explored and the potential application of methods from other areas of computational biology.
|
27
|
Zhang J, Li J, Liu X, Xie H, Zhu Y, He F. A nonparametric model for quality control of database search results in shotgun proteomics. BMC Bioinformatics 2008; 9:29. [PMID: 18205957 PMCID: PMC2267700 DOI: 10.1186/1471-2105-9-29] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/05/2007] [Accepted: 01/21/2008] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Analysis of complex samples with tandem mass spectrometry (MS/MS) has become routine in proteomic research. However, validation of database search results creates a bottleneck in MS/MS data processing. Recently, methods based on a randomized database have become popular for quality control of database search results. However, it remains unclear how to combine different database search scores to improve the sensitivity of randomized database methods. RESULTS In this paper, a multivariate nonlinear discriminant function (DF) based on the multivariate nonparametric density estimation technique was used to filter out false-positive database search results with a predictable false positive rate (FPR). Application of this method to control datasets of different instruments (LCQ, LTQ, and LTQ/FT) yielded an estimated FPR close to the actual FPR. As expected, the method was more sensitive when more features were used. Furthermore, the new method was shown to be more sensitive than two commonly used methods on 3 complex sample datasets and 3 control datasets. CONCLUSION Using the nonparametric model, a more flexible DF can be obtained, resulting in improved sensitivity and good FPR estimation. This nonparametric statistical technique is a powerful tool for tackling the complexity and diversity of datasets in shotgun proteomics.
Affiliation(s)
- Jiyang Zhang
- College of Mechanical & Electronic Engineering and Automatization, National University of Defense Technology, Changsha, 410073, China.
|
28
|
Zhang J, Li J, Xie H, Zhu Y, He F. A new strategy to filter out false positive identifications of peptides in SEQUEST database search results. Proteomics 2008; 7:4036-44. [PMID: 17952874 DOI: 10.1002/pmic.200600929] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Based on the randomized database method and a linear discriminant function (LDF) model, a new strategy to filter out false positive matches in SEQUEST database search results is proposed. Given an experimental MS/MS dataset and a protein sequence database, a randomized database is constructed and merged with the original database. Then, all MS/MS spectra are searched against the combined database. For each expected false positive rate (FPR), LDFs are constructed for different charge states and used to filter out the false positive matches from the normal database. In order to investigate the error of FPR estimation, the new strategy was applied to a reference dataset. As a result, the estimated FPR was very close to the actual FPR. When applied to a human K562 cell line dataset, which is a complicated dataset from a real sample, more matches could be confirmed than with the traditional cutoff-based methods at the same estimated FPR. Also, though most of the results confirmed by the LDF model were consistent with those of PeptideProphet, the LDF model could still provide complementary information. These results indicate that the new method can reliably control the FPR of peptide identifications and is more sensitive than traditional cutoff-based methods.
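The combined-database procedure described in this abstract can be sketched in Python as follows. This is a simplified illustration, not the authors' method: it uses a single score per match instead of charge-state-specific LDFs, and it assumes the common estimator FPR = 2 x (randomized hits) / (all accepted hits), which the abstract does not state. The match list is invented.

```python
def estimate_fpr(psms, threshold):
    """Estimated false positive rate among matches scoring >= threshold.

    psms is a list of (score, is_randomized) pairs from a search against the
    combined normal + randomized database. Assumption: every randomized hit
    implies one equally likely false hit in the normal database.
    """
    accepted = [is_randomized for score, is_randomized in psms if score >= threshold]
    if not accepted:
        return 0.0
    return min(1.0, 2.0 * sum(accepted) / len(accepted))


def lowest_threshold_for_fpr(psms, max_fpr):
    """Most permissive score cutoff whose estimated FPR stays within max_fpr."""
    best = None
    for t in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fpr(psms, t) <= max_fpr:
            best = t
    return best


# Hypothetical search results: (score, matched the randomized database?).
psms = [(5.0, False), (4.5, False), (4.0, False), (3.5, True),
        (3.0, False), (2.5, False), (2.0, True), (1.5, True)]

print(estimate_fpr(psms, 3.5))               # 1 randomized hit among 4 accepted
print(lowest_threshold_for_fpr(psms, 0.35))  # cutoff for an expected FPR of 0.35
```

In the strategy the abstract describes, the score passed to such a filter would be the output of the LDF built for the relevant charge state rather than a raw search score.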
Affiliation(s)
- Jiyang Zhang
- College of Mechanical and Electronic Engineering and Automatization, National University of Defense Technology, Changsha, China
|
29
|
Choi H, Nesvizhskii AI. Semisupervised Model-Based Validation of Peptide Identifications in Mass Spectrometry-Based Proteomics. J Proteome Res 2008; 7:254-65. [DOI: 10.1021/pr070542g] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
30
|
Tsur D, Tanner S, Zandi E, Bafna V, Pevzner PA. Identification of post-translational modifications via blind search of mass-spectra. PROCEEDINGS. IEEE COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE 2007:157-66. [PMID: 16447973 DOI: 10.1109/csb.2005.34] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Post-translational modifications (PTMs) are of great biological importance. Most existing approaches perform a restrictive search that can only take into account a few types of PTMs and ignore all others. We describe an unrestrictive PTM search algorithm that searches for all types of PTMs at once in a blind mode, i.e., without knowing which PTMs exist in a sample. The blind PTM identification opens a possibility to study the extent and frequencies of different types of PTMs, still an open problem in proteomics. Using our new algorithm, we were able to construct a two-dimensional PTM frequency matrix that reflects the number of MS/MS spectra in a sample for each putative PTM type and each amino acid. Application of this approach to a large IKKb dataset resulted in the largest set of PTMs reported for a single MS/MS sample so far. We demonstrate an excellent correlation between high values in the PTM frequency matrix and known PTMs thus validating our approach. We further argue that the PTM frequency matrix may reveal some still unknown modifications that warrant further experimental validation.
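The two-dimensional PTM frequency matrix mentioned in this abstract can be pictured with a short Python sketch. It only shows the counting step: each blind-search result is assumed to arrive as a (mass offset in Da, modified residue) pair, offsets are binned to the nearest integer, and the example assignments are invented. The search algorithm that produces those pairs is not reproduced here.

```python
from collections import Counter


def ptm_frequency_matrix(assignments):
    """Count spectra per (integer mass offset, amino acid) cell.

    assignments: iterable of (mass_offset_da, residue) pairs, one per
    MS/MS spectrum in which a putative modification was found.
    """
    matrix = Counter()
    for offset, residue in assignments:
        matrix[(round(offset), residue)] += 1
    return matrix


def frequent_cells(matrix, min_count):
    """Cells supported by at least min_count spectra, highest counts first."""
    cells = [(cell, n) for cell, n in matrix.items() if n >= min_count]
    return sorted(cells, key=lambda item: (-item[1], item[0]))


# Hypothetical blind-search output for seven spectra.
assignments = [(15.995, "M"), (15.990, "M"), (79.970, "S"), (42.010, "K"),
               (15.995, "W"), (79.966, "T"), (79.970, "S")]

matrix = ptm_frequency_matrix(assignments)
for (offset, residue), count in frequent_cells(matrix, 2):
    print(f"+{offset} Da on {residue}: {count} spectra")
```

High-count cells such as +16 Da on methionine (oxidation) or +80 Da on serine (phosphorylation) correspond to well-known modifications, which is the kind of correlation the abstract uses to validate the approach.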
Affiliation(s)
- Dekel Tsur
- Computer Science and Engineering, University of California at San Diego, USA.
|
31
|
Kolker E, Higdon R, Hogan JM. Protein identification and expression analysis using mass spectrometry. Trends Microbiol 2006; 14:229-35. [PMID: 16603360 DOI: 10.1016/j.tim.2006.03.005] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2005] [Revised: 03/02/2006] [Accepted: 03/22/2006] [Indexed: 11/28/2022]
Abstract
The identification and quantification of the proteins that a whole organism expresses under certain conditions is a main focus of high-throughput proteomics. Advanced proteomics approaches generate new biologically relevant data and potent hypotheses. A practical report of what proteome studies can and cannot accomplish in common laboratory settings is presented here. The review discusses the most popular tandem mass-spectrometry-based methods and focuses on how to produce reliable results. A step-by-step description of proteome experiments is given, including sample preparation, digestion, labeling, liquid chromatography, data processing, database searching and statistical analysis. The difficulties and bottlenecks of proteome analysis are addressed and the requirements for further improvements are discussed. Several diverse high-throughput proteomics-based studies of microorganisms are described.
Affiliation(s)
- Eugene Kolker
- The BIATECH Institute, 19310 North Creek Parkway, Suite 115, Bothell, WA 98011, USA.
|
32
|
Wu FX, Gagné P, Droit A, Poirier GG. RT-PSM, a real-time program for peptide-spectrum matching with statistical significance. RAPID COMMUNICATIONS IN MASS SPECTROMETRY : RCM 2006; 20:1199-208. [PMID: 16541396 DOI: 10.1002/rcm.2435] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/07/2023]
Abstract
The analysis of complex biological peptide mixtures by tandem mass spectrometry (MS/MS) produces a huge body of collision-induced dissociation (CID) MS/MS spectra. Several methods have been developed for identifying peptide-spectrum matches (PSMs) by assigning MS/MS spectra to peptides in a database. However, most of these methods either do not give the statistical significance of PSMs (e.g., SEQUEST) or employ time-consuming computational methods to estimate the statistical significance (e.g., PeptideProphet). In this paper, we describe a new algorithm, RT-PSM, which can be used to identify PSMs and estimate their accuracy statistically in real time. RT-PSM first computes PSM scores between an MS/MS spectrum and a set of candidate peptides whose masses are within a preset tolerance of the MS/MS precursor ion mass. Then the computed PSM scores of all candidate peptides are employed to fit the expectation value distribution of the scores into a second-degree polynomial function in PSM score. The statistical significance of the best PSM is estimated by extrapolating the fitting polynomial function to the best PSM score. RT-PSM was tested on two pairs of MS/MS spectrum datasets and protein databases to investigate its performance. The MS/MS spectra were acquired using an ion trap mass spectrometer equipped with a nano-electrospray ionization source. The results show that RT-PSM has good sensitivity and specificity. Using a 55,577-entry protein database and running on a standard Pentium-4, 2.8-GHz CPU personal computer, RT-PSM can process peptide spectra on a sequential, one-by-one basis in 0.047 s on average, compared to more than 7 s per spectrum on average for Sequest and X!Tandem, in their current batch-mode processing implementations. RT-PSM is clearly shown to be fast enough for real-time PSM assignment of MS/MS spectra generated every 3 s or so by a 3D ion trap or by a QqTOF instrument.
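The two steps this abstract outlines, selecting candidates by precursor mass and extrapolating a second-degree polynomial fit to the best score, can be sketched in Python as below. Several details are assumptions made for the example and are not given in the abstract: the fitted quantity is taken to be the log10 of the number of non-best candidates scoring at or above each score, the fit is ordinary least squares, and all masses and scores are invented.

```python
import math


def candidates_within_tolerance(peptide_masses, precursor_mass, tol):
    """Keep candidate peptide masses within +/- tol of the precursor mass."""
    return [m for m in peptide_masses if abs(m - precursor_mass) <= tol]


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x^2; returns (a, b, c)."""
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented normal-equation matrix, solved by Gauss-Jordan elimination.
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return tuple(a[i][3] / a[i][i] for i in range(3))


def extrapolated_log_expectation(scores):
    """Extrapolate the fitted curve to the best score.

    Returns an estimate of log10(expected number of random candidates
    scoring at least as well as the best one); lower means more significant.
    """
    ordered = sorted(scores, reverse=True)
    best, rest = ordered[0], ordered[1:]
    distinct = sorted(set(rest))
    if len(distinct) < 3:
        raise ValueError("need at least three distinct non-best scores")
    # Assumed survival counts: non-best candidates scoring >= s.
    log_counts = [math.log10(sum(1 for r in rest if r >= s)) for s in distinct]
    a, b, c = fit_quadratic(distinct, log_counts)
    return a + b * best + c * best * best


# Hypothetical scores of all candidate peptides for one spectrum.
scores = [10, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4]
print(extrapolated_log_expectation(scores))  # negative: best match stands out
print(candidates_within_tolerance([999.5, 1000.2, 1003.0], 1000.0, 0.5))
```

Because the fit uses only the scores already computed for one spectrum, significance can be estimated per spectrum without a separate batch-mode modelling pass, which is consistent with the real-time goal stated in the abstract.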
Affiliation(s)
- Fang-Xiang Wu
- Health and Environment Unit, Laval University Medical Research Center (CHUL), Faculty of Medicine, 2705 Boul. Laurier, Quebec City, QC Canada, G1V 4G2
|
33
|
Salmi J, Moulder R, Filén JJ, Nevalainen OS, Nyman TA, Lahesmaa R, Aittokallio T. Quality classification of tandem mass spectrometry data. Bioinformatics 2005; 22:400-6. [PMID: 16352652 DOI: 10.1093/bioinformatics/bti829] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Peptide identification by tandem mass spectrometry is an important tool in proteomic research. Powerful identification programs exist, such as SEQUEST, ProICAT and Mascot, which can relate experimental spectra to the theoretical ones derived from protein databases, thus removing much of the manual input needed in the identification process. However, the time-consuming validation of the peptide identifications is still the bottleneck of many proteomic studies. One way to further streamline this process is to remove those spectra that are unlikely to provide a confident or valid peptide identification, and in this way to reduce the labour in the validation phase. RESULTS We propose a prefiltering scheme for evaluating the quality of spectra before the database search. The spectra are classified into two classes: spectra which contain valuable information for peptide identification and spectra that are not derived from peptides or contain insufficient information for interpretation. The different spectral features developed for the classification are tested on real-life material originating from human lymphoblast samples and on a standard mixture of 9 proteins, both labelled with the ICAT-reagent. The results show that the prefiltering scheme efficiently separates the two spectra classes.
Affiliation(s)
- Jussi Salmi
- Department of Information Technology and Turku Centre for Computer Science, University of Turku, Finland.
34
Tsur D, Tanner S, Zandi E, Bafna V, Pevzner PA. Identification of post-translational modifications by blind search of mass spectra. Nat Biotechnol 2005; 23:1562-7. [PMID: 16311586 DOI: 10.1038/nbt1168] [Citation(s) in RCA: 220] [Impact Index Per Article: 11.6] [Received: 06/06/2005] [Accepted: 10/20/2005] [Indexed: 11/09/2022]
Abstract
Most tandem mass spectrometry (MS/MS) database search algorithms perform a restrictive search that takes into account only a few types of post-translational modifications (PTMs) and ignores all others. We describe an unrestrictive PTM search algorithm, MS-Alignment, that searches for all types of PTMs at once in a blind mode, that is, without knowing which PTMs exist in nature. Blind PTM identification makes it possible to study the extent and frequency of different types of PTMs, still an open problem in proteomics. Application of this approach to lens proteins resulted in the largest set of PTMs reported in human crystallins so far. Our analysis of various MS/MS data sets implies that the biological phenomenon of modification is much more widespread than previously thought. We also argue that MS-Alignment reveals some uncharacterized modifications that warrant further experimental validation.
Affiliation(s)
- Dekel Tsur
- Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, California 92093-0404, USA
35
Panchaud A, Kussmann M, Affolter M. Rapid enrichment of bioactive milk proteins and iterative, consolidated protein identification by multidimensional protein identification technology. Proteomics 2005; 5:3836-46. [PMID: 16145709 DOI: 10.1002/pmic.200401236] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Indexed: 11/07/2022]
Abstract
Direct injection of complex protein mixtures, e.g. those derived from crude biological fluids, is often incompatible with conventional LC supports, because of column clogging and rapid deterioration of chromatographic performance. In this paper, we report the use of restricted access media to rapidly enrich and fractionate human breast milk. This resin, combining size exclusion and anion exchange functionalities, yielded a fraction enriched in soluble CD14 and showing specific sCD14-dependent activity. This fraction was split into five aliquots, which were individually characterized using multidimensional protein identification technology. Reproducibility of the results was addressed by analysing and comparing five datasets using different protein identification tools available within the Sequest software. Furthermore, a comparison of three major releases of the Ensembl human protein database was performed to examine the effect of database updates on our results. We report here the benefit of repeated analysis of aliquots of the same fraction: first to increase the confidence in peptide identification by repeated confirmation in several aliquots; and second to assess experimental reproducibility. We demonstrate furthermore the effect of database modifications on the results and the importance of constantly re-analysing data with new releases to keep them consistent and up to date with the latest protein identities and predictions available.
Affiliation(s)
- Alexandre Panchaud
- Functional Genomics Group, Department of Bioanalytical Science, Nestle Research Centre, Lausanne, Switzerland
36
Frank A, Tanner S, Bafna V, Pevzner P. Peptide Sequence Tags for Fast Database Search in Mass-Spectrometry. J Proteome Res 2005; 4:1287-95. [PMID: 16083278 DOI: 10.1021/pr050011x] [Citation(s) in RCA: 107] [Impact Index Per Article: 5.6] [Indexed: 11/29/2022]
Abstract
Filtration techniques in the form of rapid elimination of candidate sequences while retaining the true one are key ingredients of database searches in genomics. Although SEQUEST and Mascot perform a conceptually similar task to the tool BLAST, the key algorithmic idea of BLAST (filtration) was never implemented in these tools. As a result MS/MS protein identification tools are becoming too time-consuming for many applications including search for post-translationally modified peptides. Moreover, matching millions of spectra against all known proteins will soon make these tools too slow in the same way that "genome vs genome" comparisons instantly made BLAST too slow. We describe the development of filters for MS/MS database searches that dramatically reduce the running time and effectively remove the bottlenecks in searching the huge space of protein modifications. Our approach, based on a probability model for determining the accuracy of sequence tags, achieves superior results compared to GutenTag, a popular tag generation algorithm. Our tag generating algorithm along with our de novo sequencing algorithm PepNovo can be accessed via the URL http://peptide.ucsd.edu/.
Affiliation(s)
- Ari Frank
- Department of Computer Science & Engineering, University of California-San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0114, USA.
37
Ferry-Dumazet H, Houel G, Montalent P, Moreau L, Langella O, Negroni L, Vincent D, Lalanne C, de Daruvar A, Plomion C, Zivy M, Joets J. PROTICdb: A web-based application to store, track, query, and compare plant proteome data. Proteomics 2005; 5:2069-81. [PMID: 15846840 DOI: 10.1002/pmic.200401111] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.9] [Indexed: 11/06/2022]
Abstract
PROTICdb is a web-based application, mainly designed to store and analyze plant proteome data obtained by two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) and mass spectrometry (MS). The purposes of PROTICdb are (i) to store, track, and query information related to proteomic experiments, i.e., from tissue sampling to protein identification and quantitative measurements, and (ii) to integrate information from the user's own expertise and other sources into a knowledge base, used to support data interpretation (e.g., for the determination of allelic variants or products of post-translational modifications). Data insertion into the relational database of PROTICdb is achieved either by uploading outputs of image analysis and MS identification software, or by filling web forms. 2-D PAGE annotated maps can be displayed, queried, and compared through a graphical interface. Links to external databases are also available. Quantitative data can be easily exported in a tabulated format for statistical analyses. PROTICdb is based on the Oracle or the PostgreSQL Database Management System and is freely available upon request at the following URL: http://moulon.inra.fr/bioinfo/PROTICdb.
Affiliation(s)
- Hélène Ferry-Dumazet
- Centre de Bioinformatique de Bordeaux, Université Victor Segalen Bordeaux 2, France
38
López-Ferrer D, Martínez-Bartolomé S, Villar M, Campillos M, Martín-Maroto F, Vázquez J. Statistical Model for Large-Scale Peptide Identification in Databases from Tandem Mass Spectra Using SEQUEST. Anal Chem 2004; 76:6853-60. [PMID: 15571333 DOI: 10.1021/ac049305c] [Citation(s) in RCA: 89] [Impact Index Per Article: 4.5] [Indexed: 11/29/2022]
Abstract
Recent technological advances have made multidimensional peptide separation techniques coupled with tandem mass spectrometry the method of choice for high-throughput identification of proteins. Due to these advances, the development of software tools for large-scale, fully automated, unambiguous peptide identification is highly necessary. In this work, we have used as a model the nuclear proteome from Jurkat cells and present a processing algorithm that allows accurate predictions of random matching distributions, based on the two SEQUEST scores Xcorr and DeltaCn. Our method permits a very simple and precise calculation of the probabilities associated with individual peptide assignments, as well as of the false discovery rate among the peptides identified in any experiment. A further mathematical analysis demonstrates that the score distributions are highly dependent on database size and precursor mass window and suggests that the probability associated with SEQUEST scores depends on the number of candidate peptide sequences available for the search. Our results highlight the importance of adjusting the filtering criteria to discriminate between correct and incorrect peptide sequences according to the circumstances of each particular experiment.
Affiliation(s)
- Daniel López-Ferrer
- Centro de Biología Molecular Severo Ochoa-CSIC, 28049 Cantoblanco, Madrid, Spain
39
Sun W, Li F, Wang J, Zheng D, Gao Y. AMASS: Software for Automatically Validating the Quality of MS/MS Spectrum from SEQUEST Results. Mol Cell Proteomics 2004; 3:1194-9. [PMID: 15489460 DOI: 10.1074/mcp.m400120-mcp200] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.5] [Indexed: 11/06/2022]
Abstract
Time-consuming and experience-dependent manual validation of tandem mass spectra is usually applied to SEQUEST results. This inefficient method has become a significant bottleneck for MS/MS data processing. Here we introduce a program, AMASS (advanced mass spectrum screener), which filters the tandem mass spectra of SEQUEST results by measuring the match percentage of high-abundance ions and the continuity of matched fragment ions in the b and y series. Compared with the Xcorr and DeltaCn filter, AMASS can increase the number of positives and reduce the number of negatives in 22 datasets generated from 18 known protein mixtures. It effectively removed most noisy spectra, false interpretations, and about half of the poor-fragmentation spectra, and it can work synergistically with the Rscore filter. We believe that using AMASS together with Rscore can yield more accurate identification of peptide MS/MS spectra and reduce the time and effort needed for manual validation.
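The two measures named in this abstract can be sketched as follows. The abstract does not give the exact definitions, so this example makes its own assumptions: "high-abundance ions" are taken to be the `top_n` most intense peaks, a peak matches when it lies within `tol` of a theoretical fragment m/z, and continuity is the longest run of consecutive matched ions in one series. All names and default values are hypothetical and are not AMASS's actual parameters.

```python
def match_percentage(peaks, theoretical_mz, top_n=10, tol=0.5):
    """Percentage of the top_n most intense peaks matching a theoretical ion.

    peaks is a list of (mz, intensity) pairs.
    """
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    if not top:
        return 0.0
    matched = sum(
        any(abs(mz - t) <= tol for t in theoretical_mz) for mz, _ in top
    )
    return 100.0 * matched / len(top)


def longest_matched_run(series_mz, peaks, tol=0.5):
    """Longest run of consecutive ions in one series (b or y) that are observed.

    series_mz must list the theoretical ions in series order (b1, b2, ...).
    """
    observed = [mz for mz, _ in peaks]
    longest = current = 0
    for t in series_mz:
        if any(abs(mz - t) <= tol for mz in observed):
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # a missing ion breaks the run
    return longest
```

A spectrum whose intense peaks are mostly unexplained, or whose matched ions are scattered rather than consecutive, would score low on these measures and could be flagged for removal.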
Affiliation(s)
- Wei Sun
- Proteomics Research Center, National Key Laboratory of Medical Molecular Biology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, Beijing, People's Republic of China
40
Abstract
A 35 kDa protein was purified from rat spinal ganglia and sensory fibers. Using direct trypsin digestion combined with liquid chromatography ion trap mass spectrometry, we identified the 35 kDa protein as annexin V. We then studied the distribution of serum antibodies to annexin V in patients with peripheral neuropathy. We found serum antibodies to annexin V only in some patients with immune-mediated neuropathy. This indicates that humoral immune responses to annexin V might play a role in the pathogenesis of autoimmune sensory neuropathy or sensory neuronopathy.
Affiliation(s)
- Quan Li
- Department of Molecular and Cellular Pharmacology, College of Pharmaceutical Sciences, Peking University, 38 Xueyuan Road, Haidian District, Beijing 100083, PR China
41
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2004. [PMCID: PMC2447433 DOI: 10.1002/cfg.356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]