1
Tagirdzhanov AM, Shlemov A, Gurevich A. NPS: scoring and evaluating the statistical significance of peptidic natural product-spectrum matches. Bioinformatics 2020; 35:i315-i323. [PMID: 31510666 PMCID: PMC6612854 DOI: 10.1093/bioinformatics/btz374] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/21/2022]
Abstract
MOTIVATION: Peptidic natural products (PNPs) are considered a promising compound class that has many applications in medicine. Recently developed mass spectrometry-based pipelines are transforming PNP discovery into a high-throughput technology. However, the current computational methods for PNP identification via database search of mass spectra are still in their infancy and could be substantially improved. RESULTS: Here we present NPS, a statistical learning-based approach for scoring PNP-spectrum matches. We incorporated NPS into two leading PNP discovery tools and benchmarked them on millions of natural product mass spectra. The results demonstrate a more than 45% increase in the number of identified spectra and 20% more PNPs found at a false discovery rate of 1%. AVAILABILITY AND IMPLEMENTATION: NPS is available as a command line tool and as a web application at http://cab.spbu.ru/software/NPS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Azat M Tagirdzhanov
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia; Department of Higher Mathematics, St. Petersburg Electrotechnical University "LETI", St. Petersburg, Russia
- Alexander Shlemov
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Alexey Gurevich
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
2
Cotton J, Leroux F, Broudin S, Marie M, Corman B, Tabet JC, Ducruix C, Junot C. High-resolution mass spectrometry associated with data mining tools for the detection of pollutants and chemical characterization of honey samples. Journal of Agricultural and Food Chemistry 2014; 62:11335-45. [PMID: 25358104 DOI: 10.1021/jf504400c] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Indexed: 05/24/2023]
Abstract
Analytical methods for food control are mainly focused on restricted lists of well-known contaminants. This paper shows that liquid chromatography-high-resolution mass spectrometry (LC/ESI-HRMS) associated with the data mining tools developed for metabolomics can address this issue by enabling (i) targeted analyses of pollutants, (ii) detection of untargeted and unknown xenobiotics, and (iii) detection of metabolites useful for the characterization of food matrices. A proof-of-concept study was performed on 76 honey samples. Targeted analysis indicated that 35 of 83 targeted molecules were detected in the 76 honey samples at concentrations below regulatory limits. Furthermore, untargeted metabolomic-like analyses highlighted 12 chlorinated xenobiotics, 1 of which was detected in lavender honey samples and identified as 2,6-dichlorobenzamide, a metabolite of dichlobenil, a pesticide banned in France since 2010. Lastly, multivariate statistical analyses discriminated honey samples according to their floral origin, and six discriminating metabolites were characterized thanks to the MS/MS experiments.
Affiliation(s)
- Jérôme Cotton
- CEA, iBiTec-S, Service de Pharmacologie et d'Immunoanalyse, Laboratoire d'Etude du Métabolisme des Médicaments, MetaboHUB Paris, 91191 Gif-sur-Yvette, France
3
Peterson ES, McCue LA, Schrimpe-Rutledge AC, Jensen JL, Walker H, Kobold MA, Webb SR, Payne SH, Ansong C, Adkins JN, Cannon WR, Webb-Robertson BJM. VESPA: software to facilitate genomic annotation of prokaryotic organisms through integration of proteomic and transcriptomic data. BMC Genomics 2012; 13:131. [PMID: 22480257 PMCID: PMC3364912 DOI: 10.1186/1471-2164-13-131] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Received: 12/27/2011] [Accepted: 04/05/2012] [Indexed: 11/10/2022]
Abstract
Background: The procedural aspects of genome sequencing and assembly have become relatively inexpensive, yet the full, accurate structural annotation of these genomes remains a challenge. Next-generation sequencing transcriptomics (RNA-Seq), global microarrays, and tandem mass spectrometry (MS/MS)-based proteomics have demonstrated immense value to genome curators as individual sources of information; however, integrating these data types to validate and improve structural annotation remains a major challenge. Current visual and statistical analytic tools are focused on a single data type, or existing software tools are retrofitted to analyze new data forms. We present Visual Exploration and Statistics to Promote Annotation (VESPA), a new interactive visual analysis software tool focused on assisting scientists with the annotation of prokaryotic genomes through the integration of proteomics and transcriptomics data with current genome location coordinates. Results: VESPA is a desktop Java™ application that integrates high-throughput proteomics data (peptide-centric) and transcriptomics (probe or RNA-Seq) data into a genomic context, all of which can be visualized at three levels of genomic resolution. Data is interrogated via searches linked to the genome visualizations to find regions with high likelihood of mis-annotation. Search results are linked to exports for further validation outside of VESPA, or potential coding regions can be analyzed concurrently with the software through interaction with BLAST. VESPA is demonstrated on two use cases (Yersinia pestis Pestoides F and Synechococcus sp. PCC 7002) that show how rapidly mis-annotations can be found and explored in VESPA using either proteomics data alone or in combination with transcriptomic data.
Conclusions: VESPA is an interactive visual analytics tool that integrates high-throughput data into a genomic context to facilitate the discovery of structural mis-annotations in prokaryotic genomes. Data is evaluated via visual analysis across multiple levels of genomic resolution, linked searches and interaction with existing bioinformatics tools. We highlight the novel functionality of VESPA and core programming requirements for visualization of these large heterogeneous datasets for a client-side application. The software is freely available at https://www.biopilot.org/docs/Software/Vespa.php.
Affiliation(s)
- Elena S Peterson
- Scientific Data Management, Pacific Northwest National Laboratory, Richland, WA, USA
4
Yu C, Lin Y, Sun S, Cai J, Zhang J, Bu D, Zhang Z, Chen R. An iterative algorithm to quantify factors influencing peptide fragmentation during tandem mass spectrometry. J Bioinform Comput Biol 2007; 5:297-311. [PMID: 17589963 DOI: 10.1142/s0219720007002643] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 10/02/2006] [Revised: 01/02/2007] [Accepted: 01/22/2007] [Indexed: 11/18/2022]
Abstract
In protein identification by tandem mass spectrometry, it is critical to accurately predict the theoretical spectrum for a peptide sequence. To date, widely used database search methods have adopted simple statistical models for this prediction. For some peptides, these models yield a theoretical spectrum that deviates significantly from the experimental one. In this paper, to derive an improved prediction model, we used a non-linear programming model to quantify the factors affecting peptide fragmentation. An iterative algorithm was then proposed to solve this optimization problem. On a training set of 1803 spectra, the experimental results agreed well with known principles of peptide fragmentation, such as the tendency to cleave at the middle of the peptide and Pro's preference for N-terminal cleavage. Moreover, on a testing set of 941 spectra, comparison of the predicted spectra against the experimental ones showed that this method can generate reasonable predictions. The results in this paper can help both database searching and de novo methods.
Affiliation(s)
- Chungong Yu
- Bioinformatics Lab, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China.
5
Cannon WR, Rawlins MM, Baxter DJ, Callister SJ, Lipton MS, Bryant DA. Large improvements in MS/MS-based peptide identification rates using a hybrid analysis. J Proteome Res 2011; 10:2306-17. [PMID: 21391700 DOI: 10.1021/pr101130b] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Indexed: 01/16/2023]
Abstract
We report a hybrid search method combining database and spectral library searches that allows for a straightforward approach to characterizing the error rates from the combined data. Using these methods, we demonstrate significantly increased sensitivity and specificity in matching peptides to tandem mass spectra. The hybrid search method increased the number of spectra that can be assigned to a peptide in a global proteomics study by 57-147% at an estimated false discovery rate of 5%, with clear room for even greater improvements. The approach combines the general utility of using consensus model spectra typical of database search methods with the accuracy of the intensity information contained in spectral libraries. A common scoring metric based on recent developments linking data analysis and statistical thermodynamics is used, which allows the use of a conservative estimate of error rates for the combined data. We applied this approach to proteomics analysis of Synechococcus sp. PCC 7002, a cyanobacterium that is a model organism for studies of photosynthetic carbon fixation and biofuels development. The increased specificity and sensitivity of this approach allowed us to identify many more peptides involved in the processes important for photoautotrophic growth.
Affiliation(s)
- William R Cannon
- Computational Biology and Bioinformatics Group, Pacific Northwest National Laboratory, Richland, Washington 99352, United States.
6
Webb-Robertson BJM, McCue LA, Waters KM, Matzke MM, Jacobs JM, Metz TO, Varnum SM, Pounds JG. Combined statistical analyses of peptide intensities and peptide occurrences improves identification of significant peptides from MS-based proteomics data. J Proteome Res 2010; 9:5748-56. [PMID: 20831241 PMCID: PMC2974810 DOI: 10.1021/pr1005247] [Citation(s) in RCA: 78] [Impact Index Per Article: 5.6] [Indexed: 11/29/2022]
Abstract
Liquid chromatography−mass spectrometry-based (LC−MS) proteomics uses peak intensities of proteolytic peptides to infer the differential abundance of peptides/proteins. However, substantial run-to-run variability in intensities and observations (presence/absence) of peptides makes data analysis quite challenging. The missing observations in LC−MS proteomics data are difficult to address with traditional imputation-based approaches because the mechanisms by which data are missing are unknown a priori. Data can be missing due to random mechanisms such as experimental error or nonrandom mechanisms such as a true biological effect. We present a statistical approach that uses a test of independence known as a G-test to test the null hypothesis of independence between the number of missing values across experimental groups. We pair the G-test results, evaluating independence of missing data (IMD) with an analysis of variance (ANOVA) that uses only means and variances computed from the observed data. Each peptide is therefore represented by two statistical confidence metrics, one for qualitative differential observation and one for quantitative differential intensity. We use three LC−MS data sets to demonstrate the robustness and sensitivity of the IMD−ANOVA approach. Missing abundance values in LC−MS data are difficult to analyze statistically because the mechanisms by which the data are missing are unknown (processing or biological effect). We present a new approach that pairs a test of independence on missing data to discern qualitative difference across treatment groups with traditional statistical tests that evaluate quantitative differences. The combination of these two statistics yields a more robust statistical description of the data.
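The G-test on missing observations described in this abstract can be illustrated with a minimal sketch. It assumes the simplest case of two experimental groups, so the statistic has one degree of freedom; the function name `g_test_missing` and its arguments are hypothetical and not taken from the paper.

```python
import math

def g_test_missing(missing_a, total_a, missing_b, total_b):
    """G-test of independence on a 2x2 table of missing vs. observed counts.

    Rows are the two experimental groups; columns are "missing" and
    "observed" for one peptide. Returns (G statistic, p value).
    """
    observed = [
        [missing_a, total_a - missing_a],
        [missing_b, total_b - missing_b],
    ]
    row_sums = [sum(row) for row in observed]
    col_sums = [observed[0][j] + observed[1][j] for j in range(2)]
    grand = sum(row_sums)

    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a cell with zero count contributes nothing
                e = row_sums[i] * col_sums[j] / grand  # expected under independence
                g += 2.0 * o * math.log(o / e)

    # Chi-square survival function for 1 degree of freedom.
    p = math.erfc(math.sqrt(max(g, 0.0) / 2.0))
    return g, p

if __name__ == "__main__":
    # Peptide missing in 9 of 10 runs in group A but only 1 of 10 in group B.
    print(g_test_missing(9, 10, 1, 10))
```

A small p value here suggests that missingness depends on the group, which is the qualitative signal the abstract pairs with the ANOVA on observed intensities.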
7
Klammer AA, Park CY, Noble WS. Statistical calibration of the SEQUEST XCorr function. J Proteome Res 2009; 8:2106-13. [PMID: 19275164 DOI: 10.1021/pr8011107] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.1] [Indexed: 11/28/2022]
Abstract
Obtaining accurate peptide identifications from shotgun proteomics liquid chromatography tandem mass spectrometry (LC-MS/MS) experiments requires a score function that consistently ranks correct peptide-spectrum matches (PSMs) above incorrect matches. We have observed that, for the Sequest score function Xcorr, the inability to discriminate between correct and incorrect PSMs is due in part to spectrum-specific properties of the score distribution. In other words, some spectra score well regardless of which peptides they are scored against, and other spectra score well because they are scored against a large number of peptides. We describe a protocol for calibrating PSM score functions, and we demonstrate its application to Xcorr and the preliminary Sequest score function Sp. The protocol accounts for spectrum- and peptide-specific effects by calculating p values for each spectrum individually, using only that spectrum's score distribution. We demonstrate that these calculated p values are uniform under a null distribution and therefore accurately measure significance. These p values can be used to estimate the false discovery rate, thereby eliminating the need for an extra search against a decoy database. In addition, we show that the p values are better calibrated than their underlying scores; consequently, when ranking top-scoring PSMs from multiple spectra, p values are better at discriminating between correct and incorrect PSMs. The calibration protocol is generally applicable to any PSM score function for which an appropriate parametric family can be identified.
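The per-spectrum calibration idea can be sketched as follows. This is an illustration only: it assumes a Gumbel distribution fitted by the method of moments as the parametric family, which is a stand-in chosen for simplicity and not necessarily the family used in the paper. The function names are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel(scores):
    """Method-of-moments Gumbel fit (assumed family) to one spectrum's scores."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6.0) / math.pi  # scale
    mu = mean - EULER_GAMMA * beta        # location
    return mu, beta

def spectrum_p_value(top_score, candidate_scores):
    """P(score >= top_score) under the null fitted to this spectrum alone."""
    mu, beta = fit_gumbel(candidate_scores)
    z = (top_score - mu) / beta
    # Gumbel survival function, written with expm1 for small p values.
    return -math.expm1(-math.exp(-z))

if __name__ == "__main__":
    # Scores of one spectrum against many (presumed incorrect) candidates.
    null_scores = [1.0, 1.2, 0.8, 1.1, 0.9, 1.05, 0.95, 1.3, 0.7, 1.0]
    print(spectrum_p_value(3.0, null_scores))
```

Because each spectrum gets its own fitted null, a spectrum that scores well against every candidate no longer looks significant merely because its raw scores are high.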
Affiliation(s)
- Aaron A Klammer
- Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
8
Abstract
The analysis of the large volume of tandem mass spectrometry (MS/MS) proteomics data that is generated these days relies on automated algorithms that identify peptides from their mass spectra. An essential component of these algorithms is the scoring function used to evaluate the quality of peptide-spectrum matches (PSMs). In this paper, we present a new approach to scoring PSMs. We argue that since this problem is at its core a ranking task (especially in the case of de novo sequencing), it can be solved effectively using machine learning ranking algorithms. We developed a new discriminative boosting-based approach to scoring. Our scoring models draw upon a large set of diverse feature functions that measure different qualities of PSMs. Our method improves the performance of our de novo sequencing algorithm beyond the current state-of-the-art, and also greatly enhances the performance of database search programs. Furthermore, by increasing the efficiency of tag filtration and improving the sensitivity of PSM scoring, we make it practical to perform large-scale MS/MS analysis, such as proteogenomic search of a six-frame translation of the human genome (in which we achieve a reduction of the running time by a factor of 15 and a 60% increase in the number of identified peptides, compared to the InsPecT database search tool). Our scoring function is incorporated into PepNovo+ which is available for download or can be run online at http://bix.ucsd.edu.
Affiliation(s)
- Ari M Frank
- Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, Mail Code 0404 La Jolla, California 92093-0404, USA.
9
Webb-Robertson BJM. Support vector machines for improved peptide identification from tandem mass spectrometry database search. Methods Mol Biol 2009; 492:453-460. [PMID: 19241051 DOI: 10.1007/978-1-59745-493-3_28] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 05/27/2023]
Abstract
Accurate identification of peptides is a current challenge in mass spectrometry (MS)-based proteomics. The standard approach uses a search routine to compare tandem mass spectra to a database of peptides associated with the target organism. These database search routines yield multiple metrics associated with the quality of the mapping of the experimental spectrum to the theoretical spectrum of a peptide. The structure of these results makes separating correct from false identifications difficult and has created a false identification problem. Statistical confidence scores are an approach to battle this false positive problem that has led to significant improvements in peptide identification. We have shown that machine learning, specifically support vector machine (SVM), is an effective approach to separating true peptide identifications from false ones. The SVM-based peptide statistical scoring method transforms a peptide into a vector representation based on database search metrics to train and validate the SVM. In practice, following the database search routine, a peptide is denoted in its vector representation and the SVM generates a single statistical score that is then used to classify presence or absence in the sample.
10
Allmer J, Kuhlgert S, Hippler M. 2DB: a Proteomics database for storage, analysis, presentation, and retrieval of information from mass spectrometric experiments. BMC Bioinformatics 2008; 9:302. [PMID: 18605993 PMCID: PMC2475538 DOI: 10.1186/1471-2105-9-302] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 02/18/2008] [Accepted: 07/07/2008] [Indexed: 11/26/2022]
Abstract
Background: The amount of information stemming from proteomics experiments involving (multidimensional) separation techniques, mass spectrometric analysis, and computational analysis is ever-increasing. Data from such an experimental workflow needs to be captured, related and analyzed. Biological experiments within this scope produce heterogeneous data ranging from pictures of one- or two-dimensional protein maps and spectra recorded by tandem mass spectrometry to text-based identifications made by algorithms which analyze these spectra. Additionally, peptide and corresponding protein information needs to be displayed. Results: In order to handle the large amount of data from computational processing of mass spectrometric experiments, automatic import scripts are available and the necessity for manual input to the database has been minimized. Information is in a generic format which abstracts from specific software tools typically used in such an experimental workflow. The software is therefore capable of storing and cross-analyzing results from many algorithms. A novel feature and a focus of this database is to facilitate protein identification by using peptides identified from mass spectrometry and to link this information directly to the respective protein maps. Additionally, our application employs spectral counting for quantitative presentation of the data. All information can be linked to hot spots on images to place the results into an experimental context. A summary of identified proteins, containing all relevant information per hot spot, is automatically generated, usually upon either a change in the underlying protein models or due to newly imported identifications. The supporting information for this report can be accessed in multiple ways using the user interface provided by the application.
Conclusion: We present a proteomics database which aims to greatly reduce evaluation time of results from mass spectrometric experiments and enhance result quality by allowing consistent data handling. Import functionality, automatic protein detection, and summary creation act together to facilitate data analysis. In addition, supporting information for these findings is readily accessible via the graphical user interface provided. The database schema and the implementation, which can easily be installed on virtually any server, can be downloaded in the form of a compressed file from our project webpage.
Affiliation(s)
- Jens Allmer
- Institute for Plant Biochemistry and Biotechnology, University of Münster, Hindenburgplatz 55, Münster, Germany.
11
Vasilescu J, Smith JC, Zweitzig DR, Denis NJ, Haines DS, Figeys D. Systematic determination of ion score cutoffs based on calculated false positive rates: application for identifying ubiquitinated proteins by tandem mass spectrometry. Journal of Mass Spectrometry 2008; 43:296-304. [PMID: 17957819 DOI: 10.1002/jms.1297] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 05/25/2023]
Abstract
We report a simple approach for determining ion score cutoffs that permit the confident identification of ubiquitinated proteins by tandem mass spectrometry (MS/MS). Initial experiments involving the analysis of gel bands containing multi-Ubiquitin chains with quadrupole time-of-flight and quadrupole ion trap mass spectrometers revealed that standard ion score cutoffs used for database searching were not sufficiently stringent. We also found that false positive and false negative rates (FPR and FNR) varied significantly depending on the cutoff scores used and that appropriate cutoffs could only be determined following a systematic evaluation of false positive rates. When standard cutoff scores were used for the analysis of complex mixtures of ubiquitinated proteins, unacceptably high FPR were observed. Finally, we found that FPR for ubiquitinated proteins are affected by the size of the protein database that is searched. These observations may be applicable for the study of other post-translational modifications.
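One way to picture a systematic cutoff search is the sketch below. The abstract does not say how false positive rates were calculated, so this assumes they are estimated from a decoy search as the ratio of decoy hits to target hits above a cutoff; the function names and the 5% default are hypothetical.

```python
def estimated_fpr(cutoff, target_scores, decoy_scores):
    """Decoy-based estimate (an assumption) of the FPR at a given ion score cutoff."""
    n_target = sum(s >= cutoff for s in target_scores)
    n_decoy = sum(s >= cutoff for s in decoy_scores)
    if n_target == 0:
        return 0.0
    return min(1.0, n_decoy / n_target)

def lowest_acceptable_cutoff(target_scores, decoy_scores, max_fpr=0.05):
    """Scan candidate cutoffs in ascending order and return the first one
    whose estimated FPR does not exceed max_fpr, or None if none qualifies."""
    for cutoff in sorted(set(target_scores)):
        if estimated_fpr(cutoff, target_scores, decoy_scores) <= max_fpr:
            return cutoff
    return None

if __name__ == "__main__":
    targets = [10, 20, 30, 40, 50, 60]
    decoys = [12, 18, 25]
    print(lowest_acceptable_cutoff(targets, decoys))
```

Rerunning the scan with a larger decoy set is a simple way to see the database-size effect the abstract mentions, since more decoys above a given score push the acceptable cutoff upward.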
Affiliation(s)
- Julian Vasilescu
- Ottawa Institute of Systems Biology, University of Ottawa, 451 Smyth Road, Ottawa, ON, K1H 8M5, Canada
12
Dodds ED, Clowers BH, Hagerman PJ, Lebrilla CB. Systematic characterization of high mass accuracy influence on false discovery and probability scoring in peptide mass fingerprinting. Anal Biochem 2007; 372:156-66. [PMID: 17980142 DOI: 10.1016/j.ab.2007.10.009] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 06/25/2007] [Revised: 10/01/2007] [Accepted: 10/08/2007] [Indexed: 11/29/2022]
Abstract
Whereas the bearing of mass measurement error on protein identification is sometimes underestimated, uncertainty in observed peptide masses unavoidably translates to ambiguity in subsequent protein identifications. Although ongoing instrumental advances continue to make high accuracy mass spectrometry (MS) increasingly accessible, many proteomics experiments are still conducted with rather large mass error tolerances. In addition, the ranking schemes of most protein identification algorithms do not include a meaningful incorporation of mass measurement error. This article provides a critical evaluation of mass error tolerance as it pertains to false positive peptide and protein associations resulting from peptide mass fingerprint (PMF) database searching. High accuracy, high resolution PMFs of several model proteins were obtained using matrix-assisted laser desorption/ionization Fourier transform ion cyclotron resonance mass spectrometry (MALDI-FTICR-MS). Varying levels of mass accuracy were simulated by systematically modulating the mass error tolerance of the PMF query and monitoring the effect on figures of merit indicating the PMF quality. Importantly, the benefits of decreased mass error tolerance are not manifest in Mowse scores when operating at tolerances in the low parts-per-million range but become apparent with the consideration of additional metrics that are often overlooked. Furthermore, the outcomes of these experiments support the concept that false discovery is closely tied to mass measurement error in PMF analysis. Clear establishment of this relation demonstrates the need for mass error-aware protein identification routines and argues for a more prominent contribution of high accuracy mass measurement to proteomic science.
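The effect of mass error tolerance on peptide mass matching can be shown with a small sketch. It uses the standard parts-per-million error definition; the function names and the example masses are hypothetical and chosen only for illustration.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Signed mass measurement error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6

def count_matches(observed, theoretical, tolerance_ppm):
    """Number of observed masses within tolerance of any theoretical mass."""
    return sum(
        any(abs(ppm_error(obs, theo)) <= tolerance_ppm for theo in theoretical)
        for obs in observed
    )

if __name__ == "__main__":
    observed = [1000.005, 1500.1, 2000.0]
    theoretical = [1000.0, 1500.0, 2000.02]
    for tol in (1, 10, 100):
        print(tol, count_matches(observed, theoretical, tol))
```

Widening the tolerance admits more matches, including ones that are tens of ppm off, which is the mechanism by which loose tolerances inflate false peptide and protein associations.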
Affiliation(s)
- Eric D Dodds
- Department of Chemistry, University of California, Davis, Davis, CA 95616, USA
13
Cannon WR, Taasevigen D, Baxter DJ, Laskin J. Evaluation of the influence of amino acid composition on the propensity for collision-induced dissociation of model peptides using molecular dynamics simulations. Journal of the American Society for Mass Spectrometry 2007; 18:1625-37. [PMID: 17651984 DOI: 10.1016/j.jasms.2007.06.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 04/03/2007] [Revised: 06/13/2007] [Accepted: 06/14/2007] [Indexed: 05/16/2023]
Abstract
The dynamical behavior of model peptides was evaluated with respect to their ability to form internal proton donor-acceptor pairs using molecular dynamics simulations. The proton donor-acceptor pairs are postulated to be prerequisites for peptide bond cleavage resulting in formation of b and y ions during low-energy collision-induced dissociation in tandem mass spectrometry (MS/MS). The simulations for the polyalanine pentamer Ala(5)H(+) were compared with experimental data from energy-resolved surface induced dissociation (SID) studies. The simulation results give insight into the events that likely lead up to the fragmentation of peptides. Nine-mer polyalanine-based model peptides were used to examine the dynamical effect of each of the 20 common amino acids on the probability to form donor-acceptor pairs at labile peptide bonds. A range of probabilities was observed as a function of the substituted amino acid. However, the location of the peptide bond involved in the donor-acceptor pair plays a critical role in the dynamical behavior. This influence of position on the probability of forming a donor-acceptor pair would be hard to predict from statistical analyses on experimental spectra of aggregate, diverse peptides. In addition, the inclusion of basic side chains in the model peptides alters the probability of forming donor-acceptor pairs across the entire backbone. In this case, there are still more ionizing protons than basic residues, but the side chains of the basic amino acids form stable hydrogen bond networks with the peptide carbonyl oxygens and thus act to prevent free access of "mobile protons" to labile peptide bonds. It is clear from the work that the identification of peptides from low-energy CID using automated computational methods should consider the location of the fragmenting bond as well as the amino acid composition.
Affiliation(s)
- William R Cannon
- Computational Biology and Bioinformatics Group, Computational and Information Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington 99352, USA.
14
Sharp JL, Anderson KK, Hurst GB, Daly DS, Pelletier DA, Cannon WR, Auberry DL, Schmoyer DD, McDonald WH, White AM, Hooker BS, Victry KD, Buchanan MV, Kery V, Wiley HS. Statistically inferring protein-protein associations with affinity isolation LC-MS/MS assays. J Proteome Res 2007; 6:3788-95. [PMID: 17691832 DOI: 10.1021/pr0701106] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Abstract
Affinity isolation of protein complexes followed by protein identification by LC-MS/MS is an increasingly popular approach for mapping protein interactions. However, systematic and random assay errors from multiple sources must be considered to confidently infer authentic protein-protein interactions. To address this issue, we developed a general, robust statistical method for inferring authentic interactions from protein prey-by-bait frequency tables using a binomial-based likelihood ratio test (LRT) coupled with Bayes' Odds estimation. We then applied our LRT-Bayes' algorithm experimentally using data from protein complexes isolated from Rhodopseudomonas palustris. Our algorithm, in conjunction with the experimental protocol, inferred with high confidence authentic interacting proteins from abundant, stable complexes, but few or no authentic interactions for lower-abundance complexes. The algorithm can discriminate against a background of prey proteins that are detected in association with a large number of baits as an artifact of the measurement. We conclude that the experimental protocol including the LRT-Bayes' algorithm produces results with high confidence but moderate sensitivity. We also found that Monte Carlo simulation is a feasible tool for checking modeling assumptions, estimating parameters, and evaluating the significance of results in protein association studies.
Affiliation(s)
- Julia L Sharp
- Clemson University, 237 Barre Hall, Clemson, South Carolina 29634-0313, Pacific Northwest National Laboratory, P.O. Box 999, Richland, Washington 99352, USA.
15
Havilio M, Wool A. Large-scale unrestricted identification of post-translation modifications using tandem mass spectrometry. Anal Chem 2007; 79:1362-8. [PMID: 17297935 DOI: 10.1021/ac061515x] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.4] [Indexed: 11/28/2022]
Abstract
TwinPeaks, a close variant of the SEQUEST protein identification algorithm, is capable of unrestricted, large-scale identification of post-translation modifications (PTMs). TwinPeaks is applied to a sample of 100441 tandem mass spectra from the HUPO Plasma Proteome Project data set, with the full non-redundant human database as the reference protein database. With a 3.5% error rate, TwinPeaks identifies a collection of 539 spectra that were not identified by the usual PTM-restricted identification algorithm. At this error rate, TwinPeaks increases the rate of spectra identifications by at least 17.6%, making unrestricted PTM identification an integral part of proteomics.
Affiliation(s)
- Moshe Havilio
- Notal Vision Limited, 5 Droyanov Street, Tel Aviv, 63143 Israel.
16
Higgs RE, Knierman MD, Freeman AB, Gelbert LM, Patil ST, Hale JE. Estimating the statistical significance of peptide identifications from shotgun proteomics experiments. J Proteome Res 2007; 6:1758-67. [PMID: 17397207 DOI: 10.1021/pr0605320] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.3] [Indexed: 11/28/2022]
Abstract
We present a wrapper-based approach to estimate and control the false discovery rate for peptide identifications using the outputs from multiple commercially available MS/MS search engines. Features of the approach include the flexibility to combine output from multiple search engines with sequence and spectral derived features in a flexible classification model to produce a score associated with correct peptide identifications. This classification model score from a reversed database search is taken as the null distribution for estimating p-values and false discovery rates using a simple and established statistical procedure. Results from 10 analyses of rat sera on an LTQ-FT mass spectrometer indicate that the method is well calibrated for controlling the proportion of false positives in a set of reported peptide identifications while correctly identifying more peptides than rule-based methods using one search engine alone.
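Using reversed-database scores as a null distribution can be sketched as follows. The abstract does not name the "simple and established statistical procedure", so this example assumes empirical p values followed by the Benjamini-Hochberg adjustment; the function names are hypothetical.

```python
from bisect import bisect_left

def empirical_p_values(target_scores, decoy_scores):
    """p value of each target score against the decoy (null) score distribution."""
    decoys = sorted(decoy_scores)
    n = len(decoys)
    p_values = []
    for s in target_scores:
        n_ge = n - bisect_left(decoys, s)      # decoys scoring >= s
        p_values.append((n_ge + 1) / (n + 1))  # +1 keeps p values above zero
    return p_values

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values (assumed FDR procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

if __name__ == "__main__":
    decoys = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    targets = [10, 5.5, 0.5]
    p = empirical_p_values(targets, decoys)
    print(p, benjamini_hochberg(p))
```

Reporting only identifications whose adjusted value falls below a chosen level is then one way to control the proportion of false positives in the reported set.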
Affiliation(s)
- Richard E Higgs
- Lilly Research Laboratories, MS 1533, Lilly Corporate Center, Indianapolis, Indiana 46285, USA.