1
|
On the use of tandem mass spectra acquired from samples of evolutionarily distant organisms to validate methods for false discovery rate estimation. Proteomics 2024:e2300398. [PMID: 38491400 DOI: 10.1002/pmic.202300398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Revised: 03/01/2024] [Accepted: 03/06/2024] [Indexed: 03/18/2024]
Abstract
Estimating the false discovery rate (FDR) of peptide identifications is a key step in proteomics data analysis, and many methods have been proposed for this purpose. Recently, an entrapment-inspired protocol to validate methods for FDR estimation appeared in articles showcasing new spectral library search tools. That validation approach involves generating incorrect spectral matches by searching spectra from evolutionarily distant organisms (entrapment queries) against the original target search space. Although this approach may appear similar to the solutions using entrapment databases, it represents a distinct conceptual framework whose correctness has not been verified yet. In this viewpoint, we first discussed the background of the entrapment-based validation protocols and then conducted a few simple computational experiments to verify the assumptions behind them. The results reveal that entrapment databases may, in some implementations, be a reasonable choice for validation, while the assumptions underpinning validation protocols based on entrapment queries are likely to be violated in practice. This article also highlights the need for well-designed frameworks for validating FDR estimation methods in proteomics.
|
2
|
MSBooster: improving peptide identification rates using deep learning-based features. Nat Commun 2023; 14:4539. [PMID: 37500632 PMCID: PMC10374903 DOI: 10.1038/s41467-023-40129-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 11/07/2022] [Accepted: 07/06/2023] [Indexed: 07/29/2023] Open
Abstract
Peptide identification in liquid chromatography-tandem mass spectrometry (LC-MS/MS) experiments relies on computational algorithms for matching acquired MS/MS spectra against sequences of candidate peptides using database search tools, such as MSFragger. Here, we present a new tool, MSBooster, for rescoring peptide-to-spectrum matches using additional features incorporating deep learning-based predictions of peptide properties, such as LC retention time, ion mobility, and MS/MS spectra. We demonstrate the utility of MSBooster, in tandem with MSFragger and Percolator, in several different workflows, including nonspecific searches (immunopeptidomics), direct identification of peptides from data independent acquisition data, single-cell proteomics, and data generated on an ion mobility separation-enabled timsTOF MS platform. MSBooster is fast, robust, and fully integrated into the widely used FragPipe computational platform.
|
3
|
Modeling Lower-Order Statistics to Enable Decoy-Free FDR Estimation in Proteomics. J Proteome Res 2023; 22:1159-1171. [PMID: 36962508 DOI: 10.1021/acs.jproteome.2c00604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/26/2023]
Abstract
One of the chief objectives in mass spectrometry-based peptide identification in proteomics is the statistical validation of top-scoring peptide-spectrum matches (PSMs) in the form of false discovery rate (FDR) estimation. Existing methods construct a null model that captures the characteristics of incorrect target PSMs to estimate the FDR, most often with the help of decoys. Decoy-based methods, however, increase the computational cost and rely on the difficult-to-verify assumption that decoy PSMs constitute a sufficient and representative sample of the population of possible incorrect target PSMs. On the other hand, the possibility of FDR estimation assisted by the plentiful non-top-scoring PSMs, which are almost always incorrect, has been scarcely explored. In this work, we propose a novel decoy-free procedure for developing null models for top-scoring PSMs using the transformed e-value (TEV) score and the distributions of non-top-scoring target PSMs. The method relies on a theoretically derivable relationship between the parameters of the distributions of lower-order statistics of the TEV score and a necessary empirical optimization to fit a single parameter to actual data. The framework was tested on multiple different data sets and two search engines. We present evidence that our method is comparable to and occasionally outperforms popular decoy-free and decoy-based methods in FDR estimation.
|
4
|
Quality Control for the Target Decoy Approach for Peptide Identification. J Proteome Res 2023; 22:350-358. [PMID: 36648107 DOI: 10.1021/acs.jproteome.2c00423] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/18/2023]
Abstract
Reliable peptide identification is key in mass spectrometry (MS) based proteomics. To this end, the target decoy approach (TDA) has become the cornerstone for extracting a set of reliable peptide-to-spectrum matches (PSMs) that will be used in downstream analysis. Indeed, TDA is now the default method to estimate the false discovery rate (FDR) for a given set of PSMs, and users typically view it as a universal solution for assessing the FDR in the peptide identification step. However, the TDA also relies on a minimal set of assumptions, which are typically never verified in practice. We argue that a violation of these assumptions can lead to poor FDR control, which can be detrimental to any downstream data analysis. We here therefore first clearly spell out these TDA assumptions, and introduce TargetDecoy, a Bioconductor package with all the necessary functionality to control the TDA quality and its underlying assumptions for a given set of PSMs.
|
5
|
False discovery rate: the Achilles' heel of proteogenomics. Brief Bioinform 2022; 23:6582880. [PMID: 35534181 DOI: 10.1093/bib/bbac163] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 12/30/2021] [Revised: 03/14/2022] [Accepted: 04/12/2022] [Indexed: 12/25/2022] Open
Abstract
Proteogenomics refers to the integrated analysis of the genome and proteome that leverages mass-spectrometry (MS)-based proteomics data to improve genome annotations, understand gene expression control through proteoforms and find sequence variants to develop novel insights for disease classification and therapeutic strategies. However, proteogenomic studies often suffer from reduced sensitivity and specificity due to inflated database size. To control the error rates, proteogenomics depends on the target-decoy search strategy, the de-facto method for false discovery rate (FDR) estimation in proteomics. The proteogenomic databases constructed from three- or six-frame nucleotide database translation not only increase the search space and compute-time but also violate the equivalence of target and decoy databases. These searches result in poorer separation between target and decoy scores, leading to stringent FDR thresholds. Understanding these factors and applying modified strategies such as two-pass database search or peptide-class-specific FDR can result in a better interpretation of MS data without introducing additional statistical biases. Based on these considerations, a user can interpret the proteogenomics results appropriately and control false positives and negatives in a more informed manner. In this review, first, we briefly discuss the proteogenomic workflows and limitations in database construction, followed by various considerations that can influence potential novel discoveries in a proteogenomic study. We conclude with suggestions to counter these challenges for better proteogenomic data interpretation.
|
6
|
Abstract
In shotgun proteomics, false discovery rate (FDR) estimation is a necessary step to ensure the quality of accepted peptide-spectrum matches (PSMs) from a database search. Popular statistical validation tools for FDR control tend to rely on target-decoy searching to build empirical, dataset-specific models, which often leads to inaccurate FDR estimates. In this paper, we propose a new approach named common decoy distribution (CDD) to FDR estimation using the idea of a fixed empirical null score distribution derived from millions of peptide tandem mass spectra. To demonstrate the viability of CDD, its stability with respect to noise and the presence of unexpected peptide modifications was evaluated. PeptideProphet-based implementation of CDD was benchmarked against decoy-based PeptideProphet, and both methods exhibited similar accuracy of FDR estimates and retrieval of correct PSMs. The finding of this study calls for a re-evaluation of the necessity of dataset-specific target-decoy searches and illustrates the potential of Big Data approaches for statistical analysis in proteomics.
|
7
|
Activity- and Enrichment-Based Metaproteomics Insights into Active Urease from the Rumen Microbiota of Cattle. Int J Mol Sci 2022; 23:ijms23020817. [PMID: 35055002 PMCID: PMC8776097 DOI: 10.3390/ijms23020817] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/02/2021] [Revised: 12/25/2021] [Accepted: 01/04/2022] [Indexed: 11/27/2022] Open
Abstract
Regulation of microbial urease activity plays a crucial role in improving the utilization efficiency of urea and reducing nitrogen emissions to the environment for ruminant animals. Dealing with the diversity of microbial urease and identifying highly active urease as the target is the key for future regulation. However, the identification of active urease in the rumen is currently limited due to large numbers of uncultured microorganisms. In the present study, we describe an activity- and enrichment-based metaproteomic analysis as an approach for the discovery of highly active urease from the rumen microbiota of cattle. We conducted an optimization method of protein extraction and purification to obtain higher urease activity protein. Cryomilling was the best choice among the six applied protein extraction methods (ultrasonication, bead beating, cryomilling, high-pressure press, freeze-thawing, and protein extraction kit) for obtaining protein with high urease activity. The extracted protein by cryomilling was further enriched through gel filtration chromatography to obtain the fraction with the highest urease activity. Then, by using SDS-PAGE, the gel band including urease was excised and analyzed using LC-MS/MS, searching against a metagenome-derived protein database. Finally, we identified six microbial active ureases from 2225 rumen proteins, and the identified ureases were homologous to those of Fibrobacter and Treponema. Moreover, by comparing the 3D protein structures of the identified ureases and known ureases, we found that the residues in the β-turn of flap regions were nonconserved, which might be crucial in influencing the flexibility of flap regions and urease activity. In conclusion, the active urease from rumen microbes was identified by the approach of activity- and enrichment-based metaproteomics, which provides the target for designing a novel efficient urease inhibitor to regulate rumen microbial urease activity.
|
8
|
Inferring early-life host and microbiome functions by mass spectrometry-based metaproteomics and metabolomics. Comput Struct Biotechnol J 2021; 20:274-286. [PMID: 35024099 PMCID: PMC8718658 DOI: 10.1016/j.csbj.2021.12.012] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/14/2021] [Revised: 12/08/2021] [Accepted: 12/08/2021] [Indexed: 12/17/2022] Open
Abstract
Humans have a long-standing coexistence with microorganisms. In particular, the microbial community that populates the human gastrointestinal tract has emerged as a critical player in governing human health and disease. DNA and RNA sequencing techniques that map taxonomical composition and genomic potential of the gut community have become invaluable for microbiome research. However, deriving a biochemical understanding of how activities of the gut microbiome shape host development and physiology requires an expanded experimental design that goes beyond these approaches. In this review, we explore advances in high-throughput techniques based on liquid chromatography-mass spectrometry. These omics methods for the identification of proteins and metabolites have enabled direct characterisation of gut microbiome functions and the crosstalk with the host. We discuss current metaproteomics and metabolomics workflows for producing functional profiles, the existing methodological challenges and limitations, and recent studies utilising these techniques with a special focus on early life gut microbiome.
|
9
|
Deep learning for peptide identification from metaproteomics datasets. J Proteomics 2021; 247:104316. [PMID: 34246788 DOI: 10.1016/j.jprot.2021.104316] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/14/2021] [Revised: 06/02/2021] [Accepted: 06/18/2021] [Indexed: 10/20/2022]
Abstract
Metaproteomics is becoming widely used in microbiome research for gaining insights into the functional state of the microbial community. Current metaproteomics studies are generally based on high-throughput tandem mass spectrometry (MS/MS) coupled with liquid chromatography. In this paper, we proposed a deep-learning-based algorithm, named DeepFilter, for improving peptide identifications from a collection of tandem mass spectra. The key advantage of the DeepFilter is that it does not need ad hoc training or fine-tuning as in existing filtering tools. DeepFilter is freely available under the GNU GPL license at https://github.com/Biocomputing-Research-Group/DeepFilter. SIGNIFICANCE: The identification of peptides and proteins from MS data involves the computational procedure of searching MS/MS spectra against a predefined protein sequence database and assigning top-scored peptides to spectra. Existing computational tools are still far from being able to extract all the information out of MS/MS data sets acquired from metaproteome samples. Systematical experiment results demonstrate that the DeepFilter identified up to 12% and 9% more peptide-spectrum-matches and proteins, respectively, compared with existing filtering algorithms, including Percolator, Q-ranker, PeptideProphet, and iProphet, on marine and soil microbial metaproteome samples with false discovery rate at 1%. The taxonomic analysis shows that DeepFilter found up to 7%, 10%, and 14% more species from marine, soil, and human gut samples compared with existing filtering algorithms. Therefore, DeepFilter was believed to generalize properly to new, previously unseen peptide-spectrum-matches and can be readily applied in peptide identification from metaproteomics data.
|
10
|
A Critical Review of Bottom-Up Proteomics: The Good, the Bad, and the Future of this Field. Proteomes 2020; 8:proteomes8030014. [PMID: 32640657 PMCID: PMC7564415 DOI: 10.3390/proteomes8030014] [Citation(s) in RCA: 121] [Impact Index Per Article: 30.3] [Received: 06/01/2020] [Revised: 06/25/2020] [Accepted: 07/01/2020] [Indexed: 02/07/2023] Open
Abstract
Proteomics is the field of study that includes the analysis of proteins, from either a basic science perspective or a clinical one. Proteins can be investigated for their abundance, variety of proteoforms due to post-translational modifications (PTMs), and their stable or transient protein–protein interactions. This can be especially beneficial in the clinical setting when studying proteins involved in different diseases and conditions. Here, we aim to describe a bottom-up proteomics workflow from sample preparation to data analysis, including all of its benefits and pitfalls. We also describe potential improvements in this type of proteomics workflow for the future.
|
11
|
Anatomy and evolution of database search engines-a central component of mass spectrometry based proteomic workflows. MASS SPECTROMETRY REVIEWS 2020; 39:292-306. [PMID: 28902424 DOI: 10.1002/mas.21543] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Received: 04/06/2016] [Accepted: 07/05/2017] [Indexed: 06/07/2023]
Abstract
Sequence database search engines are bioinformatics algorithms that identify peptides from tandem mass spectra using a reference protein sequence database. Two decades of development, notably driven by advances in mass spectrometry, have provided scientists with more than 30 published search engines, each with its own properties. In this review, we present the common paradigm behind the different implementations, and its limitations for modern mass spectrometry datasets. We also detail how the search engines attempt to alleviate these limitations, and provide an overview of the different software frameworks available to the researcher. Finally, we highlight alternative approaches for the identification of proteomic mass spectrometry datasets, either as a replacement for, or as a complement to, sequence database search engines.
|
12
|
Bioinformatics Methods for Mass Spectrometry-Based Proteomics Data Analysis. Int J Mol Sci 2020; 21:ijms21082873. [PMID: 32326049 PMCID: PMC7216093 DOI: 10.3390/ijms21082873] [Citation(s) in RCA: 109] [Impact Index Per Article: 27.3] [Received: 03/18/2020] [Revised: 04/16/2020] [Accepted: 04/18/2020] [Indexed: 01/15/2023] Open
Abstract
Recent advances in mass spectrometry (MS)-based proteomics have enabled tremendous progress in the understanding of cellular mechanisms, disease progression, and the relationship between genotype and phenotype. Though many popular bioinformatics methods in proteomics are derived from other omics studies, novel analysis strategies are required to deal with the unique characteristics of proteomics data. In this review, we discuss the current developments in the bioinformatics methods used in proteomics and how they facilitate the mechanistic understanding of biological processes. We first introduce bioinformatics software and tools designed for mass spectrometry-based protein identification and quantification, and then we review the different statistical and machine learning methods that have been developed to perform comprehensive analysis in proteomics studies. We conclude with a discussion of how quantitative protein data can be used to reconstruct protein interactions and signaling networks.
|
13
|
Abstract
The processing of peptide tandem mass spectrometry data involves matching observed spectra against a sequence database. The ranking and calibration of these peptide-spectrum matches can be improved substantially using a machine learning postprocessor. Here, we describe our efforts to speed up one widely used postprocessor, Percolator. The improved software is dramatically faster than the previous version of Percolator, even when using relatively few processors. We tested the new version of Percolator on a data set containing over 215 million spectra and recorded an overall reduction to 23% of the running time as compared to the unoptimized code. We also show that the memory footprint required by these speedups is modest relative to that of the original version of Percolator.
|
14
|
Target-small decoy search strategy for false discovery rate estimation. BMC Bioinformatics 2019; 20:438. [PMID: 31443634 PMCID: PMC6708216 DOI: 10.1186/s12859-019-3034-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/15/2019] [Accepted: 08/19/2019] [Indexed: 11/24/2022] Open
Abstract
Background: One of the most important steps in peptide identification is to estimate the false discovery rate (FDR). The most commonly used method for estimating FDR is the target-decoy search strategy (TDS). While this method is simple and effective, it is time/space-inefficient because it searches a database that is twice as large as the original protein database. This inefficiency problem becomes more evident as protein databases get bigger and bigger. We propose a target-small decoy search strategy and present a rigorous verification that it reduces the database size and search time while retaining the accuracy of TDS. Results: We show that peptide spectrum matches (PSMs) obtained at 1% FDR in TDS overlap ~99% with those in our method. (Considering that 1% FDR is used, 99% overlap means our method is very accurate.) Moreover, our method is more time/space-efficient than TDS. The search time of our method is reduced to only 1/4 of that of TDS when UniProt and its 1/8 decoy database are used. Conclusions: We demonstrate that our method is almost as accurate as TDS and more time/space-efficient than TDS. Since the efficiency of our method is more evident as the database size increases, our method is expected to be useful for identifying peptides in proteogenomics databases constructed from inflated databases using genomic data.
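A minimal Python sketch of the estimation idea summarized in this abstract, under stated assumptions: PSMs come from a concatenated target-decoy search, and when the decoy database is smaller than the target database the decoy count is scaled up by the size ratio. The function name `estimate_fdr`, the `decoy_scale` parameter, and the plain linear scaling are illustrative assumptions, not the paper's exact estimator.

```python
from typing import List, Tuple


def estimate_fdr(psms: List[Tuple[float, bool]], threshold: float,
                 decoy_scale: float = 1.0) -> float:
    """Estimate the FDR among target PSMs scoring at or above `threshold`.

    psms: (score, is_decoy) pairs, where a higher score is a better match.
    decoy_scale: assumed target/decoy database size ratio
                 (1.0 for a full-size decoy database, 8.0 for a 1/8-size one).
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    # Scale the decoy count to approximate the number of incorrect target hits.
    return min(1.0, decoy_scale * decoys / targets)


# Hypothetical toy data: (score, is_decoy)
psms = [(9.1, False), (8.7, False), (8.2, False), (7.9, True),
        (7.5, False), (6.8, False), (6.1, True), (5.9, False)]
fdr_full = estimate_fdr(psms, threshold=7.0)                    # full-size decoys
fdr_small = estimate_fdr(psms, threshold=7.0, decoy_scale=8.0)  # 1/8-size decoys
```

With so few PSMs the scaled estimate saturates at 1.0, which illustrates why a small decoy database gives a coarser estimate on small data sets: each decoy hit stands in for several expected false target hits.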
|
15
|
Challenges and promise at the interface of metaproteomics and genomics: an overview of recent progress in metaproteogenomic data analysis. Expert Rev Proteomics 2019; 16:375-390. [PMID: 31002542 DOI: 10.1080/14789450.2019.1609944] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Indexed: 02/07/2023]
Abstract
Introduction: The study of microbial communities based on the combined analysis of genomic and proteomic data, called metaproteogenomics, has gained increased research attention in recent years. This relatively young field aims to elucidate the functional and taxonomic interplay of proteins in microbiomes and its implications on human health and the environment. Areas covered: This article reviews bioinformatics methods and software tools dedicated to the analysis of data from metaproteomics and metaproteogenomics experiments. In particular, it focuses on the creation of tailored protein sequence databases, on the optimal use of database search algorithms including methods of error rate estimation, and finally on taxonomic and functional annotation of peptide and protein identifications. Expert opinion: Recently, various promising strategies and software tools have been proposed for handling typical data analysis issues in metaproteomics. However, severe challenges remain that are highlighted and discussed in this article; these include: (i) robust false-positive assessment of peptide and protein identifications, (ii) complex protein inference against a background of highly redundant data, (iii) taxonomic and functional post-processing of identification data, and finally, (iv) the assessment and provision of metrics and tools for quantitative analysis.
|
16
|
Challenges in Clinical Metaproteomics Highlighted by the Analysis of Acute Leukemia Patients with Gut Colonization by Multidrug-Resistant Enterobacteriaceae. Proteomes 2019; 7:proteomes7010002. [PMID: 30626002 PMCID: PMC6473847 DOI: 10.3390/proteomes7010002] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 10/31/2018] [Revised: 12/20/2018] [Accepted: 01/03/2019] [Indexed: 12/25/2022] Open
Abstract
The microbiome has a strong impact on human health and disease and is, therefore, increasingly studied in a clinical context. Metaproteomics is also attracting considerable attention, and such data can be efficiently generated today owing to improvements in mass spectrometry-based proteomics. As we will discuss in this study, there are still major challenges, notably in data analysis, that need to be overcome. Here, we analyzed 212 fecal samples from 56 hospitalized acute leukemia patients with multidrug-resistant Enterobacteriaceae (MRE) gut colonization using metagenomics and metaproteomics. This is one of the largest clinical metaproteomic studies to date, and the first metaproteomic study addressing the gut microbiome in MRE-colonized acute leukemia patients. Based on this substantial data set, we discuss major current limitations in clinical metaproteomic data analysis to provide guidance to researchers in the field. Notably, the results show that public metagenome databases are incomplete and that sample-specific metagenomes improve results. Furthermore, biological variation is tremendous, which challenges clinical study designs and argues that longitudinal measurements of individual patients are a valuable future addition to the analysis of patient cohorts.
|
17
|
Metaproteomics: Sample Preparation and Methodological Considerations. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2019; 1073:187-215. [DOI: 10.1007/978-3-030-12298-0_8] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Indexed: 12/12/2022]
|
18
|
|
19
|
Proteogenomics from a bioinformatics angle: A growing field. MASS SPECTROMETRY REVIEWS 2017; 36:584-599. [PMID: 26670565 PMCID: PMC6101030 DOI: 10.1002/mas.21483] [Citation(s) in RCA: 46] [Impact Index Per Article: 6.6] [Received: 04/20/2015] [Accepted: 09/01/2015] [Indexed: 05/16/2023]
Abstract
Proteogenomics is a research area that combines areas such as proteomics and genomics in a multi-omics setup using both mass spectrometry and high-throughput sequencing technologies. Currently, the main goals of the field are to aid genome annotation or to unravel the proteome complexity. Mass spectrometry based identifications of matching or homologous peptides can further refine gene models. The identification of novel proteoforms is also made possible based on detection of novel translation initiation sites (cognate or near-cognate), novel transcript isoforms, sequence variation or novel (small) open reading frames in intergenic or un-translated genic regions by analyzing high-throughput sequencing data from RNAseq or ribosome profiling experiments. Other proteogenomics studies using a combination of proteomics and genomics techniques focus on antibody sequencing, the identification of immunogenic peptides or venom peptides. Over the years, a growing number of bioinformatics tools and databases became available to help streamline these cross-omics studies. Some of these solutions only help in specific steps of the proteogenomics studies, e.g. building custom sequence databases (based on next generation sequencing output) for mass spectrometry fragmentation spectrum matching. Over the last few years a handful of integrative tools also became available that can execute complete proteogenomics analyses. Some of these are presented as stand-alone solutions, whereas others are implemented in a web-based framework such as Galaxy. In this review we aimed at sketching a comprehensive overview of all the bioinformatics solutions that are available for this growing research area. © 2015 Wiley Periodicals, Inc. Mass Spec Rev 36:584-599, 2017.
|
20
|
Challenges and perspectives of metaproteomic data analysis. J Biotechnol 2017; 261:24-36. [PMID: 28663049 DOI: 10.1016/j.jbiotec.2017.06.1201] [Citation(s) in RCA: 88] [Impact Index Per Article: 12.6] [Received: 02/15/2017] [Revised: 06/20/2017] [Accepted: 06/23/2017] [Indexed: 02/07/2023]
Abstract
In nature microorganisms live in complex microbial communities. Comprehensive taxonomic and functional knowledge about microbial communities supports medical and technical application such as fecal diagnostics as well as operation of biogas plants or waste water treatment plants. Furthermore, microbial communities are crucial for the global carbon and nitrogen cycle in soil and in the ocean. Among the methods available for investigation of microbial communities, metaproteomics can approximate the activity of microorganisms by investigating the protein content of a sample. Although metaproteomics is a very powerful method, issues within the bioinformatic evaluation impede its success. In particular, construction of databases for protein identification, grouping of redundant proteins as well as taxonomic and functional annotation pose big challenges. Furthermore, growing amounts of data within a metaproteomics study require dedicated algorithms and software. This review summarizes recent metaproteomics software and addresses the introduced issues in detail.
|
21
|
Abstract
Scoring functions that assess spectrum similarity play a crucial role in many computational mass spectrometry algorithms. These functions are used to compare an experimentally acquired fragmentation (MS/MS) spectrum against two different types of target MS/MS spectra: either against a theoretical MS/MS spectrum derived from a peptide from a sequence database, or against another, previously acquired MS/MS spectrum. The former is typically encountered in database searching, while the latter is used in spectrum clustering and spectral library searching. The comparison between acquired versus theoretical MS/MS spectra is most commonly performed using cross-correlations or probability derived scoring functions, while the comparison of two acquired MS/MS spectra typically makes use of a normalized dot product, especially in spectrum library search algorithms. In addition to these scoring functions, Pearson's or Spearman's correlation coefficients, mean squared error, or median absolute deviation scores can also be used for the same purpose. Here, we describe and evaluate these scoring functions with regards to their ability to assess spectrum similarity for theoretical versus acquired, and acquired versus acquired spectra.
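As a rough illustration of the normalized dot product mentioned in this abstract, the following Python sketch bins two peak lists by m/z and computes the cosine of the resulting intensity vectors. The function names, the fixed `bin_width`, and the simple truncating binning are assumptions made for the example; real tools typically add intensity transformations and tolerance handling.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def bin_spectrum(peaks: List[Peak], bin_width: float = 1.0) -> Dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins."""
    binned: Dict[int, float] = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def normalized_dot_product(a: List[Peak], b: List[Peak],
                           bin_width: float = 1.0) -> float:
    """Cosine similarity of two binned spectra (0.0 if either is empty)."""
    va, vb = bin_spectrum(a, bin_width), bin_spectrum(b, bin_width)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm > 0 else 0.0


# Hypothetical toy spectra
query = [(100.2, 3.0), (200.5, 4.0)]
library = [(100.4, 3.0)]
similarity = normalized_dot_product(query, library)
```

The score ranges from 0 (no shared bins) to 1 (identical relative intensities), which is what makes it convenient for comparing two acquired spectra in library searching.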
|
22
|
Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2016; 27:1719-1727. [PMID: 27572102 PMCID: PMC5059416 DOI: 10.1007/s13361-016-1460-7] [Citation(s) in RCA: 225] [Impact Index Per Article: 28.1] [Received: 04/01/2016] [Revised: 06/15/2016] [Accepted: 07/20/2016] [Indexed: 05/21/2023]
Abstract
Percolator is a widely used software tool that increases yield in shotgun proteomics experiments and assigns reliable statistical confidence measures, such as q values and posterior error probabilities, to peptides and peptide-spectrum matches (PSMs) from such experiments. Percolator's processing speed has been sufficient for typical data sets consisting of hundreds of thousands of PSMs. With our new scalable approach, we can now also analyze millions of PSMs in a matter of minutes on a commodity computer. Furthermore, with the increasing awareness for the need for reliable statistics on the protein level, we compared several easy-to-understand protein inference methods and implemented the best-performing method (grouping proteins by their corresponding sets of theoretical peptides and then considering only the best-scoring peptide for each protein) in the Percolator package. We used Percolator 3.0 to analyze the data from a recent study of the draft human proteome containing 25 million spectra (PM:24870542). The source code and Ubuntu, Windows, MacOS, and Fedora binary packages are available from http://percolator.ms/ under an Apache 2.0 license.
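The protein inference rule described in this abstract (group proteins by their sets of theoretical peptides, then score each group by its best-scoring peptide) can be sketched in Python as follows. This is an illustrative reading of the abstract only, not Percolator's code: the function name `infer_protein_groups`, the input dictionaries, and the choice to drop groups with no scored peptide are all assumptions.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple


def infer_protein_groups(protein_peptides: Dict[str, List[str]],
                         peptide_scores: Dict[str, float]
                         ) -> List[Tuple[Tuple[str, ...], float]]:
    """Group proteins with identical theoretical peptide sets and score
    each group by its best-scoring identified peptide (higher is better)."""
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        groups[frozenset(peptides)].append(protein)

    scored = []
    for peptides, proteins in groups.items():
        observed = [peptide_scores[p] for p in peptides if p in peptide_scores]
        if observed:  # assumption: groups without any scored peptide are dropped
            scored.append((tuple(sorted(proteins)), max(observed)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy input: P1 and P2 share the same theoretical peptides.
protein_peptides = {"P1": ["AAK", "CCR"], "P2": ["CCR", "AAK"],
                    "P3": ["DDK"], "P4": ["EEK"]}
peptide_scores = {"AAK": 2.5, "CCR": 4.0, "DDK": 3.1}
protein_groups = infer_protein_groups(protein_peptides, peptide_scores)
```

Scoring by the single best peptide keeps the method easy to interpret; protein-level FDR would then be estimated from the ranked groups, for example with decoy proteins processed the same way.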
|
23
|
Abstract
Omics approaches have become popular in biology as powerful discovery tools, and currently gain in interest for diagnostic applications. Establishing the accurate genome sequence of any organism is easy, but the outcome of its annotation by means of automatic pipelines remains imprecise. Some protein-encoding genes may be missed as soon as they are specific and poorly conserved in a given taxon, while important to explain the specific traits of the organism. Translational starts are also poorly predicted in a relatively important number of cases, thus impacting the protein sequence database used in proteomics, comparative genomics, and systems biology. The use of high-throughput proteomics data to improve genome annotation is an attractive option to obtain a more comprehensive molecular picture of a given organism. Here, protocols for reannotating prokaryote genomes are described based on shotgun proteomics and derivatization of protein N-termini with a positively charged reagent coupled to high-resolution tandem mass spectrometry.
|
24
|
Metaproteomic data analysis at a glance: advances in computational microbial community proteomics. Expert Rev Proteomics 2016; 13:757-69. [DOI: 10.1080/14789450.2016.1209418] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Indexed: 02/07/2023]
|
25
|
Next-generation proteomics faces new challenges in environmental biotechnology. Curr Opin Biotechnol 2016; 38:174-82. [DOI: 10.1016/j.copbio.2016.02.025] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Indexed: 11/29/2022]
|
26
|
|