1
|
Semi-Supervised Learning Using Hierarchical Mixture Models: Gene Essentiality Case Study. Mathematical and Computational Applications 2021. [DOI: 10.3390/mca26020040] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/26/2022]
Abstract
Integrating gene-level data is useful for predicting the role of genes in biological processes. This problem has typically focused on supervised classification, which requires large training sets of positive and negative examples. However, training data sets that are too small for supervised approaches can still provide valuable information. We describe a hierarchical mixture model that uses limited positively labeled gene training data for semi-supervised learning. We focus on the problem of predicting essential genes, where a gene is required for the survival of an organism under particular conditions. We applied cross-validation and found that the inclusion of positively labeled samples in a semi-supervised learning framework with the hierarchical mixture model improves the detection of essential genes compared to unsupervised, supervised, and other semi-supervised approaches. There was also improved prediction performance when genes are incorrectly assumed to be non-essential. Our comparisons indicate that the incorporation of even small amounts of existing knowledge improves the accuracy of prediction and decreases variability in predictions. Although we focused on gene essentiality, the hierarchical mixture model and semi-supervised framework is standard for problems focused on prediction of genes or other features, with multiple data types characterizing the feature, and a small set of positive labels.
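The semi-supervised idea in this abstract can be illustrated with a much simpler stand-in. The sketch below is not the authors' hierarchical mixture model: it assumes a single score per gene, a plain two-component Gaussian mixture, and at least one known positive. The names `semi_supervised_em`, `fit_component`, `log_density` and `VAR_FLOOR` are hypothetical. The only semi-supervised ingredient is that genes labeled as positive keep a responsibility of 1 throughout EM.

```python
import math

VAR_FLOOR = 1e-3  # assumed floor that keeps a component from collapsing onto one point


def log_density(x, weight, mean, var):
    """Log of weight * Normal(x | mean, var)."""
    return (math.log(max(weight, 1e-12))
            - 0.5 * math.log(2.0 * math.pi * var)
            - (x - mean) ** 2 / (2.0 * var))


def fit_component(scores, weights):
    """Weighted mixing proportion, mean and variance of one component."""
    total = max(sum(weights), 1e-12)
    mean = sum(w * x for w, x in zip(weights, scores)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, scores)) / total
    return total / len(scores), mean, max(var, VAR_FLOOR)


def semi_supervised_em(scores, positive_idx, n_iter=50):
    """Return P(positive) per item; labeled positives stay clamped at 1."""
    positives = set(positive_idx)
    # Start with only the labeled items assigned to the positive component.
    resp = [1.0 if i in positives else 0.0 for i in range(len(scores))]
    for _ in range(n_iter):
        # M-step: refit both components from the current responsibilities.
        neg = fit_component(scores, [1.0 - r for r in resp])
        pos = fit_component(scores, resp)
        # E-step: update unlabeled items only.
        for i, x in enumerate(scores):
            if i in positives:
                continue
            diff = log_density(x, *neg) - log_density(x, *pos)
            # Guard against overflow in exp for very unlikely positives.
            resp[i] = 0.0 if diff > 700.0 else 1.0 / (1.0 + math.exp(diff))
    return resp


if __name__ == "__main__":
    toy_scores = [0.1, 0.2, 0.0, 0.15, 2.9, 3.1, 3.0, 3.2]
    print(semi_supervised_em(toy_scores, positive_idx=[4, 5]))
```

Clamping the labeled items anchors the positive component, so even two labels are enough to decide which mixture component means "essential" in this toy setting.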
|
2
|
Abstract
In proteomics, identification of proteins from complex mixtures of proteins extracted from biological samples is an important problem. Among the experimental technologies, Mass-Spectrometry (MS) is the most popular one. Protein identification from MS data typically relies on a "two-step" procedure that identifies the peptides first and then runs a separate protein identification step. In this setup, the interdependence of peptides and proteins is neglected, resulting in relatively inaccurate protein identification. In this article, we propose a Markov chain Monte Carlo (MCMC) based Bayesian hierarchical model, a first of its kind in protein identification, which integrates the two steps and performs joint analysis of proteins and peptides using posterior probabilities. We remove the assumption of independence of proteins by placing clustering group priors on the proteins, based on the assumption that proteins sharing the same biological pathway are likely to be present or absent together and are correlated. Because the complete conditionals of the proposed joint model are tractable, we implement a Gibbs sampling scheme for full posterior inference that provides the estimates and statistical uncertainties of all relevant parameters. The model has better operating characteristics than two existing "one-step" procedures on a range of simulation settings as well as on two well-studied datasets.
Affiliation(s)
- Riten Mitra
- Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40202
- Ryan Gill
- Department of Mathematics, University of Louisville, Louisville, KY 40292
- Sinjini Sikdar
- Department of Biostatistics, University of Florida, Gainesville, FL 32611
- Susmita Datta
- Department of Biostatistics, University of Florida, Gainesville, FL 32611
|
3
|
Zhong J, Wang J, Ding X, Zhang Z, Li M, Wu FX, Pan Y. Protein Inference from the Integration of Tandem MS Data and Interactome Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:1399-1409. [PMID: 28113634 DOI: 10.1109/tcbb.2016.2601618] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 06/06/2023]
Abstract
Since proteins are digested into a mixture of peptides in the preprocessing step of tandem mass spectrometry (MS), it is difficult to determine which specific protein a shared peptide belongs to. In recent studies, besides tandem MS data and peptide identification information, some other information is exploited to infer proteins. Unlike methods that first use only tandem MS data to infer proteins and then use network information to refine them, this study proposes a protein inference method named TMSIN, which uses interactome networks directly. As two interacting proteins should co-exist, it is reasonable to assume that if one of the interacting proteins is confidently inferred in a sample, its interacting partners should also have a high probability of being present in the same sample. Therefore, we can use the neighborhood information of a protein in an interactome network to adjust the probability that the shared peptide belongs to the protein. In TMSIN, a multi-weighted graph is constructed by incorporating the bipartite graph with interactome network information, where the bipartite graph is built with the peptide identification information. Based on multi-weighted graphs, TMSIN adopts an iterative workflow to infer proteins. At each iterative step, the probability that a shared peptide belongs to a specific protein is calculated using Bayes' law, based on the neighbor protein support scores of each protein to which the shared peptide maps. We carried out experiments on yeast data and human data to evaluate the performance of TMSIN in terms of ROC, q-value, and accuracy. The experimental results show that the AUC scores yielded by TMSIN are 0.742 and 0.874 on the yeast and human datasets, respectively, and that TMSIN yields the maximum number of true positives when the q-value is less than or equal to 0.05. The overlap analysis shows that TMSIN is an effective complementary approach for protein inference.
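The Bayes' law step described in this abstract can be pictured with a small sketch. It is not TMSIN's actual update: the abstract does not give the scoring formulas, so the function name `shared_peptide_posteriors`, the use of a support score as an unnormalized likelihood, and the uniform default prior are all assumptions made for illustration.

```python
def shared_peptide_posteriors(support_scores, priors=None):
    """Return P(protein | shared peptide) for each candidate protein.

    support_scores: dict mapping protein id to a non-negative support score,
        treated here as an unnormalized likelihood (an assumption).
    priors: optional dict of prior probabilities; uniform if omitted.
    """
    proteins = list(support_scores)
    if priors is None:
        priors = {p: 1.0 / len(proteins) for p in proteins}
    # Bayes' rule: posterior is proportional to prior times likelihood.
    joint = {p: priors[p] * support_scores[p] for p in proteins}
    total = sum(joint.values())
    if total == 0.0:
        # No support for any candidate: fall back to the priors.
        return dict(priors)
    return {p: joint[p] / total for p in proteins}


if __name__ == "__main__":
    # Hypothetical scores: protein A has three times the neighbor support of B.
    print(shared_peptide_posteriors({"A": 3.0, "B": 1.0}))
```

In an iterative workflow of the kind the abstract describes, posteriors like these would feed back into the support scores on the next pass.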
|
4
|
Kong A, Azencott R. Binary Markov Random Fields and interpretable mass spectra discrimination. Stat Appl Genet Mol Biol 2017; 16. [PMID: 28475101 DOI: 10.1515/sagmb-2016-0019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/15/2022]
Abstract
For mass spectra acquired from cancer patients by MALDI or SELDI techniques, automated discrimination between cancer types or stages has often been implemented by machine learning algorithms. Nevertheless, these techniques typically lack interpretability in terms of biomarkers. In this paper, we propose a new mass spectra discrimination algorithm based on parameterized Markov Random Fields that automatically generates interpretable classifiers with small groups of scored biomarkers. A dataset of 238 MALDI colorectal mass spectra and two datasets of 216 and 253 SELDI ovarian mass spectra, respectively, were used to test our approach. The results show that our approach reaches accuracies of 81% to 100% in discriminating between patients at different colorectal and ovarian cancer stages, and performs as well as or better than previous studies on similar datasets. Moreover, our approach enables efficient planar displays to visualize mass spectra discrimination and has good asymptotic performance for large datasets. Thus, our classifiers should facilitate the choice and planning of further experiments for biological interpretation of cancer discriminating signatures. In our experiments, the number of mass spectra for each colorectal cancer stage is roughly half of that for each ovarian cancer stage, so that we reach lower discrimination accuracy for colorectal cancer than for ovarian cancer.
|
5
|
Sikdar S, Gill R, Datta S. Improving protein identification from tandem mass spectrometry data by one-step methods and integrating data from other platforms. Brief Bioinform 2015; 17:262-9. [PMID: 26141827 DOI: 10.1093/bib/bbv043] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/10/2015] [Indexed: 01/28/2023]
Abstract
MOTIVATION Many approaches have been proposed for the protein identification problem based on tandem mass spectrometry (MS/MS) data. In these experiments, proteins are digested into peptides and the resulting peptide mixture is subjected to mass spectrometry. Some interesting putative peptide features (peaks) are selected from the mass spectra. Following that, the precursor ions undergo fragmentation and are analyzed by MS/MS. The process of identifying peptides from the mass spectra, and from them the constituent proteins in the sample, is called protein identification from MS/MS data. There are many two-step protein identification procedures, reviewed in the literature, which first attempt to identify the peptides in a separate process and then use these results to infer the proteins. However, in recent years, there have been attempts to provide a one-step solution to protein identification, which simultaneously identifies the proteins and the peptides in the sample. RESULTS In this review, we briefly introduce the most popular two-step protein identification procedure, PeptideProphet coupled with ProteinProphet. Following that, we describe the difficulties with two-step procedures and review some recently introduced one-step protein/peptide identification procedures that do not suffer from these issues. The focus of this review is on one-step procedures that are based on statistical likelihood-based models, but some discussion of other one-step procedures is also included. We report comparative performances of one-step and two-step methods, which support the overall superiority of one-step procedures. We also cover some recent efforts to improve protein identification by incorporating other molecular data along with MS/MS data.
|
6
|
Ryu SY. Bioinformatics tools to identify and quantify proteins using mass spectrometry data. Advances in Protein Chemistry and Structural Biology 2014; 94:1-17. [PMID: 24629183 DOI: 10.1016/b978-0-12-800168-4.00001-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 01/27/2023]
Abstract
Proteomics seeks to understand the biological function of an organism by studying its protein expression. Mass spectrometry is used in the field of shotgun proteomics, and it generates mass spectra that are used to identify and quantify proteins in biological samples. In this chapter, we discuss the bioinformatics algorithms used to analyze mass spectrometry data. After briefly describing how mass spectrometry generates data, we illustrate the bioinformatics algorithms and software for protein identification, such as the de novo approach and the database-searching approach. We also discuss the bioinformatics algorithms and software used to quantify proteins and detect differential proteins using isotope-coded affinity tags and label-free mass spectrometry data.
Affiliation(s)
- So Young Ryu
- Stanford Genome Technology Center, Biochemistry Department, Stanford University, Stanford, California, USA.
|
7
|
Yang C, He Z, Yu W. A combinatorial perspective of the protein inference problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:1542-1547. [PMID: 24407311 DOI: 10.1109/tcbb.2013.110] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/03/2023]
Abstract
In a shotgun proteomics experiment, proteins are the most biologically meaningful output. The success of proteomics studies depends on the ability to accurately and efficiently identify proteins. Many methods have been proposed to facilitate the identification of proteins from peptide identification results. However, the relationship between protein identification and peptide identification has not been thoroughly explained before. In this paper, we take a combinatorial perspective on the protein inference problem. We employ combinatorial mathematics to calculate the conditional protein probabilities (protein probability means the probability that a protein is correctly identified) under three assumptions, which lead to a lower bound, an upper bound, and an empirical estimation of protein probabilities, respectively. The combinatorial perspective enables us to obtain an analytical expression for protein inference. Our method achieves results comparable with ProteinProphet in a more efficient manner in experiments on two data sets of standard protein mixtures and two data sets of real samples. Based on our model, we study the impact of unique peptides and degenerate peptides (degenerate peptides are peptides shared by at least two proteins) on protein probabilities. We also study the relationship between our model and ProteinProphet. We name our program ProteinInfer. Its Java source code, our supplementary document and experimental results are available at: http://bioinformatics.ust.hk/proteininfer.
Affiliation(s)
- Chao Yang
- The Hong Kong University of Science and Technology, Hong Kong
- Weichuan Yu
- The Hong Kong University of Science and Technology, Hong Kong
|
8
|
Huang T, Gong H, Yang C, He Z. ProteinLasso: A Lasso regression approach to protein inference problem in shotgun proteomics. Comput Biol Chem 2013; 43:46-54. [PMID: 23385215 DOI: 10.1016/j.compbiolchem.2012.12.008] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 08/26/2012] [Revised: 12/30/2012] [Accepted: 12/30/2012] [Indexed: 11/28/2022]
Abstract
Protein inference is an important issue in proteomics research. Its main objective is to select a proper subset of candidate proteins that best explain the observed peptides. Although many methods have been proposed for solving this problem, several issues such as peptide degeneracy and one-hit wonders remain unsolved. Therefore, the accurate identification of proteins that are truly present in the sample continues to be a challenging task. Based on the concept of peptide detectability, we formulate the protein inference problem as a constrained Lasso regression problem, which can be solved very efficiently through a coordinate descent procedure. The new inference algorithm is named ProteinLasso and uses an ensemble learning strategy to address the sparsity parameter selection problem in the Lasso model. We test the performance of ProteinLasso on three datasets. As shown in the experimental results, ProteinLasso outperforms state-of-the-art protein inference algorithms in terms of both identification accuracy and running efficiency. In addition, we show that ProteinLasso is stable under different parameter specifications. The source code of our algorithm is available at: http://sourceforge.net/projects/proteinlasso.
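As a rough picture of what "constrained Lasso solved by coordinate descent" can look like, the sketch below fits a Lasso with coefficients boxed into [0, 1]. It is not the ProteinLasso code: the box constraint, the objective 0.5 * ||y - Xw||^2 + lam * sum(w), the function name `nonneg_lasso_cd`, and the toy peptide-by-protein matrix are all assumptions chosen for illustration, and the ensemble strategy for choosing the sparsity parameter is not shown.

```python
def nonneg_lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for 0.5*||y - Xw||^2 + lam*sum(w), with 0 <= w <= 1.

    X: list of rows (peptides) by columns (proteins), e.g. detectabilities.
    y: observed peptide probabilities, one per row.
    """
    n_rows, n_cols = len(X), len(X[0])
    w = [0.0] * n_cols
    col_sq = [sum(X[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(n_iter):
        for j in range(n_cols):
            if col_sq[j] == 0.0:
                continue  # protein with no peptides carries no information
            # Correlation of column j with the residual that ignores column j.
            rho = 0.0
            for i in range(n_rows):
                others = sum(X[i][k] * w[k] for k in range(n_cols) if k != j)
                rho += X[i][j] * (y[i] - others)
            # One-dimensional minimizer, then clipped to the box [0, 1].
            w[j] = min(1.0, max(0.0, (rho - lam) / col_sq[j]))
    return w


if __name__ == "__main__":
    # Hypothetical example: peptide 2 is shared by both candidate proteins.
    X = [[1.0, 0.0],
         [1.0, 1.0],
         [0.0, 1.0]]
    y = [0.9, 0.9, 0.0]
    print(nonneg_lasso_cd(X, y, lam=0.1))
```

Because the coefficients are non-negative, the L1 penalty reduces to a linear term, which is why each coordinate update is a shifted least-squares solution followed by clipping.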
Affiliation(s)
- Ting Huang
- School of Software, Dalian University of Technology, China
|
9
|
Shi J, Chen B, Wu FX. Unifying protein inference and peptide identification with feedback to update consistency between peptides. Proteomics 2013; 13:239-247. [PMID: 23111981 DOI: 10.1002/pmic.201200338] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 08/01/2012] [Revised: 10/07/2012] [Accepted: 10/11/2012] [Indexed: 11/11/2022]
Abstract
We first propose a new method to process peptide identification reports from database search engines. Building on it, we then develop a method for unifying protein inference and peptide identification by adding feedback from protein inference to peptide identification. The feedback information is a list of high-confidence proteins, which is used to update an adjacency matrix between peptides. The adjacency matrix is used in the regularization of peptide scores. Logistic regression (LR) is used to compute the probability of peptide identification with the regularized scores. Protein scores are then calculated with the LR probability of peptides. Instead of selecting the best peptide match for each MS/MS spectrum, we select multiple peptides. Tests on two datasets show that the proposed method can robustly assign accurate probabilities to peptides and has higher discriminating power than PeptideProphet in distinguishing correctly and incorrectly identified peptides. Additionally, our method infers more true positive proteins and fewer false positive proteins than ProteinProphet at the same false positive rate. The coverage of inferred proteins is also significantly increased due to the selection of multiple peptides for each MS/MS spectrum and the improvement of their scores by the feedback from the inferred proteins.
Affiliation(s)
- Jinhong Shi
- Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
|
10
|
Li Q, Eng JK, Stephens M. A likelihood-based scoring method for peptide identification using mass spectrometry. Ann Appl Stat 2012. [DOI: 10.1214/12-aoas568] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/19/2022]
|
11
|
Shi J, Wu FX. A feedback framework for protein inference with peptides identified from tandem mass spectra. Proteome Sci 2012; 10:68. [PMID: 23164319 PMCID: PMC3776439 DOI: 10.1186/1477-5956-10-68] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 07/09/2012] [Accepted: 11/02/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND Protein inference is an important computational step in proteomics. There exists a natural nested relationship between protein inference and peptide identification, but these two steps are usually performed separately in existing methods. We believe that both peptide identification and protein inference can be improved by exploiting this nested relationship. RESULTS In this study, a feedback framework is proposed to process peptide identification reports from search engines, and an iterative method is implemented to exemplify the processing of Sequest peptide identification reports according to the framework. The iterative method is verified on two datasets with known validity of proteins and peptides, and compared with ProteinProphet and PeptideProphet. The results show that the iterative method not only infers more true positive and fewer false positive proteins than ProteinProphet, but also identifies more true positive and fewer false positive peptides than PeptideProphet. CONCLUSIONS The proposed iterative method implemented according to the feedback framework can unify and improve the results of peptide identification and protein inference.
Affiliation(s)
- Jinhong Shi
- Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Dr, Saskatoon, Canada.
|
12
|
Huang T, He Z. A linear programming model for protein inference problem in shotgun proteomics. Bioinformatics 2012; 28:2956-62. [PMID: 22954624 DOI: 10.1093/bioinformatics/bts540] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Indexed: 12/25/2022]
Abstract
MOTIVATION Assembling peptides identified from tandem mass spectra into a list of proteins, referred to as protein inference, is an important issue in shotgun proteomics. The objective of protein inference is to find a subset of proteins that are truly present in the sample. Although many methods have been proposed for protein inference, several issues such as peptide degeneracy remain unsolved. RESULTS In this article, we present a linear programming model for protein inference. In this model, we use a transformation of the joint probability that each peptide/protein pair is present in the sample as the variable. Then, both the peptide probability and the protein probability can be expressed as linear combinations of these variables. Based on this simple fact, the protein inference problem is formulated as an optimization problem: minimize the number of proteins with non-zero probabilities under the constraint that the difference between the calculated peptide probability and the peptide probability generated from peptide identification algorithms should be less than some threshold. This model addresses the peptide degeneracy issue by forcing some joint probability variables involving degenerate peptides to be zero in a rigorous manner. The corresponding inference algorithm is named ProteinLP. We test the performance of ProteinLP on six datasets. Experimental results show that our method is competitive with the state-of-the-art protein inference algorithms. AVAILABILITY The source code of our algorithm is available at: https://sourceforge.net/projects/prolp/. CONTACT zyhe@dlut.edu.cn. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics Online.
Affiliation(s)
- Ting Huang
- School of Software, Dalian University of Technology, Dalian 116621, China
|
13
|
|
14
|
Ahrens CH, Brunner E, Qeli E, Basler K, Aebersold R. Generating and navigating proteome maps using mass spectrometry. Nat Rev Mol Cell Biol 2010; 11:789-801. [DOI: 10.1038/nrm2973] [Citation(s) in RCA: 137] [Impact Index Per Article: 9.1] [Indexed: 12/11/2022]
|
15
|
Nesvizhskii AI. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics 2010; 73:2092-123. [PMID: 20816881 DOI: 10.1016/j.jprot.2010.08.009] [Citation(s) in RCA: 372] [Impact Index Per Article: 24.8] [Received: 07/02/2010] [Revised: 08/25/2010] [Accepted: 08/25/2010] [Indexed: 12/18/2022]
Abstract
This manuscript provides a comprehensive review of the peptide and protein identification process using tandem mass spectrometry (MS/MS) data generated in shotgun proteomic experiments. The commonly used methods for assigning peptide sequences to MS/MS spectra are critically discussed and compared, from basic strategies to advanced multi-stage approaches. Particular attention is paid to the problem of false-positive identifications. Existing statistical approaches for assessing the significance of peptide-to-spectrum matches are surveyed, ranging from single-spectrum approaches such as expectation values to global error rate estimation procedures such as false discovery rates and posterior probabilities. The importance of using auxiliary discriminant information (mass accuracy, peptide separation coordinates, digestion properties, etc.) is discussed, and advanced computational approaches for joint modeling of multiple sources of information are presented. This review also includes a detailed analysis of the issues affecting the interpretation of data at the protein level, including the amplification of error rates when going from the peptide to the protein level, and the ambiguities in inferring the identities of sample proteins in the presence of shared peptides. Commonly used methods for computing protein-level confidence scores are discussed in detail. The review concludes with a discussion of several outstanding computational issues.
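The link between posterior probabilities and a global false discovery rate mentioned in this abstract can be shown in a few lines. This is a generic illustration rather than a procedure taken from the review: it assumes each accepted identification carries a well-calibrated posterior probability of being correct, so the expected fraction of errors among accepted items is the mean of one minus the posterior. The function name `estimated_fdr` and the example values are hypothetical.

```python
def estimated_fdr(posteriors, threshold):
    """Expected false discovery rate among identifications with posterior >= threshold.

    Assumes the posteriors are calibrated probabilities of being correct.
    """
    accepted = [p for p in posteriors if p >= threshold]
    if not accepted:
        return 0.0  # nothing accepted, so no false discoveries
    # Each accepted item is wrong with probability (1 - p).
    return sum(1.0 - p for p in accepted) / len(accepted)


if __name__ == "__main__":
    # Hypothetical peptide-level posteriors.
    probs = [0.99, 0.95, 0.90, 0.40]
    print(estimated_fdr(probs, threshold=0.9))
```

Lowering the threshold accepts more identifications at the cost of a higher estimated error rate, which is the trade-off that threshold selection has to manage.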
|
16
|
Protein and gene model inference based on statistical modeling in k-partite graphs. Proc Natl Acad Sci U S A 2010; 107:12101-6. [PMID: 20562346 DOI: 10.1073/pnas.0907654107] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.4] [Indexed: 11/18/2022]
Abstract
One of the major goals of proteomics is the comprehensive and accurate description of a proteome. Shotgun proteomics, the method of choice for the analysis of complex protein mixtures, requires that experimentally observed peptides are mapped back to the proteins they were derived from. This process is also known as protein inference. We present Markovian Inference of Proteins and Gene Models (MIPGEM), a statistical model based on clearly stated assumptions to address the problem of protein and gene model inference for shotgun proteomics data. In particular, we are dealing with dependencies among peptides and proteins using a Markovian assumption on k-partite graphs. We are also addressing the problems of shared peptides and ambiguous proteins by scoring the encoding gene models. Empirical results on two control datasets with synthetic mixtures of proteins and on complex protein samples of Saccharomyces cerevisiae, Drosophila melanogaster, and Arabidopsis thaliana suggest that the results with MIPGEM are competitive with existing tools for protein inference.
|