1
Fishilevich S, Zimmerman S, Kohn A, Iny Stein T, Olender T, Kolker E, Safran M, Lancet D. Genic insights from integrated human proteomics in GeneCards. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw030. [PMID: 27048349 PMCID: PMC4820835 DOI: 10.1093/database/baw030] [Citation(s) in RCA: 102] [Impact Index Per Article: 12.8] [Received: 10/29/2015] [Accepted: 02/23/2016] [Indexed: 11/15/2022]
Abstract
GeneCards is a one-stop shop for searchable human gene annotations (http://www.genecards.org/). Data are automatically mined from ∼120 sources and presented in an integrated web card for every human gene. We report the application of recent advances in proteomics to enhance gene annotation and classification in GeneCards. First, we constructed the Human Integrated Protein Expression Database (HIPED), a unified database of protein abundance in human tissues, based on the publicly available mass spectrometry (MS)-based proteomics sources ProteomicsDB, Multi-Omics Profiling Expression Database, Protein Abundance Across Organisms and The MaxQuant DataBase. The integrated database, residing within GeneCards, compares favourably with its individual sources, covering nearly 90% of human protein-coding genes. For gene annotation and comparisons, we first defined a protein expression vector for each gene, based on normalized abundances in 69 normal human tissues. This vector is portrayed in the GeneCards expression section as a bar graph, allowing visual inspection and comparison. These data are juxtaposed with transcriptome bar graphs. Using the protein expression vectors, we further defined a pairwise metric that helps assess expression-based pairwise proximity. This new metric for finding functional partners complements eight others, including sharing of pathways, gene ontology (GO) terms and domains, implemented in the GeneCards Suite. In parallel, we calculated proteome-based differential expression, highlighting a subset of tissues that overexpress a gene and subserving gene classification. This textual annotation allows users of VarElect, the suite’s next-generation phenotyper, to more effectively discover causative disease variants. Finally, we define the protein–RNA expression ratio and correlation as yet another attribute of every gene in each tissue, adding further annotative information.
The results constitute a significant enhancement of several GeneCards sections and help promote and organize the genome-wide structural and functional knowledge of the human proteome. Database URL: http://www.genecards.org/
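As a rough illustration of the expression-vector ideas in this abstract, the sketch below compares two per-tissue abundance vectors and flags tissues that overexpress a gene. The abstract does not specify the pairwise metric or the differential-expression rule, so the use of Pearson correlation, the fold-over-mean cutoff, the tissue names, the abundance values and the function names are all assumptions made for illustration; this is not the GeneCards implementation.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length expression vectors.
    Assumption: stands in for the (unspecified) GeneCards pairwise metric."""
    n = len(u)
    if n != len(v) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # a flat vector carries no co-expression signal
    return cov / math.sqrt(var_u * var_v)

def overexpressing_tissues(vector, tissues, fold=4.0):
    """Tissues whose abundance is at least `fold` times the mean abundance.
    Assumption: a hypothetical cutoff rule, not the published one."""
    mean = sum(vector) / len(vector)
    if mean <= 0:
        return []
    return [t for t, x in zip(tissues, vector) if x >= fold * mean]

# Hypothetical normalized abundances in four tissues (the paper uses 69).
tissues = ["liver", "brain", "heart", "kidney"]
gene_a = [8.0, 1.0, 1.0, 2.0]
gene_b = [16.0, 2.0, 2.0, 4.0]

print(pearson(gene_a, gene_b))
print(overexpressing_tissues(gene_a, tissues, fold=2.0))
```

A correlation-style measure ignores overall scale, so two genes with proportional tissue profiles score as close partners even when one is far more abundant.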
Affiliation(s)
- Simon Fishilevich
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, 7610001, Israel
- Shahar Zimmerman
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, 7610001, Israel
- Asher Kohn
- LifeMap Sciences Ltd., Tel Aviv 69710, Israel
- Tsippi Iny Stein
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, 7610001, Israel
- Tsviya Olender
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, 7610001, Israel
- Eugene Kolker
- CDO Analytics, Seattle Children's Hospital, Seattle, WA 98101, USA
- Bioinformatics and High-Throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, WA 98101, USA
- Data-Enabled Life Sciences Alliance (DELSA), Seattle, Washington, 98101, USA
- Departments of Biomedical Informatics and Medical Education and Pediatrics, University of Washington School of Medicine, Seattle, WA 98109, USA
- Department of Chemistry and Chemical Biology, Northeastern University College of Science, Boston, MA 02115, USA
- Marilyn Safran
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, 7610001, Israel
- Doron Lancet
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, 7610001, Israel
2
Higdon R, Earl RK, Stanberry L, Hudac CM, Montague E, Stewart E, Janko I, Choiniere J, Broomall W, Kolker N, Bernier RA, Kolker E. The promise of multi-omics and clinical data integration to identify and target personalized healthcare approaches in autism spectrum disorders. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2016; 19:197-208. [PMID: 25831060 DOI: 10.1089/omi.2015.0020] [Citation(s) in RCA: 67] [Impact Index Per Article: 8.4] [Indexed: 12/19/2022]
Abstract
Complex diseases are caused by a combination of genetic and environmental factors, creating a difficult challenge for diagnosis and defining subtypes. This review article describes how distinct disease subtypes can be identified through integration and analysis of clinical and multi-omics data. A broad shift toward molecular subtyping of disease using genetic and omics data has yielded successful results in cancer and other complex diseases. To determine molecular subtypes, patients are first classified by applying clustering methods to different types of omics data, then these results are integrated with clinical data to characterize distinct disease subtypes. An example of this molecular-data-first approach is in research on Autism Spectrum Disorder (ASD), a spectrum of social communication disorders marked by tremendous etiological and phenotypic heterogeneity. In the case of ASD, omics data such as exome sequences and gene and protein expression data are combined with clinical data such as psychometric testing and imaging to enable subtype identification. Novel ASD subtypes have been proposed, such as CHD8, using this molecular subtyping approach. Broader use of molecular subtyping in complex disease research is impeded by data heterogeneity, diversity of standards, and ineffective analysis tools. The future of molecular subtyping for ASD and other complex diseases calls for an integrated resource to identify disease mechanisms, classify new patients, and inform effective treatment options. This in turn will empower and accelerate precision medicine and personalized healthcare.
Affiliation(s)
- Roger Higdon
- 1 Bioinformatics and High-Throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
3
Higdon R, Stewart E, Stanberry L, Haynes W, Choiniere J, Montague E, Anderson N, Yandl G, Janko I, Broomall W, Fishilevich S, Lancet D, Kolker N, Kolker E. MOPED enables discoveries through consistently processed proteomics data. J Proteome Res 2013; 13:107-13. [PMID: 24350770 DOI: 10.1021/pr400884c] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Indexed: 12/28/2022]
Abstract
The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org) is an expanding proteomics resource to enable biological and biomedical discoveries. MOPED aggregates simple, standardized and consistently processed summaries of protein expression and metadata from proteomics (mass spectrometry) experiments from human and model organisms (mouse, worm, and yeast). The latest version of MOPED adds new estimates of protein abundance and concentration as well as relative (differential) expression data. MOPED provides a new updated query interface that allows users to explore information by organism, tissue, localization, condition, experiment, or keyword. MOPED supports the Human Proteome Project's efforts to generate chromosome- and disease-specific proteomes by providing links from proteins to chromosome and disease information as well as many complementary resources. MOPED supports a new omics metadata checklist to harmonize data integration, analysis, and use. MOPED's development is driven by the user community, which spans 90 countries and guides future development that will transform MOPED into a multiomics resource. MOPED encourages users to submit data in a simple format. They can use the metadata checklist to generate a data publication for this submission. As a result, MOPED will provide even greater insights into complex biological processes and systems and enable deeper and more comprehensive biological and biomedical discoveries.
4
Higdon R, Haynes W, Stanberry L, Stewart E, Yandl G, Howard C, Broomall W, Kolker N, Kolker E. Unraveling the Complexities of Life Sciences Data. BIG DATA 2013; 1:42-50. [PMID: 27447037 DOI: 10.1089/big.2012.1505] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 06/06/2023]
Abstract
The life sciences have entered into the realm of big data and data-enabled science, where data can either empower or overwhelm. These data bring the challenges of the 5 Vs of big data: volume, veracity, velocity, variety, and value. Both independently and through our involvement with DELSA Global (Data-Enabled Life Sciences Alliance, DELSAglobal.org), the Kolker Lab (kolkerlab.org) is creating partnerships that identify data challenges and solve community needs. We specialize in solutions to complex biological data challenges, as exemplified by the community resource of MOPED (Model Organism Protein Expression Database, MOPED.proteinspire.org) and the analysis pipeline of SPIRE (Systematic Protein Investigative Research Environment, PROTEINSPIRE.org). Our collaborative work extends into the computationally intensive tasks of analysis and visualization of millions of protein sequences through innovative implementations of sequence alignment algorithms and creation of the Protein Sequence Universe tool (PSU). Pushing into the future together with our collaborators, our lab is pursuing integration of multi-omics data and exploration of biological pathways, as well as assigning function to proteins and porting solutions to the cloud. Big data have come to the life sciences; discovering the knowledge in the data will bring breakthroughs and benefits.
Affiliation(s)
- Roger Higdon
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Winston Haynes
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Larissa Stanberry
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Elizabeth Stewart
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Gregory Yandl
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Chris Howard
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- 5 Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- William Broomall
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Natali Kolker
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Eugene Kolker
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- 6 Departments of Biomedical Informatics & Medical Education and Pediatrics, University of Washington, Seattle, Washington
5
Yadav AK, Kumar D, Dash D. Learning from decoys to improve the sensitivity and specificity of proteomics database search results. PLoS One 2012. [PMID: 23189209 PMCID: PMC3506577 DOI: 10.1371/journal.pone.0050651] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 01/07/2023]
Abstract
The statistical validation of database search results is a complex issue in bottom-up proteomics. The correct and incorrect peptide spectrum match (PSM) scores overlap significantly, making an accurate assessment of true peptide matches challenging. Since the complete separation between the true and false hits is practically never achieved, there is a need for better methods and rescoring algorithms to improve upon the primary database search results. Here we describe the calibration and False Discovery Rate (FDR) estimation of database search scores through a dynamic FDR calculation method, FlexiFDR, which increases both the sensitivity and specificity of search results. Modelling a simple linear regression on the decoy hits for different charge states, the method maximized the number of true positives and reduced the number of false negatives in several standard datasets of varying complexity (18-mix, 49-mix, 200-mix) and a few complex datasets (E. coli and Yeast) obtained from a wide variety of MS platforms. The net positive gain for correct spectral and peptide identifications was up to 14.81% and 6.2% respectively. The approach is applicable to different search methodologies: separate as well as concatenated database search, high mass accuracy, and semi-tryptic and modification searches. FlexiFDR was also applied to Mascot results and showed better performance than before. We have shown that an appropriate threshold learnt from decoys can be very effective in improving the database search results. FlexiFDR adapts itself to different instruments, data types and MS platforms. It learns from the decoy hits and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size.
Affiliation(s)
- Amit Kumar Yadav
- GNR Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, Delhi, India
- Dhirendra Kumar
- GNR Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, Delhi, India
- Debasis Dash
- GNR Knowledge Center for Genome Informatics, CSIR-Institute of Genomics and Integrative Biology, Delhi, India
6
Higdon R, Reiter L, Hather G, Haynes W, Kolker N, Stewart E, Bauman AT, Picotti P, Schmidt A, van Belle G, Aebersold R, Kolker E. IPM: An integrated protein model for false discovery rate estimation and identification in high-throughput proteomics. J Proteomics 2011; 75:116-21. [PMID: 21718813 DOI: 10.1016/j.jprot.2011.06.003] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 03/08/2011] [Revised: 05/28/2011] [Accepted: 06/02/2011] [Indexed: 12/19/2022]
Abstract
In high-throughput mass spectrometry proteomics, peptides and proteins are not simply identified as present or not present in a sample, rather the identifications are associated with differing levels of confidence. The false discovery rate (FDR) has emerged as an accepted means for measuring the confidence associated with identifications. We have developed the Systematic Protein Investigative Research Environment (SPIRE) for the purpose of integrating the best available proteomics methods. Two successful approaches to estimating the FDR for MS protein identifications are the MAYU and our current SPIRE methods. We present here a method to combine these two approaches to estimating the FDR for MS protein identifications into an integrated protein model (IPM). We illustrate the high quality performance of this IPM approach through testing on two large publicly available proteomics datasets. MAYU and SPIRE show remarkable consistency in identifying proteins in these datasets. Still, IPM results in a more robust FDR estimation approach and additional identifications, particularly among low abundance proteins. IPM is now implemented as a part of the SPIRE system.
Affiliation(s)
- Roger Higdon
- Bioinformatics & High-throughput Analysis Laboratory, Seattle, WA, USA.
7
Bauman A, Higdon R, Rapson S, Loiue B, Hogan J, Stacy R, Napuli A, Guo W, van Voorhis W, Roach J, Lu V, Landorf E, Stewart E, Kolker N, Collart F, Myler P, van Belle G, Kolker E. Design and initial characterization of the SC-200 proteomics standard mixture. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2011; 15:73-82. [PMID: 21250827 DOI: 10.1089/omi.2010.0118] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
High-throughput (HTP) proteomics studies generate large amounts of data. Interpretation of these data requires effective approaches to distinguish noise from biological signal, particularly as instrument and computational capacity increase and studies become more complex. Resolving this issue requires validated and reproducible methods and models, which in turn requires complex experimental and computational standards. The absence of appropriate standards and data sets for validating experimental and computational workflows hinders the development of HTP proteomics methods. Most protein standards are simple mixtures of proteins or peptides, or undercharacterized reference standards in which the identity and concentration of the constituent proteins are unknown. The Seattle Children's 200 (SC-200) proposed proteomics standard mixture is the next step toward developing realistic, fully characterized HTP proteomics standards. The SC-200 exhibits a unique modular design to extend its functionality, and consists of 200 proteins of known identities and molar concentrations from 6 microbial genomes, distributed into 10 molar concentration tiers spanning a 1,000-fold range. We describe the SC-200's design, potential uses, and initial characterization. We identified 84% of SC-200 proteins with an LTQ-Orbitrap and 65% with an LTQ-Velos (false discovery rate = 1% for both). There were obvious trends in success rate, sequence coverage, and spectral counts with protein concentration; however, protein identification, sequence coverage, and spectral counts vary greatly within concentration levels.
Affiliation(s)
- Andrew Bauman
- Seattle Children's Research Institute, Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, High-throughput Analysis Core, Seattle, Washington 98109, USA
8
Paddock MN, Bauman AT, Higdon R, Kolker E, Takeda S, Scharenberg AM. Competition between PARP-1 and Ku70 control the decision between high-fidelity and mutagenic DNA repair. DNA Repair (Amst) 2011; 10:338-43. [PMID: 21256093 DOI: 10.1016/j.dnarep.2010.12.005] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Received: 09/17/2010] [Revised: 11/29/2010] [Accepted: 12/13/2010] [Indexed: 12/26/2022]
Abstract
Affinity maturation of antibodies requires a unique process of targeted mutation that allows changes to accumulate in the antibody genes while the rest of the genome is protected from off-target mutations that can be oncogenic. This targeting requires that the same deamination event be repaired either by a mutagenic or a high-fidelity pathway depending on the genomic location. We have previously shown that the BRCT domain of the DNA-damage sensor PARP-1 is required for mutagenic repair occurring in the context of IgH and IgL diversification in the chicken B cell line DT40. Here we show that immunoprecipitation of the BRCT domain of PARP-1 pulls down Ku70 and the DNA-PK complex although the BRCT domain of PARP-1 does not bind DNA, suggesting that this interaction is not DNA dependent. Through sequencing the IgL variable region in PARP-1(-/-) cells that also lack Ku70 or Lig4, we show that Ku70 or Lig4 deficiency restores GCV to PARP-1(-/-) cells and conclude that the mechanism by which PARP-1 is promoting mutagenic repair is by inhibiting high-fidelity repair which would otherwise be mediated by Ku70 and Lig4.
Affiliation(s)
- M N Paddock
- Seattle Children's Hospital Research Institute, 1900 9th Ave., Seattle, WA 98101, USA
9
The antiretroviral lectin cyanovirin-N targets well-known and novel targets on the surface of Entamoeba histolytica trophozoites. EUKARYOTIC CELL 2010; 9:1661-8. [PMID: 20852023 DOI: 10.1128/ec.00166-10] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 02/01/2023]
Abstract
Entamoeba histolytica, the protist that causes amebic dysentery and liver abscess, has a truncated Asn-linked glycan (N-glycan) precursor composed of seven sugars (Man(5)GlcNAc(2)). Here, we show that glycoproteins with unmodified N-glycans are aggregated and capped on the surface of E. histolytica trophozoites by the antiretroviral lectin cyanovirin-N and then replenished from large intracellular pools. Cyanovirin-N cocaps the Gal/GalNAc adherence lectin, as well as glycoproteins containing O-phosphodiester-linked glycans recognized by an anti-proteophosphoglycan monoclonal antibody. Cyanovirin-N inhibits phagocytosis by E. histolytica trophozoites of mucin-coated beads, a surrogate assay for amebic virulence. For technical reasons, we used the plant lectin concanavalin A rather than cyanovirin-N to enrich secreted and membrane proteins for mass spectrometric identification. E. histolytica glycoproteins with occupied N-glycan sites include Gal/GalNAc lectins, proteases, and 17 previously hypothetical proteins. The latter glycoproteins, as well as 50 previously hypothetical proteins enriched by concanavalin A, may be vaccine targets as they are abundant and unique. In summary, the antiretroviral lectin cyanovirin-N binds to well-known and novel targets on the surface of E. histolytica that are rapidly replenished from large intracellular pools.
10
Hather G, Higdon R, Bauman A, von Haller PD, Kolker E. Estimating false discovery rates for peptide and protein identification using randomized databases. Proteomics 2010; 10:2369-76. [PMID: 20391536 DOI: 10.1002/pmic.200900619] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Indexed: 11/10/2022]
Abstract
MS-based proteomics characterizes protein contents of biological samples. The most common approach is to first match observed MS/MS peptide spectra against theoretical spectra from a protein sequence database and then to score these matches. The false discovery rate (FDR) can be estimated as a function of the score by searching together the protein sequence database and its randomized version and comparing the score distributions of the randomized versus nonrandomized matches. This work introduces a straightforward isotonic regression-based method to estimate the cumulative FDRs and local FDRs (LFDRs) of peptide identification. Our isotonic method not only performed as well as other methods used for comparison, but also has the advantages of being: (i) monotonic in the score, (ii) computationally simple, and (iii) not dependent on assumptions about score distributions. We demonstrate the flexibility of our approach by using it to estimate FDRs and LFDRs for protein identification using summaries of the peptide spectra scores. We reconfirmed that several of these methods were superior to a two-peptide rule. Finally, by estimating both the FDRs and LFDRs, we showed for both peptide and protein identification, moderate FDR values (5%) corresponded to large LFDR values (53 and 60%).
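The abstract names isotonic regression as the estimator but gives no formulation, so the following is only a minimal sketch of one way such an estimate could be built. It fits a monotone curve to the decoy indicator with the pool-adjacent-violators algorithm and converts it to a local FDR. The conversion assumes a concatenated search in which incorrect target matches are as frequent as decoy matches at any score; that assumption, the function names and the toy scores are illustrative and are not taken from the paper.

```python
def pava_nondecreasing(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y."""
    blocks = []  # each block holds [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit

def local_fdr(matches):
    """matches: iterable of (score, is_decoy) pairs from a combined search.
    Returns (score, is_decoy, lfdr) tuples ordered from best to worst score."""
    ordered = sorted(matches, key=lambda m: -m[0])
    # P(decoy | score) is forced to be non-decreasing as the score worsens.
    p_decoy = pava_nondecreasing([1.0 if d else 0.0 for _, d in ordered])
    results = []
    for (score, is_decoy), p in zip(ordered, p_decoy):
        # Assumed conversion: false targets are as dense as decoys, so the
        # local FDR among targets is p / (1 - p), capped at 1.
        lfdr = 1.0 if p >= 0.5 else p / (1.0 - p)
        results.append((score, is_decoy, lfdr))
    return results

# Hypothetical peptide-spectrum match scores; True marks a decoy match.
matches = [(9.1, False), (8.4, False), (7.0, True),
           (6.5, False), (5.2, True), (4.8, True)]
for score, is_decoy, lfdr in local_fdr(matches):
    print(score, is_decoy, round(lfdr, 3))
```

The monotone fit is what gives the first advantage listed in the abstract: a better score can never receive a worse error estimate.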
Affiliation(s)
- Gregory Hather
- Bioinformatics & High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, WA 98101, USA
11
Higdon R, Haynes W, Kolker E. Meta-analysis for protein identification: a case study on yeast data. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2010; 14:309-14. [PMID: 20569183 DOI: 10.1089/omi.2010.0034] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Indexed: 01/16/2023]
Abstract
Large amounts of mass spectrometry (MS) proteomics data are now publicly available; however, little attention has been given to how to best combine these data and assess the error rates for protein identification. The objective of this article is to show how variation in the type and amount of data included with each study impacts coverage of the yeast proteome and estimation of the false discovery rate (FDR). Our analysis of a subset of the publicly available yeast data showed that failure to reevaluate the FDR when combining protein IDs from different experiments resulted in an underestimation of the FDR by approximately threefold. A worst-case approximation of the FDR was only slightly larger than estimating the FDR by randomized database matches. The use of a weighted model to emphasize the most informative experimental data provided an increase in the number of IDs at a 1% FDR when compared to other meta-analysis approaches. Also, using an FDR higher than 1% results in a very high rate of false discoveries for IDs above the 1% threshold. Ideally, raw MS data will be made publicly available for complete and consistent reanalysis. In the circumstance that raw data is not available, determining a combined FDR on the basis of the worst-case estimation provides a reasonable approximation of the FDR. When combining experimental results, adding additional experiments results in diminishing and in some cases negative returns on protein identifications. It may be beneficial to include only those experiments generating the most unique identifications due to solid experimental design and sensitive instrumentation.
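To see why the FDR has to be re-evaluated when identification lists are merged, the sketch below computes a worst-case bound for the combined list. The abstract does not define its worst-case approximation, so this sketch assumes that false identifications never coincide across experiments while correct ones may; the function name, protein identifiers and list sizes are hypothetical.

```python
def worst_case_combined_fdr(experiments):
    """experiments: list of (set of protein IDs, FDR of that list).
    Returns (size of merged list, worst-case FDR bound for the merged list).
    Assumption: every false ID is unique to its experiment, so expected
    false IDs add up while true IDs are allowed to overlap."""
    merged = set()
    expected_false = 0.0
    for ids, fdr in experiments:
        merged |= ids
        expected_false += fdr * len(ids)
    if not merged:
        return 0, 0.0
    return len(merged), min(1.0, expected_false / len(merged))

# Three hypothetical experiments, each reporting 100 proteins at 1% FDR,
# with partly overlapping identifications.
experiments = [
    ({f"P{i}" for i in range(0, 100)}, 0.01),
    ({f"P{i}" for i in range(50, 150)}, 0.01),
    ({f"P{i}" for i in range(80, 180)}, 0.01),
]
size, bound = worst_case_combined_fdr(experiments)
print(size, round(bound, 4))
```

In this toy case the merged list holds 180 proteins and up to 3 expected false identifications, so the bound exceeds the 1% quoted for each list on its own.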
Affiliation(s)
- Roger Higdon
- Bioinformatics & High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington 98101, USA
12
Joo JWJ, Na S, Baek JH, Lee C, Paek E. Target-Decoy with Mass Binning: a simple and effective validation method for shotgun proteomics using high resolution mass spectrometry. J Proteome Res 2010; 9:1150-6. [PMID: 19908919 DOI: 10.1021/pr9006377] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 11/29/2022]
Abstract
Shotgun proteomics using mass spectrometry (MS) has become the choice for large-scale peptide and protein identification. The recent development of high-resolution mass spectrometers such as FT-ICR or Orbitrap makes it possible to identify peptides within only a few parts per million (ppm), and it is expected to dramatically improve performance of peptide identification, as compared to low-resolution instruments. To fully exploit such significantly higher mass accuracy, however, appropriate data analysis methods are required. Here, we present a new target-decoy strategy, called Target-Decoy with Mass Binning, utilizing high mass accuracy for peptide identification validation, which remains a challenging problem in MS-based proteomics. When tested on various high-resolution MS data, our method was very effective and yet simple and showed comparable or better performance when compared with other validation methods.
Affiliation(s)
- Jong Wha J Joo
- Korea Institute of Science and Technology, Seoul, Republic of Korea
13
Yu K, Sabelli A, DeKeukelaere L, Park R, Sindi S, Gatsonis CA, Salomon A. Integrated platform for manual and high-throughput statistical validation of tandem mass spectra. Proteomics 2009; 9:3115-25. [PMID: 19526561 DOI: 10.1002/pmic.200800899] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.2] [Indexed: 11/11/2022]
Abstract
As proteomic data sets increase in size and complexity, the necessity for database-centric software systems able to organize, compare, and visualize all the proteomic experiments in a lab grows. We recently developed an integrated platform called high-throughput autonomous proteomic pipeline (HTAPP) for the automated acquisition and processing of quantitative proteomic data, and integration of proteomic results with existing external protein information resources within a lab-based relational database called PeptideDepot. Here, we introduce the peptide validation software component of this system, which combines relational database-integrated electronic manual spectral annotation in Java with a new software tool in the R programming language for the generation of logistic regression spectral models from user-supplied validated data sets and flexible application of these user-generated models in automated proteomic workflows. This logistic regression spectral model uses both variables computed directly from SEQUEST output and deterministic variables based on expert manual validation criteria of spectral quality. In the case of linear quadrupole ion trap (LTQ) or LTQ-FTICR LC/MS data, our logistic spectral model outperformed both XCorr (242% more peptides identified on average) and the X!Tandem E-value (87% more peptides identified on average) at a 1% false discovery rate estimated by decoy database approach.
Affiliation(s)
- Kebing Yu
- Department of Chemistry, Brown University, Providence, RI 02903, USA
14
Shao C, Sun W, Li F, Yang R, Zhang L, Gao Y. Oscore: a combined score to reduce false negative rates for peptide identification in tandem mass spectrometry analysis. JOURNAL OF MASS SPECTROMETRY : JMS 2009; 44:25-31. [PMID: 18698557 DOI: 10.1002/jms.1466] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 05/26/2023]
Abstract
Tandem mass spectrometry (MS/MS) has been widely used in proteomics studies. Multiple algorithms have been developed for assessing matches between MS/MS spectra and peptide sequences in databases. However, it is still a challenge to reduce false negative rates without compromising the high confidence of peptide identification. In this study, we developed a score, Oscore, by logistic regression using SEQUEST and AMASS variables to identify fully tryptic peptides. Since these variables showed complicated associations with each other, combining them rather than applying them to a threshold model improved the classification of correct and incorrect peptide identifications. Oscore achieved both a lower false negative rate and a lower false positive rate than PeptideProphet on datasets from 18 known protein mixtures and several proteome-scale samples of different complexity, database size and separation methods. By a three-way comparison among Oscore, PeptideProphet and another logistic regression model which made use of PeptideProphet's variables, the main contributor to the improvement made by Oscore is discussed.
Affiliation(s)
- Chen Shao
- Department of Physiology and Pathophysiology, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing, China
|
15
|
Higdon R, Hogan JM, Kolker N, van Belle G, Kolker E. Experiment-specific estimation of peptide identification probabilities using a randomized database. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2008; 11:351-65. [PMID: 18092908 DOI: 10.1089/omi.2007.0040] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022]
Abstract
Determining the error rate for peptide and protein identification accurately and reliably is necessary to enable evaluation and cross-comparisons of high-throughput proteomics experiments. Currently, peptide identification is based either on preset scoring thresholds or on probabilistic models trained on datasets that are often dissimilar to experimental results. The false discovery rates (FDR) and peptide identification probabilities for these preset thresholds or models often vary greatly across different experimental treatments, organisms, or instruments used in specific experiments. To overcome these difficulties, randomized databases have been used to estimate the FDR. However, the cumulative FDR may include low probability identifications when there are a large number of peptide identifications and exclude high probability identifications when there are few. To overcome this logical inconsistency, this study expands the use of randomized databases to generate experiment-specific estimates of peptide identification probabilities. These experiment-specific probabilities are generated by logistic and Loess regression models of the peptide scores obtained from original and reshuffled database matches. These experiment-specific probabilities are shown to approximate very well the "true" probabilities based on known standard protein mixtures across different experiments. Probabilities generated by the earlier Peptide_Prophet and more recent LIPS models are shown to differ significantly from this study's experiment-specific probabilities, especially for unknown samples. The experiment-specific probabilities reliably estimate the accuracy of peptide identifications and overcome potential logical inconsistencies of the cumulative FDR. This estimation method is demonstrated using a Sequest database search, LIPS model, and a reshuffled database. However, this approach is generally applicable to any search algorithm, peptide scoring, and statistical model when using a randomized database.
Affiliation(s)
- Roger Higdon
- Seattle Children's Hospital and Regional Medical Center, Seattle, WA 98101, USA
|
16
|
Utilization of DNA as a sole source of phosphorus, carbon, and energy by Shewanella spp.: ecological and physiological implications for dissimilatory metal reduction. Appl Environ Microbiol 2007; 74:1198-208. [PMID: 18156329 DOI: 10.1128/aem.02026-07] [Citation(s) in RCA: 95] [Impact Index Per Article: 5.6] [Indexed: 11/20/2022] Open
Abstract
The solubility of orthophosphate (PO4(3-)) in iron-rich sediments can be exceedingly low, limiting the bioavailability of this essential nutrient to microbial populations that catalyze critical biogeochemical reactions. Here we demonstrate that dissolved extracellular DNA can serve as a sole source of phosphorus, as well as carbon and energy, for metal-reducing bacteria of the genus Shewanella. Shewanella oneidensis MR-1, Shewanella putrefaciens CN32, and Shewanella sp. strain W3-18-1 all grew with DNA but displayed different growth rates. W3-18-1 exhibited the highest growth rate with DNA. While strain W3-18-1 displayed Ca2+-independent DNA utilization, both CN32 and MR-1 required millimolar concentrations of Ca2+ for growth with DNA. For S. oneidensis MR-1, the utilization of DNA as a sole source of phosphorus is linked to the activities of extracellular phosphatase(s) and a Ca2+-dependent nuclease(s), which are regulated by phosphorus availability. Mass spectrometry analysis of the extracellular proteome of MR-1 identified one putative endonuclease (SO1844), a predicted UshA (bifunctional UDP-sugar hydrolase/5' nucleotidase), a predicted PhoX (calcium-activated alkaline phosphatase), and a predicted CpdB (bifunctional 2',3' cyclic nucleotide 2' phosphodiesterase/3' nucleotidase), all of which could play important roles in the extracellular degradation of DNA under phosphorus-limiting conditions. Overall, the results of this study suggest that the ability to use exogenous DNA as the sole source of phosphorus is widespread among the shewanellae, and perhaps among all prokaryotes, and may be especially important for nutrient cycling in metal-reducing environments.
|
17
|
Kolker E, Hogan JM, Higdon R, Kolker N, Landorf E, Yakunin AF, Collart FR, van Belle G. Development of BIATECH-54 standard mixtures for assessment of protein identification and relative expression. Proteomics 2007; 7:3693-8. [PMID: 17890649 DOI: 10.1002/pmic.200700088] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/10/2022]
Abstract
Mixtures of known proteins have been very useful in the assessment and validation of methods for high-throughput (HTP) MS (MS/MS) proteomics experiments. However, these test mixtures have generally consisted of few proteins at near-equal concentrations or of a single protein at varied concentrations. Such mixtures are too simple to effectively assess the validity of error rates for protein identification and differential expression in HTP MS/MS studies. This work aimed at overcoming these limitations and simulating studies of complex biological samples. We introduced a pair of 54-protein standard mixtures of variable concentrations with up to a 1000-fold dynamic range in concentration and up to ten-fold expression ratios with additional negative controls (infinite expression ratios). These test mixtures comprised 16 off-the-shelf Sigma-Aldrich proteins and 38 Shewanella oneidensis proteins produced in-house. The standard proteins were systematically distributed into three main concentration groups (high, medium, and low) and then the concentrations were varied differently for each mixture within the groups to generate different expression ratios. The mixtures were analyzed with both low mass accuracy LCQ and high mass accuracy FT-LTQ instruments. In addition, these 54 standard proteins closely follow the molecular weight distributions of both bacterial and human proteomes. As a result, these new standard mixtures allow for a much more realistic assessment of approaches for protein identification and label-free differential expression than previous mixtures. Finally, the methodology and experimental design developed in this work can be readily applied in the future to the development of more complex standard mixtures for HTP proteomics studies.
|
18
|
Abstract
MOTIVATION: Tandem mass-spectrometry of trypsin digests, followed by database searching, is one of the most popular approaches in high-throughput proteomics studies. Peptides are considered identified if they pass certain scoring thresholds. To avoid false positive protein identification, ≥ 2 unique peptides identified within a single protein are generally recommended. Still, in a typical high-throughput experiment, hundreds of proteins are identified only by a single peptide. We introduce here a method for distinguishing between true and false identifications among single-hit proteins. The approach is based on randomized database searching and usage of logistic regression models with cross-validation. This approach is implemented to analyze three bacterial samples, enabling recovery of 68-98% of the correct single-hit proteins with an error rate of < 2%. This results in a 22-65% increase in the number of identified proteins. Identifying true single-hit proteins will lead to discovering many crucial regulators, biomarkers and other low abundance proteins. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
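As a rough illustration of the two-peptide rule mentioned in this abstract, the sketch below groups peptide identifications by protein and separates proteins supported by at least two unique peptides from single-hit proteins. The function name `split_by_peptide_support`, the `(protein, peptide)` tuple layout, and the sample identifiers are hypothetical; the sketch does not reproduce the authors' logistic regression method for rescuing single-hit proteins.

```python
from collections import defaultdict


def split_by_peptide_support(identifications):
    """Partition proteins by how many unique peptides support them.

    identifications: iterable of (protein_id, peptide_sequence) pairs
    (a hypothetical input layout chosen for this illustration).
    Returns (multi_hit, single_hit) as two sets of protein ids.
    """
    peptides_by_protein = defaultdict(set)
    for protein, peptide in identifications:
        # A set keeps each peptide once, so repeated spectra of the
        # same peptide do not count as extra evidence.
        peptides_by_protein[protein].add(peptide)

    multi_hit = {p for p, peps in peptides_by_protein.items() if len(peps) >= 2}
    single_hit = {p for p, peps in peptides_by_protein.items() if len(peps) == 1}
    return multi_hit, single_hit


if __name__ == "__main__":
    ids = [
        ("P1", "AAK"), ("P1", "CCR"),   # two unique peptides
        ("P2", "DDK"), ("P2", "DDK"),   # same peptide seen twice
        ("P3", "EEK"),                  # one peptide
    ]
    multi, single = split_by_peptide_support(ids)
    print(sorted(multi), sorted(single))
```

Under this rule the single-hit set is what would normally be discarded; the abstract's contribution is a way to recover the correct members of that set.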
|
19
|
Van Dellen KL, Chatterjee A, Ratner DM, Magnelli PE, Cipollo JF, Steffen M, Robbins PW, Samuelson J. Unique posttranslational modifications of chitin-binding lectins of Entamoeba invadens cyst walls. EUKARYOTIC CELL 2006; 5:836-48. [PMID: 16682461 PMCID: PMC1459681 DOI: 10.1128/ec.5.5.836-848.2006] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.9] [Indexed: 11/20/2022]
Abstract
Entamoeba histolytica, which causes amebic dysentery and liver abscesses, is spread via chitin-walled cysts. The most abundant protein in the cyst wall of Entamoeba invadens, a model for amebic encystation, is a lectin called EiJacob1. EiJacob1 has five tandemly arrayed, six-Cys chitin-binding domains separated by low-complexity Ser- and Thr-rich spacers. E. histolytica also has numerous predicted Jessie lectins and chitinases, which contain a single, N-terminal eight-Cys chitin-binding domain. We hypothesized that E. invadens cyst walls are composed entirely of proteins with six-Cys or eight-Cys chitin-binding domains and that some of these proteins contain sugars. E. invadens genomic sequences predicted seven Jacob lectins, five Jessie lectins, and three chitinases. Reverse transcription-PCR analysis showed that mRNAs encoding Jacobs, Jessies, and chitinases are increased during E. invadens encystation, while mass spectrometry showed that the cyst wall is composed of an approximately 30:70 mix of Jacob lectins (cross-linking proteins) and Jessie and chitinase lectins (possible enzymes). Three Jacob lectins were cleaved prior to Lys at conserved sites (e.g., TPSVDK) in the Ser- and Thr-rich spacers between chitin-binding domains. A model peptide was cleaved at the same site by papain and E. invadens Cys proteases, suggesting that the latter cleave Jacob lectins in vivo. Some Jacob lectins had O-phosphodiester-linked carbohydrates, which were one to seven hexoses long and had deoxysugars at reducing ends. We concluded that the major protein components of the E. invadens cyst wall all contain chitin-binding domains (chitinases, Jessie lectins, and Jacob lectins) and that the Jacob lectins are differentially modified by site-specific Cys proteases and O-phosphodiester-linked glycans.
Affiliation(s)
- Katrina L Van Dellen
- Department of Molecular and Cell Biology, Boston University Goldman School of Dental Medicine, Boston, MA 02118, USA
|
20
|
Hogan JM, Higdon R, Kolker E. Experimental Standards for High-Throughput Proteomics. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2006; 10:152-7. [PMID: 16901220 DOI: 10.1089/omi.2006.10.152] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]
Abstract
Proteome analysis, utilizing high-throughput proteomics approaches, involves studying proteins that a whole organism (or specific tissue or cellular compartment) expresses under certain conditions. Intrinsic difficulties of these studies, as well as the enormous volumes of data they typically produce, make proteome analysis and interpretation very difficult. As with any high-throughput approach, proteomics experiments should be carefully designed, analyzed, and verified. In addition to computational standards, experimental standards (simple and complex mixtures of known proteins) for high-throughput proteomics have to be developed and utilized. This article discusses such experimental standards and their implementations.
Affiliation(s)
- Jason M Hogan
- The BIATECH Institute, Bothell, Washington 98011, USA
|
21
|
Kolker E, Higdon R, Hogan JM. Protein identification and expression analysis using mass spectrometry. Trends Microbiol 2006; 14:229-35. [PMID: 16603360 DOI: 10.1016/j.tim.2006.03.005] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Received: 12/19/2005] [Revised: 03/02/2006] [Accepted: 03/22/2006] [Indexed: 11/28/2022]
Abstract
The identification and quantification of the proteins that a whole organism expresses under certain conditions is a main focus of high-throughput proteomics. Advanced proteomics approaches generate new biologically relevant data and potent hypotheses. A practical report of what proteome studies can and cannot accomplish in common laboratory settings is presented here. The review discusses the most popular tandem mass-spectrometry-based methods and focuses on how to produce reliable results. A step-by-step description of proteome experiments is given, including sample preparation, digestion, labeling, liquid chromatography, data processing, database searching and statistical analysis. The difficulties and bottlenecks of proteome analysis are addressed and the requirements for further improvements are discussed. Several diverse high-throughput proteomics-based studies of microorganisms are described.
Affiliation(s)
- Eugene Kolker
- The BIATECH Institute, 19310 North Creek Parkway, Suite 115, Bothell, WA 98011, USA.
|
22
|
Higdon R, Hogan JM, Van Belle G, Kolker E. Randomized sequence databases for tandem mass spectrometry peptide and protein identification. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2006; 9:364-79. [PMID: 16402894 DOI: 10.1089/omi.2005.9.364] [Citation(s) in RCA: 72] [Impact Index Per Article: 4.0] [Indexed: 11/12/2022]
Abstract
Tandem mass spectrometry (MS/MS) combined with database searching is currently the most widely used method for high-throughput peptide and protein identification. Many different algorithms, scoring criteria, and statistical models have been used to identify peptides and proteins in complex biological samples, and many studies, including our own, describe the accuracy of these identifications, using at best generic terms such as "high confidence." False positive identification rates for these criteria can vary substantially with changing organisms under study, growth conditions, sequence databases, experimental protocols, and instrumentation; therefore, study-specific methods are needed to estimate the accuracy (false positive rates) of these peptide and protein identifications. We present and evaluate methods for estimating false positive identification rates based on searches of randomized databases (reversed and reshuffled). We examine the use of separate searches of a forward then a randomized database and combined searches of a randomized database appended to a forward sequence database. Estimated error rates from randomized database searches are first compared against actual error rates from MS/MS runs of known protein standards. These methods are then applied to biological samples of the model microorganism Shewanella oneidensis strain MR-1. Based on the results obtained in this study, we recommend the use of combined searches of a reshuffled database appended to a forward sequence database as a means of providing quantitative estimates of false positive identification rates of peptides and proteins. This will allow researchers to set criteria and thresholds to achieve a desired error rate and provide the scientific community with direct and quantifiable measures of peptide and protein identification accuracy as opposed to vague assessments such as "high confidence."
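The randomized-database idea in this abstract can be sketched in a few lines. The example below builds reversed and reshuffled decoy sequences and estimates a false positive rate from a combined search as the ratio of decoy to forward matches above a score threshold. That ratio is one common convention and is an assumption here; the abstract does not state the exact estimator the authors used. The function names, the `(score, is_decoy)` layout, and the sample values are likewise hypothetical.

```python
import random


def reversed_decoy(sequence):
    """Return the protein sequence read backwards."""
    return sequence[::-1]


def reshuffled_decoy(sequence, rng):
    """Return the same residues in random order (composition preserved)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def estimate_false_positive_rate(matches, threshold):
    """Estimate the error rate among forward matches at a score threshold.

    matches: iterable of (score, is_decoy) pairs from a combined search of
    a forward database with a randomized database appended.
    Assumed estimator: decoy hits / forward hits at or above the threshold.
    """
    forward = sum(1 for score, is_decoy in matches
                  if score >= threshold and not is_decoy)
    decoy = sum(1 for score, is_decoy in matches
                if score >= threshold and is_decoy)
    if forward == 0:
        return 0.0  # no accepted forward matches, nothing to estimate
    return decoy / forward


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the shuffle is repeatable
    print(reversed_decoy("PEPTIDEK"))
    print(reshuffled_decoy("PEPTIDEK", rng))
    hits = [(3.0, False), (2.5, False), (2.0, True), (1.0, False)]
    print(estimate_false_positive_rate(hits, threshold=2.0))
```

Raising the threshold until the estimate falls below a target value is how such an estimate would typically be used to set acceptance criteria.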
Affiliation(s)
- Roger Higdon
- The BIATECH Institute, 19310 N. Creek Parkway South, Suite 115, Bothell, WA 98011, USA
|
23
|
Hogan JM, Higdon R, Kolker N, Kolker E. Charge State Estimation for Tandem Mass Spectrometry Proteomics. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2005; 9:233-50. [PMID: 16209638 DOI: 10.1089/omi.2005.9.233] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022]
Abstract
High-throughput protein analysis by tandem mass spectrometry produces anywhere from thousands to millions of spectra that are used for peptide and protein identifications. Though each spectrum corresponds to only one charged peptide (ion) state, repetitive database searches of multiple charge states are typically conducted since the resolution of many common mass spectrometers is not sufficient to determine the charge state. The resulting database searches are both error-prone and time-consuming. We describe a straightforward, accurate approach to charge state estimation (CHASTE). CHASTE relies on fragment ion peak distributions, and by using reliable logistic regression models, combines different measurements to improve its accuracy. CHASTE's performance has been validated on data sets, comprised of known peptide dissociation spectra, obtained by replicate analyses of our earlier developed protein standard mixture using ion trap mass spectrometers at different laboratories. CHASTE was able to reduce the number of needed database searches by at least 60% and the number of redundant searches by at least 90% virtually without any informational loss. This greatly alleviates one of the major bottlenecks in high-throughput peptide and protein identifications. Thresholds and parameter estimates can be tailored to specific analysis situations, pipelines, and instrumentations. CHASTE was implemented in Java GUI-based and command-line-based interfaces.
|