51
|
Hugo A, Baxter DJ, Cannon WR, Kalyanaraman A, Kulkarni G, Callister SJ. Proteotyping of microbial communities by optimization of tandem mass spectrometry data interpretation. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2012:225-234. [PMID: 22174278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
We report the development of a novel high performance computing method for the identification of proteins from unknown (environmental) samples. The method uses computational optimization to provide an effective way to control the false discovery rate for environmental samples and complements de novo peptide sequencing. Furthermore, the method provides information based on the expressed protein in a microbial community, and thus complements DNA-based identification methods. Testing on blind samples demonstrates that the method provides 79-95% overlap with analogous results from searches involving only the correct genomes. We provide scaling and performance evaluations for the software that demonstrate the ability to carry out large-scale optimizations on 1258 genomes containing 4.2M proteins.
|
52
|
Schäfer M, Lkhagvasuren O, Klein HU, Elling C, Wüstefeld T, Müller-Tidow C, Zender L, Koschmieder S, Dugas M, Ickstadt K. Integrative analyses for omics data: a Bayesian mixture model to assess the concordance of ChIP-chip and ChIP-seq measurements. JOURNAL OF TOXICOLOGY AND ENVIRONMENTAL HEALTH. PART A 2012; 75:461-470. [PMID: 22686305 DOI: 10.1080/15287394.2012.674914] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 06/01/2023]
Abstract
The analysis of different variations in genomics, transcriptomics, epigenomics, and proteomics has increased considerably in recent years. This is especially due to the success of microarray and, more recently, sequencing technology. Apart from understanding mechanisms of disease pathogenesis on a molecular basis, for example in cancer research, the challenge of analyzing such different data types in an integrated way has also become increasingly important for the validation of new sequencing technologies with maximum resolution. For this purpose, this contribution proposes a methodological framework for comparing these technologies with microarray techniques when sample sizes are very small because of the high costs of experiments. Based on an adaptation of the externally centered correlation coefficient (Schäfer et al. 2009), it is demonstrated how a Bayesian mixture model can be applied to compare and classify measurements of histone acetylation that stem from chromatin immunoprecipitation combined with either microarray (ChIP-chip) or sequencing techniques (ChIP-seq) for the identification of DNA fragments. Here, the murine hematopoietic cell line 32D, which was transduced with the oncogene BCR-ABL, the hallmark of chronic myeloid leukemia, was characterized. Cells were compared to mock-transduced cells as control. Activation or inhibition of other genes by histone modifications induced by the oncogene is considered critical in such a context for the understanding of the disease.
|
53
|
Oeltze S, Freiler W, Hillert R, Doleisch H, Preim B, Schubert W. Interactive, graph-based visual analysis of high-dimensional, multi-parameter fluorescence microscopy data in toponomics. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2011; 17:1882-1891. [PMID: 22034305 DOI: 10.1109/tvcg.2011.217] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/31/2023]
Abstract
In Toponomics, the function protein pattern in cells or tissue (the toponome) is imaged and analyzed for applications in toxicology, new drug development and patient-drug-interaction. The most advanced imaging technique is robot-driven multi-parameter fluorescence microscopy. This technique is capable of co-mapping hundreds of proteins and their distribution and assembly in protein clusters across a cell or tissue sample by running cycles of fluorescence tagging with monoclonal antibodies or other affinity reagents, imaging, and bleaching in situ. The imaging results in complex multi-parameter data composed of one slice or a 3D volume per affinity reagent. Biologists are particularly interested in the localization of co-occurring proteins, the frequency of co-occurrence and the distribution of co-occurring proteins across the cell. We present an interactive visual analysis approach for the evaluation of multi-parameter fluorescence microscopy data in toponomics. Multiple, linked views facilitate the definition of features by brushing multiple dimensions. The feature specification result is linked to all views establishing a focus+context visualization in 3D. In a new attribute view, we integrate techniques from graph visualization. Each node in the graph represents an affinity reagent while each edge represents two co-occurring affinity reagent bindings. The graph visualization is enhanced by glyphs which encode specific properties of the binding. The graph view is equipped with brushing facilities. By brushing in the spatial and attribute domain, the biologist achieves a better understanding of the function protein patterns of a cell. Furthermore, an interactive table view is integrated which summarizes unique fluorescence patterns. We discuss our approach with respect to a cell probe containing lymphocytes and a prostate tissue section.
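The graph described in this abstract (one node per affinity reagent, one edge per pair of co-occurring bindings) can be illustrated with a minimal sketch. It assumes the image has already been binarized so that each pixel is represented by the set of reagents bound there; the function name and the reagent labels `r1`, `r2`, `r3` are hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(pixels):
    """Count, for every pair of reagents, the pixels where both are bound.

    `pixels` is an iterable of collections of reagent labels (an assumed,
    already-binarized representation of the multi-parameter image).
    """
    edges = Counter()
    for bound in pixels:
        # Sort so that each unordered pair maps to a single edge key.
        for a, b in combinations(sorted(set(bound)), 2):
            edges[(a, b)] += 1
    return edges

# Toy input: three pixels with hypothetical reagent labels.
edges = cooccurrence_edges([{"r1", "r2"}, {"r1", "r2", "r3"}, {"r3"}])
```

The edge weights here are simple co-occurrence frequencies; the glyph-encoded binding properties mentioned in the abstract are not modeled.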
|
54
|
Ng SK, Tan SH. DISCOVERING PROTEIN–PROTEIN INTERACTIONS. J Bioinform Comput Biol 2011; 1:711-41. [PMID: 15290761 DOI: 10.1142/s0219720004000600] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 10/10/2003] [Revised: 12/12/2003] [Accepted: 12/13/2003] [Indexed: 11/18/2022]
Abstract
The ongoing genomics and proteomics efforts have helped identify many new genes and proteins in living organisms. However, simply knowing the existence of genes and proteins does not tell us much about the biological processes in which they participate. Many major biological processes are controlled by protein interaction networks. A comprehensive description of protein–protein interactions is therefore necessary to understand the genetic program of life. In this tutorial, we provide an overview of the various current high-throughput methods for discovering protein–protein interactions, covering both the conventional experimental methods and new computational approaches.
|
55
|
Berg D, Wolff C, Langer R, Schuster T, Feith M, Slotta-Huspenina J, Malinowsky K, Becker KF. Discovery of new molecular subtypes in oesophageal adenocarcinoma. PLoS One 2011; 6:e23985. [PMID: 21966358 PMCID: PMC3179464 DOI: 10.1371/journal.pone.0023985] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Received: 03/10/2011] [Accepted: 07/28/2011] [Indexed: 12/22/2022] Open
Abstract
A large number of patients suffering from oesophageal adenocarcinomas do not respond to conventional chemotherapy; therefore, it is necessary to identify new predictive biomarkers and patient signatures to improve patient outcomes and therapy selections. We analysed 87 formalin-fixed and paraffin-embedded (FFPE) oesophageal adenocarcinoma tissue samples with a reverse phase protein array (RPPA) to examine the expression of 17 cancer-related signalling molecules. Protein expression levels were analysed by unsupervised hierarchical clustering and correlated with clinicopathological parameters and overall patient survival. Proteomic analyses revealed a new, very promising molecular subtype of oesophageal adenocarcinoma patients characterised by low levels of the HSP27 family proteins and high expression of those of the HER family with positive lymph nodes, distant metastases and short overall survival. After confirmation in other independent studies, our results could be the foundation for the development of a Her2-targeted treatment option for this new patient subgroup of oesophageal adenocarcinoma.
|
56
|
|
57
|
Arabnia HR, Tran QN. Improved prediction of MHC class I binders/non-binders peptides through artificial neural network using variable learning rate: SARS corona virus, a case study. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2011; 696:223-9. [PMID: 21431562 PMCID: PMC7123181 DOI: 10.1007/978-1-4419-7046-6_22] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 04/09/2023]
Abstract
A fundamental step in an adaptive immune response to a pathogen or vaccine is the binding of short peptides (also called epitopes) to major histocompatibility complex (MHC) molecules. Various prediction algorithms are used to capture the MHC peptide binding preference, allowing rapid scanning of entire pathogen proteomes for peptides likely to bind MHC and saving cost, effort, and time. However, the number of known binders/non-binders (BNB) to a specific MHC molecule is limited in many cases, which still poses a computational challenge for prediction. The training data should be adequate to predict BNB using any machine learning approach. In this study, a variable learning rate is demonstrated for training an artificial neural network and predicting BNB for small datasets. The approach can be used for large datasets as well. The dataset for different MHC class I alleles for the SARS coronavirus (Tor2 replicase polyprotein 1ab) was used for training and prediction of BNB. A total of 90 datasets (nine different MHC class I alleles with tenfold cross-validation) were retrieved from the IEDB database for BNB. With the fixed learning rate approach, the best value of AROC is 0.65, and in most cases it is 0.5, which indicates poor predictions. With the variable learning rate, the AROC is between 0.806 and 1.0 for 76 of the 90 datasets, between 0.7 and 0.8 for 7 datasets, and between 0.5 and 0.7 for the remaining 7 datasets, which indicates very good performance in most cases.
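To make the AROC figures easier to read: the area under the ROC curve equals the probability that a randomly chosen binder is scored above a randomly chosen non-binder, so 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The sketch below computes it by direct pairwise comparison; it is a generic illustration, not the evaluation code used in the study, and the function name and toy scores are assumptions.

```python
def aroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    `labels` uses 1 for binders and 0 for non-binders.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one binder and one non-binder")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

perfect = aroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])   # binders always ranked higher
chance = aroc([0.5, 0.5], [1, 0])                     # scores carry no information
```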
|
58
|
Abstract
The intracellular levels and spatial localizations of metabolites and peptides reflect the state of a cell and its relationship to its surrounding environment. Moreover, the amounts and dynamics of metabolites and peptides are indicative of normal or pathological cellular conditions. Here we highlight established and evolving strategies for characterizing the metabolome and peptidome of single cells. Focused studies of the chemical composition of individual cells and functionally defined groups of cells promise to provide a greater understanding of cell fate, function and homeostatic balance. Single-cell bioanalytical microanalysis has also become increasingly valuable for examining cellular heterogeneity, particularly in the fields of neuroscience, stem cell biology and developmental biology.
|
59
|
ten Have S, Boulon S, Ahmad Y, Lamond AI. Mass spectrometry-based immuno-precipitation proteomics - the user's guide. Proteomics 2011; 11:1153-9. [PMID: 21365760 PMCID: PMC3708439 DOI: 10.1002/pmic.201000548] [Citation(s) in RCA: 67] [Impact Index Per Article: 5.2] [Received: 08/31/2010] [Revised: 12/07/2010] [Accepted: 12/10/2010] [Indexed: 11/07/2022]
Abstract
Immuno-precipitation (IP) experiments using MS provide a sensitive and accurate way of characterising protein complexes and their response to regulatory mechanisms. Differences in stoichiometry can be determined, and specific binding partners can be reliably identified. The quality control of IP and protein interaction studies has its basis in the biology that is being observed. Is that unusual protein identification a genuine novelty, or an experimental irregularity? Antibodies and the solid matrices used in these techniques isolate not only the target protein and its specific interaction partners but also many non-specific 'contaminants', requiring a structured analysis strategy. These methodological developments and the speed and accuracy of MS machines, which have been increasing consistently in the last 5 years, have expanded the number of proteins identified and the complexity of analysis. The European Science Foundation's Frontiers in Functional Genomics programme 'Quality Control in Proteomics' Workshop provided a forum for disseminating knowledge and experience on this subject. Our aim in this technical brief is to outline clearly, for scientists wanting to carry out this kind of experiment, what in our experience are the best ways to design an IP experiment, to help identify possible pitfalls, to discuss important controls, and to outline how to manage and analyse the large amount of data generated. Detailed experimental methodologies are referenced but not described in the form of protocols.
|
60
|
Halligan BD, Greene AS. Visualize: a free and open source multifunction tool for proteomics data analysis. Proteomics 2011; 11:1058-63. [PMID: 21365761 PMCID: PMC3816356 DOI: 10.1002/pmic.201000556] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.0] [Received: 09/08/2010] [Revised: 11/19/2010] [Accepted: 11/29/2010] [Indexed: 12/25/2022]
Abstract
A major challenge in the field of high-throughput proteomics is the conversion of the large volume of experimental data that is generated into biological knowledge. Typically, proteomics experiments involve the combination and comparison of multiple data sets and the analysis and annotation of these combined results. Although there are some commercial applications that provide some of these functions, there is a need for a free, open source, multifunction tool for advanced proteomics data analysis. We have developed the Visualize program, which allows users to visualize, analyze, and annotate proteomics data; combine data from multiple runs; and quantitate differences between individual runs and combined data sets. Visualize is licensed under GNU GPL and can be downloaded from http://proteomics.mcw.edu/visualize. It is available as compiled client-based executable files for both Windows and Mac OS X platforms as well as Perl source code.
|
61
|
Abstract
High-throughput technologies have enabled a rapid increase in the acquisition of data regarding cellular regulation, such as protein-protein interactions, gene expression profiling, proteomic analyses of changes in protein abundance, and global analyses of posttranslational modifications. The challenge now is for the community to devise adequate standards for assessing reliability and annotation, facilities for storage, mechanisms for sharing, and tools for visualization and analysis. In conjunction with Science (http://www.sciencemag.org/special/data), this issue of Science Signaling tackles some of the key issues related to the data deluge faced by cell signaling researchers.
|
62
|
Abstract
High-throughput experiments in proteomics, such as 2-dimensional gel electrophoresis (2-DE) and mass spectrometry (MS), usually yield high-dimensional data sets of expression values for hundreds or thousands of proteins, which are, however, observed in only a relatively small number of biological samples. Statistical methods for the planning and analysis of experiments are important to avoid false conclusions and to obtain tenable results. In this chapter, the most frequent experimental designs for proteomics experiments are illustrated. In particular, the focus is on studies for the detection of differentially regulated proteins. Furthermore, issues of sample size planning, statistical analysis of expression levels, and methods for data preprocessing are covered.
|
63
|
Cooper B, Feng J, Garrett WM. Relative, label-free protein quantitation: spectral counting error statistics from nine replicate MudPIT samples. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2010; 21:1534-46. [PMID: 20541435 DOI: 10.1016/j.jasms.2010.05.001] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.8] [Received: 01/29/2010] [Revised: 04/30/2010] [Accepted: 05/03/2010] [Indexed: 05/03/2023]
Abstract
Nine replicate samples of peptides from soybean leaves, each spiked with a different concentration of bovine apotransferrin peptides, were analyzed on a mass spectrometer using multidimensional protein identification technology (MudPIT). Proteins were detected from the peptide tandem mass spectra, and the numbers of spectra were statistically evaluated for variation between samples. The results corroborate prior knowledge that combining spectra from replicate samples increases the number of identifiable proteins and that a summed spectral count for a protein increases linearly with increasing molar amounts of protein. Furthermore, statistical analysis of spectral counts for proteins in two- and three-way comparisons between replicates and combined replicates revealed little significant variation arising from run-to-run differences or data-dependent instrument ion sampling that might falsely suggest differential protein accumulation. In these experiments, spectral counting was enabled by PANORAMICS, probability-based software that predicts proteins detected by sets of observed peptides. Three alternative approaches to counting spectra were also evaluated by comparison. As the counting thresholds were changed from weaker to more stringent, the accuracy of ratio determination also changed. These results suggest that thresholds for counting can be empirically set to improve relative quantitation. Altogether, the data confirm the accuracy and reliability of label-free spectral counting in the relative, quantitative analysis of proteins between samples.
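The idea of summing spectral counts over replicates and comparing them between samples can be sketched as follows. This is a simplified illustration under stated assumptions: it does not reproduce PANORAMICS or the thresholds evaluated in the paper, it applies no normalization, and the function names, the `min_count` parameter, and the protein labels are hypothetical.

```python
from collections import Counter

def summed_counts(replicates):
    """Sum per-protein spectral counts over a list of replicate runs."""
    total = Counter()
    for rep in replicates:
        total.update(rep)  # adds counts protein by protein
    return total

def count_ratio(counts_a, counts_b, protein, min_count=1):
    """Ratio of summed counts A/B, or None if either count is below the threshold."""
    a = counts_a.get(protein, 0)
    b = counts_b.get(protein, 0)
    if a < min_count or b < min_count:
        return None
    return a / b

# Toy data with hypothetical protein labels.
sample_a = summed_counts([{"P1": 3, "P2": 1}, {"P1": 5}])
sample_b = summed_counts([{"P1": 4}])
ratio_p1 = count_ratio(sample_a, sample_b, "P1")
ratio_p2 = count_ratio(sample_a, sample_b, "P2")
```

Raising `min_count` mimics a more stringent counting threshold: fewer proteins receive a ratio, but those that do rest on more spectra.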
|
64
|
Suwa M, Ono Y. Computational overview of GPCR gene universe to support reverse chemical genomics study. Methods Mol Biol 2010; 577:41-54. [PMID: 19718507 DOI: 10.1007/978-1-60761-232-2_4] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 02/20/2023]
Abstract
To support high-throughput screening for ligands of G-protein-coupled receptors (GPCRs) using bioinformatics technology, we introduce a database (SEVENS) with genome-scale annotation and software (GRIFFIN) that can simulate GPCR function. SEVENS (http://sevens.cbrc.jp/) is an integrated database that includes GPCR genes that are identified with high accuracy (99.4% sensitivity and 96.6% specificity) from various types of genomes, by a pipeline that integrates such software as a gene finder, a sequence alignment tool, a motif and domain assignment tool, and a transmembrane helix (TMH) predictor. SEVENS provides the user with a genome-scale overview of the "GPCR universe" with detailed information on chromosomal mapping, phylogenetic tree, protein sequence and structure, and experimental evidence, all of which are accessible via a user-friendly interface. GRIFFIN (http://griffin.cbrc.jp/) can predict GPCR and G-protein coupling selectivity induced by ligand binding with high sensitivity and specificity (more than 87% on average), based on a support vector machine (SVM) and a hidden Markov model (HMM). SEVENS and GRIFFIN are expected to contribute to revealing the function of orphan and unknown GPCRs.
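For readers less familiar with the two accuracy measures quoted in this abstract, the standard definitions are sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The short sketch below applies them to purely illustrative counts; the numbers are not from SEVENS or GRIFFIN.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true positive case and one true negative case")
    sensitivity = tp / (tp + fn)   # fraction of real GPCRs that are found
    specificity = tn / (tn + fp)   # fraction of non-GPCRs correctly rejected
    return sensitivity, specificity

# Illustrative counts only.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
```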
|
65
|
Caffrey RE. A review of experimental design best practices for proteomics based biomarker discovery: focus on SELDI-TOF. Methods Mol Biol 2010; 641:167-183. [PMID: 20407947 DOI: 10.1007/978-1-60761-711-2_10] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 05/29/2023]
Abstract
Surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) mass spectrometry is a technique uniquely suited to the study of the urine proteome due to its salt tolerance, high throughput, and small sample requirements. However, due to the extreme sensitivity of the technique, sample collection and storage conditions, as well as instrument protocols and analysis conditions, must be rigorously controlled to ensure that the data generated and collected are accurate and free from artifacts. Robust and reproducible data sets can be generated and compared between clinical sites when experimental protocols are carefully standardized. This chapter aims to review known factors that cause irreproducible results so that experiments may be designed with appropriate sample and process controls for successful biomarker discovery. A suggested protocol follows the review. A number of issues for study design are discussed, and these are generally applicable to biomarker discovery experiments.
|
66
|
Huttenhower C, Myers CL, Hibbs MA, Troyanskaya OG. Computational analysis of the yeast proteome: understanding and exploiting functional specificity in genomic data. Methods Mol Biol 2009; 548:273-93. [PMID: 19521830 DOI: 10.1007/978-1-59745-540-4_15] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 02/25/2023]
Abstract
Modern experimental techniques have produced a wealth of high-throughput data that has enabled the ongoing genomic revolution. As the field continues to integrate experimental and computational analyses of these data, it is essential that performance evaluations of high-throughput results be carried out in a consistent and biologically informative manner. Here, we present an overview of evaluation techniques for high-throughput experimental data and computational methods, and we discuss a number of potential pitfalls in this process. These primarily involve the biological diversity of genomic data, which can be masked or misrepresented in overly simplified global evaluations. We describe systems for preserving information about biological context during dataset evaluation, which can help to ensure that multiple different evaluations are more directly comparable. This biological variety in high-throughput data can also be taken advantage of computationally through data integration and process specificity to produce richer systems-level predictions of cellular function. An awareness of these considerations can greatly improve the evaluation and analysis of any high-throughput experimental dataset.
|
67
|
Eckel-Passow JE, Oberg AL, Therneau TM, Bergen HR. An insight into high-resolution mass-spectrometry data. Biostatistics 2009; 10:481-500. [PMID: 19325168 PMCID: PMC2697344 DOI: 10.1093/biostatistics/kxp006] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Received: 03/16/2007] [Revised: 03/12/2008] [Accepted: 02/23/2009] [Indexed: 11/15/2022] Open
Abstract
Mass spectrometry is a powerful tool with much promise in global proteomic studies. The discipline of statistics offers robust methodologies to extract and interpret high-dimensional mass-spectrometry data and will be a valuable contributor to the field. Here, we describe the process by which data are produced, characteristics of the data, and the analytical preprocessing steps that are taken in order to interpret the data and use it in downstream statistical analyses. Because of the complexity of data acquisition, statistical methods developed for gene expression microarray data are not directly applicable to proteomic data. Areas in need of statistical research for proteomic data include alignment, experimental design, abundance normalization, and statistical analysis.
|
68
|
Zheng G, Li H, Wang C, Sheng Q, Fan H, Yang S, Liu B, Dai J, Zeng R, Xie L. A platform to standardize, store, and visualize proteomics experimental data. Acta Biochim Biophys Sin (Shanghai) 2009; 41:273-9. [PMID: 19352541 DOI: 10.1093/abbs/gmp010] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022] Open
Abstract
With the development of functional genomics research, large-scale proteomics studies are now widespread, presenting significant challenges for data storage, exchange, and analysis. Here we present the Integrated Proteomics Exploring Database (IPED) as a platform for managing proteomics experimental data (both process and result data). IPED is based on the schema of the Proteome Experimental Data Repository (PEDRo), and complies with the General Proteomics Standard (GPS) drafted by the Proteomics Standards Committee of the Human Proteome Organization. In our work, we developed three components for the IPED platform: the IPED client editor, IPED server software, and IPED web interface. The client editor collects experimental data and generates an extensible markup language (XML) data file compliant with PEDRo and GPS; the server software parses the XML data file and loads information into a core database; and the web interface displays experimental results, to provide a convenient graphic representation of data. Given software convenience and data abundance, IPED is a powerful platform for data exchange and presents an important resource for the proteomics community. In its current release, IPED is available at http://www.biosino.org/iped2.
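The client/server data flow described here (an editor writes an XML file, the server parses it and loads a database) can be illustrated with a minimal sketch using only the Python standard library. The element and attribute names (`experiment`, `protein`, `name`, `score`) and the table layout are invented for illustration; they are not the PEDRo or GPS schema, and this is not IPED code.

```python
import sqlite3
import xml.etree.ElementTree as ET

def build_xml(records):
    """Client side: serialize (name, score) pairs into a small XML document."""
    root = ET.Element("experiment")
    for name, score in records:
        ET.SubElement(root, "protein", name=name, score=str(score))
    return ET.tostring(root, encoding="unicode")

def load_xml(xml_text, conn):
    """Server side: parse the XML and insert one row per <protein> element."""
    conn.execute("CREATE TABLE IF NOT EXISTS protein (name TEXT, score REAL)")
    root = ET.fromstring(xml_text)
    rows = [(p.get("name"), float(p.get("score"))) for p in root.iter("protein")]
    conn.executemany("INSERT INTO protein VALUES (?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the core database
loaded = load_xml(build_xml([("P1", 0.9), ("P2", 0.5)]), conn)
```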
|
69
|
Webb-Robertson BJM, McCue LA, Beagley N, McDermott JE, Wunschel DS, Varnum SM, Hu JZ, Isern NG, Buchko GW, Mcateer K, Pounds JG, Skerrett SJ, Liggitt D, Frevert CW. A Bayesian integration model of high-throughput proteomics and metabolomics data for improved early detection of microbial infections. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2009:451-63. [PMID: 19209722 PMCID: PMC4137860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2023]
Abstract
High-throughput (HTP) technologies offer the capability to evaluate the genome, proteome, and metabolome of an organism at a global scale. This opens up new opportunities to define complex signatures of disease that involve signals from multiple types of biomolecules. However, integrating these data types is difficult due to the heterogeneity of the data. We present a Bayesian approach to integration that uses posterior probabilities to assign class memberships to samples using individual and multiple data sources; these probabilities are based on lower-level likelihood functions derived from standard statistical learning algorithms. We demonstrate this approach on microbial infections of mice, where the bronchial alveolar lavage fluid was analyzed by three HTP technologies, two proteomic and one metabolomic. We demonstrate that integration of the three datasets improves classification accuracy to approximately 89% from the best individual dataset at approximately 83%. In addition, we present a new visualization tool called Visual Integration for Bayesian Evaluation (VIBE) that allows the user to observe classification accuracies at the class level and evaluate classification accuracies on any subset of available data types based on the posterior probability models defined for the individual and integrated data.
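One simple way to picture how class posteriors can be formed from several data sources is to multiply a class prior by the per-source likelihoods and normalize. The sketch below does exactly that, under the assumption that the sources are conditionally independent given the class; that assumption, the function name, and the toy numbers are ours, and the paper's actual model may differ.

```python
def integrate_posteriors(likelihoods_by_source, prior):
    """Combine per-source class likelihoods into posterior class probabilities.

    `prior` maps class -> prior probability; each entry of
    `likelihoods_by_source` maps class -> likelihood of one sample under that
    class for one data source. Sources are treated as conditionally
    independent given the class (an assumption of this sketch).
    """
    unnormalized = {}
    for cls, p in prior.items():
        for lik in likelihoods_by_source:
            p *= lik[cls]
        unnormalized[cls] = p
    z = sum(unnormalized.values())
    if z == 0:
        raise ValueError("all classes have zero joint likelihood")
    return {cls: p / z for cls, p in unnormalized.items()}

# Toy example: two hypothetical sources, two classes.
posterior = integrate_posteriors(
    [{"infected": 0.6, "control": 0.4}, {"infected": 0.7, "control": 0.3}],
    prior={"infected": 0.5, "control": 0.5},
)
```

Passing a single-element list gives the posterior for one data source alone, which makes it easy to compare individual and integrated classifications for the same sample.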
|
70
|
Dudley JT, Butte AJ. Identification of discriminating biomarkers for human disease using integrative network biology. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2009:27-38. [PMID: 19209693 PMCID: PMC2749008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2023]
Abstract
There is a strong clinical imperative to identify discerning molecular biomarkers of disease to inform diagnosis, prognosis, and treatment. Ideally, such biomarkers would be drawn from peripheral sources non-invasively to reduce costs and lower potential for complication. Advances in high-throughput genomics and proteomics have vastly increased the space of prospective molecular biomarkers. Consequently, the elucidation of molecular biomarkers of clinical importance often entails a genome- or proteome-wide search for candidates. Here we present a novel framework for the identification of disease-specific protein biomarkers through the integration of biofluid proteomes and inter-disease genomic relationships using a network paradigm. We created a blood plasma biomarker network by linking expression-based genomic profiles from 136 diseases to 1,028 detectable blood plasma proteins. We also created a urine biomarker network by linking genomic profiles from 127 diseases to 577 proteins detectable in urine. Through analysis of these molecular biomarker networks, we find that the majority (> 80%) of putative protein biomarkers are linked to multiple disease conditions. Thus, prospective disease-specific protein biomarkers are found in only a small subset of the biofluid proteomes. These findings illustrate the importance of considering shared molecular pathology across diseases when evaluating biomarker specificity. The proposed framework is amenable to integration with complementary network models of biology, which could further constrain the biomarker candidate space, and establish a role for the understanding of multi-scale, inter-disease genomic relationships in biomarker discovery.
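The statistic reported in this abstract (the share of proteins linked to more than one disease) is straightforward to compute once the network is available as disease-protein links. The following sketch assumes such an edge list; the function names and the toy labels `D1`..`D3` and `P1`..`P3` are hypothetical and carry no biological meaning.

```python
from collections import defaultdict

def diseases_per_protein(edges):
    """Map each protein to the number of distinct diseases it is linked to."""
    linked = defaultdict(set)
    for disease, protein in edges:
        linked[protein].add(disease)
    return {protein: len(diseases) for protein, diseases in linked.items()}

def fraction_shared(edges):
    """Fraction of proteins linked to more than one disease."""
    degree = diseases_per_protein(edges)
    if not degree:
        return 0.0
    return sum(1 for n in degree.values() if n > 1) / len(degree)

# Toy bipartite network: (disease, protein) links.
toy_edges = [("D1", "P1"), ("D2", "P1"), ("D1", "P2"), ("D3", "P3"), ("D2", "P3")]
shared = fraction_shared(toy_edges)
```

Proteins with a degree of exactly one are the candidates for disease-specific biomarkers in this simplified view.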
|
71
|
Mottaz-Brewer HM, Norbeck AD, Adkins JN, Manes NP, Ansong C, Shi L, Rikihisa Y, Kikuchi T, Wong SW, Estep RD, Heffron F, Pasa-Tolic L, Smith RD. Optimization of proteomic sample preparation procedures for comprehensive protein characterization of pathogenic systems. J Biomol Tech 2008; 19:285-295. [PMID: 19183792 PMCID: PMC2628077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2023]
Abstract
Mass spectrometry-based proteomics is a powerful analytical tool for investigating pathogens and their interactions within a host. The sensitivity of such analyses provides broad proteome characterization, but the sample-handling procedures must first be optimized to ensure compatibility with the technique and to maximize the dynamic range of detection. The decision-making process for determining optimal growth conditions, preparation methods, sample analysis methods, and data analysis techniques in our laboratory is discussed herein with consideration of the balance in sensitivity, specificity, and biomass losses during analysis of host-pathogen systems.
|
72
|
Schmidt A, Gehlenborg N, Bodenmiller B, Mueller LN, Campbell D, Mueller M, Aebersold R, Domon B. An integrated, directed mass spectrometric approach for in-depth characterization of complex peptide mixtures. Mol Cell Proteomics 2008; 7:2138-50. [PMID: 18511481 PMCID: PMC2577211 DOI: 10.1074/mcp.m700498-mcp200] [Citation(s) in RCA: 122] [Impact Index Per Article: 7.6] [Received: 10/15/2007] [Revised: 04/25/2008] [Indexed: 11/06/2022] Open
Abstract
LC-MS/MS has emerged as the method of choice for the identification and quantification of protein sample mixtures. For very complex samples such as complete proteomes, the most commonly used LC-MS/MS method, data-dependent acquisition (DDA) precursor selection, is of limited utility. The limited scan speed of current mass spectrometers along with the highly redundant selection of the most intense precursor ions generates a bias in the pool of identified proteins toward those of higher abundance. A directed LC-MS/MS approach that alleviates the limitations of DDA precursor ion selection by decoupling peak detection and sequencing of selected precursor ions is presented. In the first stage of the strategy, all detectable peptide ion signals are extracted from high resolution LC-MS feature maps or aligned sets of feature maps. The selected features or a subset thereof are subsequently sequenced in sequential, non-redundant directed LC-MS/MS experiments, and the MS/MS data are mapped back to the original LC-MS feature map in a fully automated manner. The strategy, implemented on an LTQ-FT MS platform, allowed the specific sequencing of 2,000 features per analysis and enabled the identification of more than 1,600 phosphorylation sites using a single reversed phase separation dimension without the need for time-consuming prefractionation steps. Compared with conventional DDA LC-MS/MS experiments, a substantially higher number of peptides could be identified from a sample, and this increase was more pronounced for low intensity precursor ions.
|
73
|
Guan Y, Myers CL, Lu R, Lemischka IR, Bult CJ, Troyanskaya OG. A genomewide functional network for the laboratory mouse. PLoS Comput Biol 2008; 4:e1000165. [PMID: 18818725 PMCID: PMC2527685 DOI: 10.1371/journal.pcbi.1000165] [Citation(s) in RCA: 98] [Impact Index Per Article: 6.1] [Received: 04/14/2008] [Accepted: 07/21/2008] [Indexed: 11/19/2022] Open
Abstract
Establishing a functional network is invaluable to our understanding of gene function, pathways, and systems-level properties of an organism and can be a powerful resource in directing targeted experiments. In this study, we present a functional network for the laboratory mouse based on a Bayesian integration of diverse genetic and functional genomic data. The resulting network includes probabilistic functional linkages among 20,581 protein-coding genes. We show that this network can accurately predict novel functional assignments and network components and present experimental evidence for predictions related to Nanog homeobox (Nanog), a critical gene in mouse embryonic stem cell pluripotency. An analysis of the global topology of the mouse functional network reveals multiple biologically relevant systems-level features of the mouse proteome. Specifically, we identify the clustering coefficient as a critical characteristic of central modulators that affect diverse pathways as well as genes associated with different phenotype traits and diseases. In addition, a cross-species comparison of functional interactomes on a genomic scale revealed distinct functional characteristics of conserved neighborhoods as compared to subnetworks specific to higher organisms. Thus, our global functional network for the laboratory mouse provides the community with a key resource for discovering protein functions and novel pathway components as well as a tool for exploring systems-level topological and evolutionary features of cellular interactomes. To facilitate exploration of this network by the biomedical research community, we illustrate its application in function and disease gene discovery through an interactive, Web-based, publicly available interface at http://mouseNET.princeton.edu.
Functionally related proteins interact in diverse ways to carry out biological processes, and each protein often participates in multiple pathways. Proteins are therefore organized into a complex network through which different functions of the cell are carried out. An accurate description of such a network is invaluable to our understanding of both the system-level features of a cell and those of an individual biological process. In this study, we used a probabilistic model to combine information from diverse genome-scale studies as well as individual investigations to generate a global functional network for mouse. Our analysis of the global topology of this network reveals biologically relevant systems-level characteristics of the mouse proteome, including conservation of functional neighborhoods and network features characteristic of known disease genes and key transcriptional regulators. We have made this network publicly available for search and dynamic exploration by researchers in the community. Our Web interface enables users to easily generate hypotheses regarding potential functional roles of uncharacterized proteins, investigate possible links between their proteins of interest and disease, and identify new players in specific biological processes.
|
74
|
Mazumder R, Vasudevan S. Structure-guided comparative analysis of proteins: principles, tools, and applications for predicting function. PLoS Comput Biol 2008; 4:e1000151. [PMID: 18818720 PMCID: PMC2515338 DOI: 10.1371/journal.pcbi.1000151] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022] Open
|
75
|
Kim S, Gupta N, Pevzner PA. Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases. J Proteome Res 2008; 7:3354-63. [PMID: 18597511 PMCID: PMC2689316 DOI: 10.1021/pr8001244] [Citation(s) in RCA: 332] [Impact Index Per Article: 20.8] [Indexed: 12/19/2022]
Abstract
A key problem in computational proteomics is distinguishing between correct and false peptide identifications. We argue that evaluating the error rates of peptide identifications is not unlike computing generating functions in combinatorics. We show that the generating functions and their derivatives (spectral energy and spectral probability) represent new features of tandem mass spectra that, similarly to Delta-scores, significantly improve peptide identifications. Furthermore, the spectral probability provides a rigorous solution to the problem of computing statistical significance of spectral identifications. The spectral energy/probability approach improves the sensitivity-specificity tradeoff of existing MS/MS search tools, addresses the notoriously difficult problem of "one-hit-wonders" in mass spectrometry, and often eliminates the need for decoy database searches. We therefore argue that the generating function approach has the potential to increase the number of peptide identifications in MS/MS searches.
|