1
BatchDTA: implicit batch alignment enhances deep learning-based drug-target affinity estimation. Brief Bioinform 2022; 23:6632927. [PMID: 35794723] [DOI: 10.1093/bib/bbac260]
Abstract
Candidate compounds with high binding affinities toward a target protein are likely to be developed as drugs. Deep neural networks (DNNs) have attracted increasing attention for drug-target affinity (DTA) estimation owing to their efficiency. However, the negative impact of batch effects caused by measurement metrics, system technologies and other assay information is seldom discussed when training a DNN model for DTA. Suffering from the data deviation caused by batch effects, DNN models can only be trained on a small amount of 'clean' data, making it challenging for them to provide precise and consistent estimates. We design a batch-sensitive training framework, BatchDTA, to train DNN models. BatchDTA implicitly aligns multiple batches toward the same protein by learning the orders of candidate compounds with respect to the batches, alleviating the impact of batch effects on the DNN models. Extensive experiments demonstrate that BatchDTA improves the accuracy and robustness of four mainstream DNN models on multiple DTA datasets (BindingDB, Davis and KIBA), with a relative improvement of 4.0% in average concordance index. A case study reveals that BatchDTA successfully learns the ranking orders of compounds from multiple batches. BatchDTA can also be applied to fused data collected from multiple sources to achieve further improvement.
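The concordance index reported above is the fraction of compound pairs whose predicted affinities are ordered consistently with the measured ones. A minimal sketch of the metric (illustrative only, not the authors' implementation):

```python
def concordance_index(y_true, y_pred):
    """Fraction of usable pairs ranked in the same order by y_true and y_pred.
    Pairs tied in y_true are skipped; ties in y_pred count 0.5."""
    num, den = 0.0, 0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue  # tied measurements give no ordering information
            den += 1
            prod = (y_pred[i] - y_pred[j]) * (y_true[i] - y_true[j])
            if prod > 0:
                num += 1.0   # concordant pair
            elif prod == 0:
                num += 0.5   # tied prediction
    return num / den

# One discordant pair (items 2 and 3) out of six usable pairs -> 5/6.
ci = concordance_index([1, 2, 3, 4], [0.1, 0.4, 0.3, 0.9])
```

A CI of 1.0 means every pair is ranked correctly; 0.5 is the expected value for random predictions.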
2
Evaluating and minimizing batch effects in metabolomics. Mass Spectrom Rev 2022; 41:421-442. [PMID: 33238061] [DOI: 10.1002/mas.21672]
Abstract
Determining metabolomic differences among samples of different phenotypes is a critical component of metabolomics research. With the rapid advances in analytical tools such as ultrahigh-resolution chromatography and mass spectrometry, an increasing number of metabolites can now be profiled with high quantification accuracy. The increased detectability and accuracy raise the level of stringency required to reduce or control any experimental artifacts that can interfere with the measurement of phenotype-related metabolome changes. One such artifact is the batch effect, which can arise from multiple sources. In this review, we discuss the origins of batch effects, approaches to detect interbatch variations, and methods to correct unwanted data variability due to batch effects. We recognize that minimizing batch effects is an active research area, yet a very challenging task from both experimental and data-processing perspectives. We therefore aim to be critical in describing the performance of each reported method, in the hope of stimulating further studies to improve existing methods or develop new ones.
3
Batch effect reduction of microarray data with dependent samples using an empirical Bayes approach (BRIDGE). Stat Appl Genet Mol Biol 2021; 20:101-119. [PMID: 34905304] [PMCID: PMC9617207] [DOI: 10.1515/sagmb-2021-0020]
Abstract
Batch effects present challenges in the analysis of high-throughput molecular data and are particularly problematic in longitudinal studies when interest lies in identifying genes/features whose expression changes over time, but time is confounded with batch. While many methods to correct for batch effects exist, most assume independence across samples, an assumption that is unlikely to hold in longitudinal microarray studies. We propose Batch effect Reduction of mIcroarray data with Dependent samples usinG Empirical Bayes (BRIDGE), a three-step parametric empirical Bayes approach that leverages technical replicate samples profiled at multiple timepoints/batches, so-called "bridge samples", to inform batch-effect attenuation in longitudinal microarray studies. Extensive simulation studies and an analysis of a real biological data set were conducted to benchmark the performance of BRIDGE against both ComBat and longitudinal ComBat. Our results demonstrate that while all methods facilitate accurate estimates of time effects, BRIDGE outperforms both ComBat and longitudinal ComBat in removing batch effects from data sets with bridging samples, and perhaps as a result shows improved statistical power for detecting genes with a time effect. BRIDGE demonstrated competitive batch-effect reduction in confounded longitudinal microarray studies, in both simulated and real data sets, and may serve as a useful preprocessing method for researchers conducting longitudinal microarray studies that include bridging samples.
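The core idea of using bridge samples to align batches can be illustrated with a crude location-only adjustment on synthetic data (a sketch only; BRIDGE itself is a three-step empirical Bayes procedure, and all numbers below are hypothetical):

```python
import numpy as np

def bridge_center(batch_a, batch_b, bridge_a, bridge_b):
    """Shift batch_b so that its bridge (technical replicate) samples match
    the mean of the corresponding bridge samples in batch_a, per feature."""
    offset = batch_b[bridge_b].mean(axis=0) - batch_a[bridge_a].mean(axis=0)
    return batch_b - offset

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(6, 4))   # batch 1: 6 samples x 4 genes
b = rng.normal(2.0, 1.0, size=(6, 4))   # batch 2: same scale, shifted location
# Suppose samples 0 and 1 were assayed in both batches (the "bridges").
b_adj = bridge_center(a, b, bridge_a=[0, 1], bridge_b=[0, 1])
```

After the shift, the bridge samples have identical per-gene means in both batches, so any remaining difference between non-bridge samples reflects biology plus noise rather than batch location.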
4
Assessing the Batch Effects on Design and Analysis of Equivalence and Noninferiority Studies. Stat Biopharm Res 2019. [DOI: 10.1080/19466315.2019.1679245]
5
Integrative approaches to reconstruct regulatory networks from multi-omics data: A review of state-of-the-art methods. Comput Biol Chem 2019; 83:107120. [PMID: 31499298] [DOI: 10.1016/j.compbiolchem.2019.107120]
Abstract
Data generation using high-throughput technologies has led to the accumulation of diverse types of molecular data. These data have different types (discrete, real, string, etc.) and occur in various formats and sizes. Datasets including gene expression, miRNA expression, protein-DNA binding data (ChIP-Seq/ChIP-chip), mutation data (copy number variation, single nucleotide polymorphisms), annotations, interactions, and association data are among the biological datasets commonly used to study the cellular mechanisms of living organisms. Each of them provides a unique, complementary and partly independent view of the genome and hence embeds essential information about the regulatory mechanisms of genes and their products. Integrating these data and inferring regulatory interactions from them therefore offers system-level biological insight for predicting gene functions and their phenotypic outcomes. To study genome functionality through regulatory networks, different methods have been proposed for collective mining of information from an integrated dataset. We survey here integration methods that reconstruct regulatory networks using state-of-the-art techniques to handle multi-omics (i.e., genomic, transcriptomic, proteomic) and other biological datasets.
6
Abstract
Proteomic patterns derived from mass spectrometry have recently been put forth as potential biomarkers for the early diagnosis of cancer. This approach has generated much excitement, particularly as initial results reported on SELDI profiling of serum suggested that near perfect sensitivity and specificity could be achieved in diagnosing ovarian cancer. However, more recent reports have suggested that much of the observed structure could be due to the presence of experimental bias. A rebuttal to the findings of bias, subtitled “Producers and Consumers”, lists several objections. In this paper, we attempt to address these objections. While we continue to find evidence of experimental bias, we emphasize that the problems found are associated with experimental design and processing, and can be avoided in future studies.
7
Statistical Contributions to Bioinformatics: Design, Modeling, Structure Learning, and Integration. Stat Model 2017; 17:245-289. [PMID: 29129969] [PMCID: PMC5679480] [DOI: 10.1177/1471082x17698255]
Abstract
The advent of high-throughput multi-platform genomics technologies providing whole-genome molecular summaries of biological samples has revolutionized biomedical research. These technologies yield highly structured big data, whose analysis poses significant quantitative challenges. The field of bioinformatics has emerged to deal with these challenges, and comprises many quantitative and biological scientists working together to effectively process these data and extract the treasure trove of information they contain. Statisticians, with their deep understanding of variability and uncertainty quantification, play a key role in these efforts. In this article, we summarize some of the key contributions of statisticians to bioinformatics, focusing on four areas: (1) experimental design and reproducibility, (2) preprocessing and feature extraction, (3) unified modeling, and (4) structure learning and integration. In each of these areas, we highlight key contributions and try to elucidate the statistical principles underlying these methods and approaches. Our goals are to demonstrate major ways in which statisticians have contributed to bioinformatics, to encourage statisticians to get involved early in methods development as new technologies emerge, and to stimulate future methodological work based on the statistical principles elucidated in this article, utilizing all available information to uncover new biological insights.
8
Stratified randomization controls better for batch effects in 450K methylation analysis: a cautionary tale. Front Genet 2014; 5:354. [PMID: 25352862] [PMCID: PMC4195366] [DOI: 10.3389/fgene.2014.00354]
Abstract
BACKGROUND Batch effects in DNA methylation microarray experiments can lead to spurious results if not properly handled during the plating of samples. METHODS Two pilot studies examining the association of genome-wide DNA methylation patterns with obesity in Samoan men were investigated for chip- and row-specific batch effects. For each study, the DNA of 46 obese men and 46 lean men was assayed using Illumina's Infinium HumanMethylation450 BeadChip. In the first study (Sample One), samples from obese and lean subjects were examined on separate chips. In the second study (Sample Two), the samples were balanced on the chips by lean/obese status, age group, and census region. We used the methylumi, wateRmelon, and limma R packages, as well as ComBat, to analyze the data. Principal component analysis and linear regression were employed, respectively, to identify the top principal components and to test for their association with the batches and lean/obese status. To identify differentially methylated positions (DMPs) between obese and lean males at each locus, we used a moderated t-test. RESULTS Chip effects were effectively removed from Sample Two but not Sample One. In addition, dramatic differences were observed between the two sets of DMP results. After "removing" batch effects with ComBat, Sample One had 94,191 probes differentially methylated at a q-value threshold of 0.05, while Sample Two had zero differentially methylated probes. The disparate results likely arise from the confounding of lean/obese status with chip and row batch effects in Sample One. CONCLUSION Even the best possible statistical adjustments for batch effects may not completely remove them. Proper study design is vital for guarding against spurious findings due to such effects.
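The principal-component check described, testing whether the top components track the batches, can be sketched on synthetic data (illustrative only, not the study's pipeline; the sample sizes and effect size here are made up):

```python
import numpy as np

def top_pc_scores(X, k=2):
    """Scores on the top-k principal components via SVD of the centered matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * S[:k]

rng = np.random.default_rng(1)
batch = np.repeat([0, 1], 20)           # 20 samples per chip
X = rng.normal(size=(40, 100))          # 100 probes of pure noise
X[batch == 1] += 3.0                    # strong chip effect on every probe
pc = top_pc_scores(X)
r = np.corrcoef(pc[:, 0], batch)[0, 1]  # PC1 score vs. chip assignment
```

When a shift of this magnitude is present, |r| is close to 1: the leading component is dominated by the chip effect rather than any biological signal, which is exactly the symptom the authors test for with linear regression of the top PCs on batch.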
9
Proteomics analysis for finding serum markers of ovarian cancer. Biomed Res Int 2014; 2014:179040. [PMID: 25250314] [PMCID: PMC4164372] [DOI: 10.1155/2014/179040]
Abstract
A combination of peptide ligand library beads (PLLB) and 1D gel liquid chromatography-tandem mass spectrometry (1DGel-LC-MS/MS) was employed to analyze serum samples from patients with ovarian cancer and from healthy controls. Proteomic analysis identified 1200 serum proteins, among which 57 were upregulated and 10 downregulated in sera from cancer patients. Retinol binding protein 4 (RBP4) was highly upregulated in the ovarian cancer serum samples. ELISA was used to measure plasma concentrations of RBP4 in 80 samples from ovarian cancer patients, healthy individuals, myoma patients, and patients with benign ovarian tumors. Plasma concentrations of RBP4 in ovarian cancer patients, ranging from 76.91 to 120.08 ng/mL with a mean of 89.13 ± 1.67 ng/mL, were significantly higher than those in healthy individuals (10.85 ± 2.38 ng/mL). These results were further confirmed by immunohistochemistry, which showed that RBP4 expression levels were lower in normal ovarian tissue than in ovarian cancer tissue. Our results suggest that RBP4 is a potential biomarker for diagnostic screening of ovarian cancer.
10
Advances in ovarian cancer proteomics: the quest for biomarkers and improved therapeutic interventions. Expert Rev Proteomics 2014; 5:551-60. [DOI: 10.1586/14789450.5.4.551]
11
12
Technical aspects and inter-laboratory variability in native peptide profiling: The CE–MS experience. Clin Biochem 2013; 46:432-43. [DOI: 10.1016/j.clinbiochem.2012.09.025]
13
Abstract
Measurements from microarrays and other high-throughput technologies are susceptible to non-biological artifacts like batch effects. It is known that batch effects can alter or obscure the set of significant results and biological conclusions in high-throughput studies. Here we examine the impact of batch effects on predictors built from genomic technologies. To investigate batch effects, we collected publicly available gene expression measurements with known outcomes, and estimated batches using date. Using these data we show (1) the impact of batch effects on prediction depends on the correlation between outcome and batch in the training data, and (2) removing expression measurements most affected by batch before building predictors may improve the accuracy of those predictors. These results suggest that (1) training sets should be designed to minimize correlation between batches and outcome, and (2) methods for identifying batch-affected probes should be developed to improve prediction results for studies with high correlation between batches and outcome.
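The first finding, that prediction degrades when batch is correlated with outcome in the training data but not in the test data, can be reproduced with a toy simulation (hypothetical numbers, a single feature, and a simple mean-threshold classifier, not the authors' analysis):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
sig, bat, sd = 1.0, 2.0, 0.2       # disease effect, batch effect, noise SD

# Training set: batch perfectly confounded with outcome.
y_tr = np.repeat([0, 1], n)
b_tr = y_tr                         # every case sample run in batch 1
x_tr = sig * y_tr + bat * b_tr + rng.normal(0, sd, 2 * n)
thr = (x_tr[y_tr == 0].mean() + x_tr[y_tr == 1].mean()) / 2
acc_tr = ((x_tr > thr).astype(int) == y_tr).mean()   # near 1.0

# Test set: batch assigned independently of outcome.
y_te = np.repeat([0, 1], n)
b_te = np.tile([0, 1], n)
x_te = sig * y_te + bat * b_te + rng.normal(0, sd, 2 * n)
acc_te = ((x_te > thr).astype(int) == y_te).mean()   # near 0.5
```

The classifier looks nearly perfect on the confounded training data because it has largely learned the batch shift; once batch and outcome are decoupled, accuracy collapses toward chance, which is why the authors recommend designing training sets to minimize batch-outcome correlation.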
14
Detection of bladder cancer using proteomic profiling of urine sediments. PLoS One 2012; 7:e42452. [PMID: 22879988] [PMCID: PMC3411788] [DOI: 10.1371/journal.pone.0042452]
Abstract
We used protein expression profiles to develop a classification rule for the detection and prognostic assessment of bladder cancer in voided urine samples. Using the Ciphergen PBS II ProteinChip Reader, we analyzed the protein profiles of 18 pairs of samples of bladder tumor and adjacent urothelium tissue, a training set of 85 voided urine samples (32 controls and 53 bladder cancer), and a blinded testing set of 68 voided urine samples (33 controls and 35 bladder cancer). Using t-tests, we identified 473 peaks showing significant differential expression across different categories of paired bladder tumor and adjacent urothelial samples compared to normal urothelium. Then the intensities of those 473 peaks were examined in a training set of voided urine samples. Using this approach, we identified 41 protein peaks that were differentially expressed in both sets of samples. The expression pattern of the 41 protein peaks was used to classify the voided urine samples as malignant or benign. This approach yielded a sensitivity and specificity of 59% and 90%, respectively, on the training set and 80% and 100%, respectively, on the testing set. The proteomic classification rule performed with similar accuracy in low- and high-grade bladder carcinomas. In addition, we used hierarchical clustering with all 473 protein peaks on 65 benign voided urine samples, 88 samples from patients with clinically evident bladder cancer, and 127 samples from patients with a history of bladder cancer to classify the samples into Cluster A or B. The tumors in Cluster B were characterized by clinically aggressive behavior with significantly shorter metastasis-free and disease-specific survival.
15
Good mass spectrometry and its place in good science. J Mass Spectrom 2012; 47:795-809. [PMID: 22707172] [DOI: 10.1002/jms.3038]
Abstract
The mass spectrometry community has expanded as instruments became more powerful, user-friendly, affordable and readily available. This opens up opportunities for novice users to perform high impact research, using highly advanced instrumentation. This introductory tutorial is targeted at the novice user working in a research setting. It aims to offer the benefit of other people's experiences and to help newcomers avoid known pitfalls and problematic issues. It discusses some of the essential features of sound analytical chemistry and highlights the need to use validated analytical methods that provide high quality results along with a measure of their uncertainty. Examples are used to illustrate potential pitfalls and their consequences.
16
Integrative Analysis of N-Linked Human Glycoproteomic Data Sets Reveals PTPRF Ectodomain as a Novel Plasma Biomarker Candidate for Prostate Cancer. J Proteome Res 2012; 11:2653-65. [DOI: 10.1021/pr201200n]
17
High throughput profiling of serum phosphoproteins/peptides using the SELDI-TOF-MS platform. Methods Mol Biol 2012; 818:199-216. [PMID: 22083825] [DOI: 10.1007/978-1-61779-418-6_14]
Abstract
Protein phosphorylation is a dynamic post-translational modification that plays a critical role in the regulation of a wide spectrum of biological events and cellular functions, including signal transduction, gene expression, cell proliferation, and apoptosis. Determining the sites and magnitudes of protein phosphorylation has been an essential step in the analysis of the control of many biological systems. High-throughput analysis of protein phosphorylation provides a simple, logical, and useful tool for functional dissection and for prediction of biological functions and signaling pathways associated with these important molecular events. We have developed a functional proteomics technique using ProteinChip array-based SELDI-TOF-MS analysis for high-throughput profiling of phosphoproteins/phosphopeptides in human serum for the early detection and diagnosis, as well as the molecular staging, of human cancer. The methodology and experimental approach consist of four steps: (1) generation of a total peptide pool of serum proteins by global trypsin digestion; (2) rapid isolation of phosphopeptides from the total serum peptide pool by affinity selection, purification, and enrichment using a novel automated micro-bioprocessing system with phospho-antibody-conjugated paramagnetic beads and a hybrid magnet plate; (3) high-throughput phosphopeptide analysis on ProteinChip arrays by automated SELDI-TOF-MS; and (4) bioinformatics and statistical methods for data analysis. With appropriate modifications, this method may be equally applicable to serine-, threonine- and tyrosine-phosphorylated proteins, and to selectively isolating, profiling, and identifying phosphopeptides in highly complex phosphopeptide mixtures prepared from various human specimens such as cells, tissue samples, serum, and other body fluids.
18
Principles for the ethical analysis of clinical and translational research. Stat Med 2011; 30:2785-92. [PMID: 21751225] [PMCID: PMC4465206] [DOI: 10.1002/sim.4282]
Abstract
Statistical analysis is a cornerstone of the scientific method and evidence-based medicine, and statisticians serve an increasingly important role in clinical and translational research by providing objective evidence concerning the risks and benefits of novel therapeutics. Researchers rely on statistics and informatics as never before to generate and test hypotheses and to discover patterns of disease hidden within overwhelming amounts of data. Too often, clinicians and biomedical scientists are not adequately proficient in statistics to analyze data or interpret results, and statistical expertise may not be properly incorporated within the research process. We argue for the ethical imperative of statistical standards, and we present ten nontechnical principles that form a conceptual framework for the ethical application of statistics in clinical and translational research. These principles are drawn from the literature on the ethics of data analysis and the American Statistical Association Ethical Guidelines for Statistical Practice.
19
A roadmap for successful applications of clinical proteomics. Proteomics Clin Appl 2011; 5:241-7. [PMID: 21523915] [DOI: 10.1002/prca.201000096]
Abstract
Despite over 30,000 publications on proteomics in the last decade, and the accumulation of extensive and interesting observations on the human proteome, the clinical translation of proteomics has to date suffered major setbacks. I review here a roadmap for improving the success rate of clinical proteomics. The roadmap includes steps for improvements that need to be made in analytical tools, discovery, validation, clinical application, and post-clinical application appraisal. It is likely that most if not all of the components necessary for clinical success are either readily available or could be put in place with more rigorous research standards and the concerted efforts of the research community, clinicians, and health agencies. Enthusiasm for the clinical impact of proteomics may need to be tempered until robust evidence is obtained, but some clinical successes should eventually be feasible.
20
Abstract
The rapid advances in biotechnology have given rise to a variety of high-dimensional data. Many of these data, including DNA microarray data, mass spectrometry protein data, and high-throughput screening (HTS) assay data, are generated by complex experimental procedures that involve multiple steps such as sample extraction, purification and/or amplification, labeling, fragmentation, and detection. The quantity of interest is therefore not obtained directly, and a number of preprocessing procedures are necessary to convert the raw data into a format with biological relevance. This also makes exploratory data analysis and visualization essential steps to detect possible defects, anomalies or distortion of the data, to test underlying assumptions, and thus to ensure data quality. The characteristics of the data structure revealed in exploratory analysis often motivate decisions in preprocessing procedures to produce data suitable for downstream analysis. In this chapter we review common techniques for exploring and visualizing high-dimensional data and introduce the basic preprocessing procedures.
21
Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010; 11:733-9. [PMID: 20838408] [DOI: 10.1038/nrg2825]
Abstract
High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects (as well as other technical and biological artefacts) are widespread and critical to address. We review experimental and computational approaches for doing so.
22
The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotechnol 2010; 28:827-38. [PMID: 20676074] [DOI: 10.1038/nbt.1665]
Abstract
Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis.
23
Integrative proteomic analysis of serum and peritoneal fluids helps identify proteins that are up-regulated in serum of women with ovarian cancer. PLoS One 2010; 5:e11137. [PMID: 20559444] [PMCID: PMC2886122] [DOI: 10.1371/journal.pone.0011137]
Abstract
BACKGROUND We used intensive modern proteomics approaches to identify predictive proteins in ovarian cancer, identifying up-regulated proteins in both serum and peritoneal fluid. To evaluate the overall performance of the approach, we track the behavior of 20 validated markers across these experiments. METHODOLOGY Mass spectrometry-based quantitative proteomics following extensive protein fractionation was used to compare serum from women with serous ovarian cancer to serum from healthy women and women with benign ovarian tumors. Quantitation was achieved by isotopically labeling cysteine residues. Label-free mass spectrometry was used to compare peritoneal fluid taken from women with serous ovarian cancer and those with benign tumors. All data were integrated and annotated according to whether the proteins had previously been validated using antibody-based assays. FINDINGS We selected 54 quantified serum proteins and 358 peritoneal fluid proteins whose case-control differences exceeded a predefined threshold. Seventeen proteins were quantified in both materials, and 14 are extracellular. Of the 19 validated markers identified, all were found in cancer peritoneal fluid, and a subset of 7 were quantified in serum, with one of these proteins, IGFBP1, newly validated here. CONCLUSION Proteome profiling applied to symptomatic ovarian cancer cases identifies a large number of up-regulated serum proteins, many of which are or have been confirmed by immunoassays. The number of currently known validated markers is highest in peritoneal fluid, but they make up a higher percentage of the proteins observed in both serum and peritoneal fluid, suggesting that the 10 additional markers in this group may be high-quality candidates.
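Threshold-based selection of the kind described, keeping proteins whose case-control differences exceed a predefined cutoff, commonly takes the form of a log fold-change filter. A generic sketch (the cutoff, sample sizes, and intensities below are hypothetical, not the study's actual criterion):

```python
import numpy as np

def select_by_fold_change(case, control, log2_cutoff=1.0):
    """Indices of proteins with |log2(case mean / control mean)| >= cutoff.
    Rows = samples, columns = proteins; intensities assumed positive."""
    lfc = np.log2(case.mean(axis=0) / control.mean(axis=0))
    return np.flatnonzero(np.abs(lfc) >= log2_cutoff), lfc

rng = np.random.default_rng(3)
control = rng.uniform(1.0, 2.0, size=(10, 5))   # 10 controls x 5 proteins
case = control.copy()
case[:, [1, 4]] *= 4.0                          # proteins 1 and 4 up ~4-fold
idx, lfc = select_by_fold_change(case, control)  # selects proteins 1 and 4
```

A log2 cutoff of 1.0 corresponds to a two-fold change; in practice such a filter is usually paired with a significance test rather than used alone.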
24
Abstract
Proteomic profiling has the potential to impact the diagnosis, prognosis, and treatment of various diseases. A number of different proteomic technologies are available that allow us to look at many proteins at once, and all of them yield complex data that raise significant quantitative challenges. Inadequate attention to these quantitative issues can prevent these studies from achieving their desired goals, and can even lead to invalid results. In this chapter, we describe various ways the involvement of statisticians or other quantitative scientists in the study team can contribute to the success of proteomic research, and we outline some of the key statistical principles that should guide the experimental design and analysis of such studies.
25
Improved reporting of statistical design and analysis: guidelines, education, and editorial policies. Methods Mol Biol 2010; 620:563-98. [PMID: 20652522] [DOI: 10.1007/978-1-60761-580-4_22]
Abstract
A majority of original articles published in biomedical journals include some form of statistical analysis. Unfortunately, many of the articles contain errors in statistical design and/or analysis. These errors are worrisome, as the misuse of statistics jeopardizes the process of scientific discovery and the accumulation of scientific knowledge. To help avoid these errors and improve statistical reporting, four approaches are suggested: (1) development of guidelines for statistical reporting that could be adopted by all journals, (2) improvement in statistics curricula in biomedical research programs with an emphasis on hands-on teaching by biostatisticians, (3) expansion and enhancement of biomedical science curricula in statistics programs, and (4) increased participation of biostatisticians in the peer review process along with the adoption of more rigorous journal editorial policies regarding statistics. In this chapter, we provide an overview of these issues with emphasis on the field of molecular biology and highlight the need for continuing efforts on all fronts.
|
26
|
Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. Ann Appl Stat 2009. [DOI: 10.1214/09-aoas291] [Citation(s) in RCA: 213] [Impact Index Per Article: 14.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
27
|
Application of proteomics in ovarian cancer: Which sample should be used? Gynecol Oncol 2009; 115:497-503. [DOI: 10.1016/j.ygyno.2009.09.005] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2009] [Revised: 08/31/2009] [Accepted: 09/04/2009] [Indexed: 01/22/2023]
|
28
|
Abstract
In functional data classification, functional observations are often contaminated by various systematic effects, such as random batch effects caused by device artifacts, or fixed effects caused by sample-related factors. These effects may lead to classification bias and thus should not be neglected. Another issue of concern is the selection of functions when predictors consist of multiple functions, some of which may be redundant. The above issues arise in a real data application where we use fluorescence spectroscopy to detect cervical precancer. In this article, we propose a Bayesian hierarchical model that takes into account random batch effects and selects effective functions among multiple functional predictors. Fixed effects or predictors in nonfunctional form are also included in the model. The dimension of the functional data is reduced through orthonormal basis expansion or functional principal components. For posterior sampling, we use a hybrid Metropolis-Hastings/Gibbs sampler, which suffers from slow mixing. An evolutionary Monte Carlo algorithm is applied to improve the mixing. Simulation and real data application show that the proposed model provides accurate selection of functional predictors as well as good classification.
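The posterior sampling step described above relies on Metropolis-Hastings moves. As a minimal illustration of that building block only (not the paper's hierarchical model or its evolutionary Monte Carlo extension), the sketch below draws from a one-dimensional target density known up to a constant via a random-walk proposal; the target, step size, and chain length are illustrative assumptions.

```python
import math
import random

def metropolis_hastings(log_post, x0, n_steps, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings: sample from a density known
    only up to a normalizing constant via its log-posterior."""
    rng = random.Random(seed)
    x, lp = x0, log_post(x0)
    samples = []
    for _ in range(n_steps):
        x_prop = x + rng.gauss(0.0, step)   # symmetric (Gaussian) proposal
        lp_prop = log_post(x_prop)
        # Accept with probability min(1, posterior ratio); on log scale
        # this compares log(u) against the log-posterior difference.
        if math.log(rng.random()) < lp_prop - lp:
            x, lp = x_prop, lp_prop
        samples.append(x)
    return samples

# Toy target: a standard normal, log-density -x^2/2 up to a constant.
draws = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=20000)
burned = draws[5000:]                       # discard burn-in
mean = sum(burned) / len(burned)            # posterior mean estimate
```

The small proposal step is what causes the slow mixing the abstract mentions: successive draws are highly correlated, which is why schemes such as evolutionary Monte Carlo, running multiple chains with exchange moves, are used to traverse the posterior more efficiently.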
|
29
|
Use of ProteinChip technology for identifying biomarkers of parasitic diseases: the example of porcine cysticercosis (Taenia solium). Exp Parasitol 2008; 120:320-9. [PMID: 18823977 DOI: 10.1016/j.exppara.2008.08.013] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2008] [Revised: 08/19/2008] [Accepted: 08/21/2008] [Indexed: 01/06/2023]
Abstract
Taenia solium cysticercosis is a significant public health problem in endemic countries. The current serodiagnostic techniques are not able to differentiate between infections with viable cysts and infections with degenerated cysts. The objectives of this study were to identify specific novel biomarkers of these different disease stages in the serum of experimentally infected pigs using ProteinChip technology (Bio-Rad) and to validate these biomarkers by analyzing serum samples from naturally infected pigs. In the experimental sample set 30 discriminating biomarkers (p<0.05) were found, 13 specific for the viable phenotype, 9 specific for the degenerated phenotype and 8 specific for the infected phenotype (either viable or degenerated cysts). Only 3 of these biomarkers were also significant in the field samples; however, the peak profiles were not consistent among the two sample sets. Five biomarkers discovered in the sera from experimentally infected pigs were identified as clusterin, lecithin-cholesterol acyltransferase, vitronectin, haptoglobin and apolipoprotein A-I.
|
30
|
Abstract
Proteomics, the large-scale study of protein expression in organisms, offers the potential to evaluate global changes in protein expression and their post-translational modifications that take place in response to normal or pathological stimuli. One challenge has been the requirement for substantial amounts of tissue in order to perform comprehensive proteomic characterization. In heterogeneous tissues, such as brain, this has limited the application of proteomic methodologies. Efforts to adapt standard methods of tissue sampling, protein extraction, arraying, and identification are reviewed, with an emphasis on those appropriate to smaller samples ranging in size from several microliters down to single cells. The effects of miniaturization on these analyses are highlighted using neuroscience-related examples, as are statistical issues unique to the high-dimensional datasets generated by proteomic experiments.
|
31
|
Abstract
Surface-enhanced laser desorption/ionization in conjunction with mass spectrometry (SELDI-MS) has generated considerable interest, speculation and controversy in the diagnosis, prognosis and therapeutic monitoring of cancer, and offers an attractive approach to cancer biomarker discovery from tissues and biological fluids. This technology utilises a combination of mass spectrometry and chromatography to facilitate protein profiling of complex biological mixtures. Compared to some other more traditional proteomic platforms, such as 2D polyacrylamide gel electrophoresis, it has a high-throughput capability and can resolve low-mass proteins. However, a considerable number of challenging issues related to the design of studies, including reproducibility, sensitivity, specificity, and variation in sample collection, processing and storage, have been reported as problematic with this technology; albeit some of these concerns could perhaps also be levelled against other proteomic approaches that have attempted to address complex protein mixtures, such as plasma. Applications, successes and limitations of SELDI-MS in both clinical and basic science arenas will be reviewed in this article.
|
32
|
Cancer biomarker discovery via low molecular weight serum proteome profiling - Where is the tumor? Proteomics Clin Appl 2007; 1:1545-58. [PMID: 21136654 DOI: 10.1002/prca.200700141] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2007] [Indexed: 11/11/2022]
Abstract
Time-course analyses of rapidly processed serum performed in parallel by SELDI and nanoscale LC-MS/MS have revealed the temporal correlation of several literature-based disease markers with ex vivo driven events, such that their in vivo existence in healthy subjects is questionable. Identification by MS/MS reveals these putative biomarkers to be byproducts of the coagulation cascade and platelet activation, and suggests plasmatic analysis may be preferred. In a pilot plasmatic study, a cohort of naïve prostate cancer (PCa) samples were uniformly distinguished from their age-matched controls (n = 20) on the basis of multiple peptidic components; most notably by a derivative of complement C4 at 1863 m/z (GLEEELQFSLGSKINVK, C4(1353-1369)). The fully tryptic nature of this and other putative PCa discriminants is consistent with the cleavage specificity of common blood proteases and questions the need for tumor-derived proteolytic activities, as has been proposed. In light of the known correlation of dysregulated hemostasis with malignant disease, we suggest the underlying differentiating phenomena in these types of analyses may lie in the temporal disparity of sample activation, such that the case (patient) samples are preactivated while the control samples are not.
|
33
|
Abstract
Proteomics holds the promise of evaluating global changes in protein expression and post-translational modification in response to environmental stimuli. However, difficulties in achieving cellular anatomic resolution and extracting specific types of proteins from cells have limited the efficacy of these techniques. Laser capture microdissection has provided a solution to the problem of anatomical resolution in tissues. New extraction methodologies have expanded the range of proteins identified in subsequent analyses. This review will examine the application of laser capture microdissection to proteomic tissue sampling, and subsequent extraction of these samples for differential expression analysis. Statistical and other quantitative issues important for the analysis of the highly complex datasets generated are also reviewed.
|
34
|
How to improve reliability and efficiency of research about molecular markers: roles of phases, guidelines, and study design. J Clin Epidemiol 2007; 60:1205-19. [PMID: 17998073 DOI: 10.1016/j.jclinepi.2007.04.020] [Citation(s) in RCA: 135] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2006] [Revised: 04/01/2007] [Accepted: 04/12/2007] [Indexed: 11/29/2022]
Abstract
BACKGROUND AND OBJECTIVE The search for molecular markers for cancer, using "discovery-based" techniques, has resulted in claims of a very high degree of discrimination both for cancer diagnosis (e.g., serum proteomics patterns) and prognosis (e.g., RNA expression genomic signatures). However, many promising initial results have been found to be unreliable or not reproducible, and the larger process of discovery can seem slow and inefficient. To improve the process of developing molecular markers, proposals to use "phases" and "guidelines" have been made, based on experience with the process of drug development and randomized controlled clinical trials. The objective is to help improve the reliability and efficiency of development of molecular markers for cancer diagnosis. STUDY DESIGN AND SETTING The literature was searched to identify important current problems (in serum proteomics for cancer diagnosis and RNA expression genomics for cancer prognosis), and the roles of tools ("phases," "guidelines," and "study design") in addressing those problems are considered. Based on lessons learned, approaches for the future are discussed, some of which may seem "radical" compared with drug development. RESULTS Phases identify and organize questions to be addressed by individual studies. Guidelines identify features of design and conduct to be reported so that each study's reliability can be judged. Study design involves the myriad details and choices involved in the actual planning and conduct of a study, and is most important in the sense of determining whether a study is reliable or not. Studies that are unreliable, because of problems from chance and bias, constitute a major current problem leading to inflated expectations, wasted effort, and inefficiency in the larger process of development. By considering fundamental principles, it may be possible to identify approaches that differ from those used in drug development while preserving reliability and efficiency. CONCLUSION Phases and guidelines have important roles, but issues in study design address the fundamental problems that compromise reliability and efficiency. Tools to study markers are underdeveloped and will evolve over time, perhaps to include seemingly radical approaches.
|
35
|
|
36
|
Abstract
The field of proteomics is developing at a rapid pace in the post-genome era. Translational proteomics investigations aim to apply a combination of established methods and new technologies to learn about protein expression profiles predictive of clinical events, therapeutic response, and underlying mechanisms. However, in contrast to genetic studies and in parallel with gene expression studies, the dynamic nature of the proteome in conjunction with the challenges of accounting for post-translational modifications requires the translational proteomics investigator to understand the strengths and limitations of proteomics approaches. In this review, we provide an overview of proteomics approaches and techniques, and proteomics informatics for clinical transplantation investigators. We also review recent publications pertaining to transplantation proteomics, and discuss the implications and utility of urine proteomics for non-invasive investigation of transplant outcomes.
Key Words
- transplantation
- proteomics
- surface-enhanced laser desorption ionization
- matrix-assisted laser desorption ionization
- mass spectrometry
- biomarkers
- CAN, chronic allograft nephropathy
- CID, collision-induced dissociation
- ELISA, enzyme-linked immunosorbent assay
- ESI, electrospray ionization
- FT, Fourier transform
- GBM, glomerular basement membrane
- GVHD, graft vs. host disease
- HPLC, high performance liquid chromatography
- HSCT, hematopoietic stem cell transplantation
- IMPDH, inosine monophosphate dehydrogenase
- LC, liquid chromatography
- MALDI, matrix-assisted laser desorption ionization
- MS, mass spectrometry
- MS/MS, tandem mass spectrometry
- m/z, mass-to-charge ratio
- PAGE, polyacrylamide gel electrophoresis
- SDS, sodium dodecyl sulfate
- SELDI, surface-enhanced laser desorption ionization
- SCT, stem cell transplantation
- TOF, time of flight
|
37
|
|
38
|
Abstract
Claims that molecular markers can accurately diagnose cancer have recently been disputed; some prominent results have not been reproduced and bias has been proposed to explain the original observations. As new '-omics' fields are explored to assess molecular markers for cancer, bias will increasingly be recognized as the most important 'threat to validity' that must be addressed in the design, conduct and interpretation of such research.
|