1. Van Houtven J, Cuypers B, Meysman P, Hooyberghs J, Laukens K, Valkenborg D. Constrained Standardization of Count Data from Massive Parallel Sequencing. J Mol Biol 2021; 433:166966. PMID: 33794260. DOI: 10.1016/j.jmb.2021.166966. Received: 09/30/2020; Revised: 02/26/2021; Accepted: 03/23/2021.
Abstract
In high-throughput omics disciplines like transcriptomics, researchers face a need to assess the quality of an experiment prior to an in-depth statistical analysis. To efficiently analyze such voluminous collections of data, researchers need triage methods that are both quick and easy to use. Such a normalization method for relative quantitation, CONSTANd, was recently introduced for isobarically-labeled mass spectra in proteomics. It transforms the data matrix of abundances through an iterative, convergent process enforcing three constraints: (I) identical column sums; (II) each row sum is fixed (across matrices) and (III) identical to all other row sums. In this study, we investigate whether CONSTANd is suitable for count data from massively parallel sequencing, by qualitatively comparing its results to those of DESeq2. Further, we propose an adjustment of the method so that it may be applied to identically balanced but differently sized experiments for joint analysis. We find that CONSTANd can process large data sets at well over 1 million count records per second whilst mitigating unwanted systematic bias and thus quickly uncovering the underlying biological structure when combined with a PCA plot or hierarchical clustering. Moreover, it allows joint analysis of data sets obtained from different batches, with different protocols and from different labs but without exploiting information from the experimental setup other than the delineation of samples into identically processed sets (IPSs). CONSTANd's simplicity and applicability to proteomics as well as transcriptomics data make it an interesting candidate for integration in multi-omics workflows.
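The alternating rescaling behind such a constrained standardization can be illustrated with a minimal Python sketch. It is not the authors' R/BioConductor or Python implementation: the function name `constrained_standardize`, the target mean of 1, the tolerance, and the iteration cap are assumptions made for illustration, and the input is assumed to be a strictly positive matrix with features as rows and samples as columns. Only a single matrix is handled, so constraint (II) on row sums across matrices is not modeled here.

```python
import numpy as np


def constrained_standardize(counts, target=1.0, tol=1e-6, max_iter=100):
    """Alternately rescale rows and columns until all row and column means match `target`.

    Hypothetical helper: assumes a strictly positive 2-D matrix
    (rows = features, columns = samples).
    """
    x = np.asarray(counts, dtype=float)
    if x.ndim != 2 or np.any(x <= 0):
        raise ValueError("expected a strictly positive 2-D matrix")
    for _ in range(max_iter):
        # Row step: force every row mean to the target (identical row sums).
        x = x * (target / x.mean(axis=1, keepdims=True))
        # Stop once the column means are also within tolerance of the target.
        col_means = x.mean(axis=0, keepdims=True)
        if np.max(np.abs(col_means - target)) < tol:
            break
        # Column step: force every column mean to the target (identical column sums).
        x = x * (target / col_means)
    return x


example = np.array([[10, 20, 30], [5, 5, 20], [8, 2, 2], [1, 4, 9]])
normalized = constrained_standardize(example)
print(normalized.mean(axis=0))  # column means, all close to the target
print(normalized.mean(axis=1))  # row means, all close to the target
```

Real count data contain zeros, which this sketch rejects; how zeros and missing values are treated in practice is left open here.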
Affiliation(s)
- Joris Van Houtven
- Flemish Institute for Technological Research (VITO), Boeretang 200, B-2400 Mol, Belgium; Universiteit Hasselt, Data Science Institute (DSI), Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Agoralaan, Diepenbeek BE 3590, Belgium; Universiteit Antwerpen, Centre for Proteomics, Groenenborgerlaan 171, Antwerpen BE 2020, Belgium.
- Bart Cuypers
- Universiteit Antwerpen, Biomedical Informatics Network Antwerp (Biomina), Middelheimlaan 1, Antwerpen BE 2020, Belgium; Molecular Parasitology Unit, Institute of Tropical Medicine, Nationalestraat 155, Antwerpen BE 2020, Belgium; Universiteit Antwerpen, Adrem Data Lab, Department of Computer Sciences, Middelheimlaan 1, Antwerpen BE 2020, Belgium
- Pieter Meysman
- Universiteit Antwerpen, Biomedical Informatics Network Antwerp (Biomina), Middelheimlaan 1, Antwerpen BE 2020, Belgium; Universiteit Antwerpen, Adrem Data Lab, Department of Computer Sciences, Middelheimlaan 1, Antwerpen BE 2020, Belgium
- Jef Hooyberghs
- Flemish Institute for Technological Research (VITO), Boeretang 200, B-2400 Mol, Belgium; Universiteit Hasselt, Data Science Institute (DSI), Theoretical Physics, Agoralaan, Diepenbeek BE 3590, Belgium
- Kris Laukens
- Universiteit Antwerpen, Biomedical Informatics Network Antwerp (Biomina), Middelheimlaan 1, Antwerpen BE 2020, Belgium; Universiteit Antwerpen, Adrem Data Lab, Department of Computer Sciences, Middelheimlaan 1, Antwerpen BE 2020, Belgium
- Dirk Valkenborg
- Universiteit Hasselt, Data Science Institute (DSI), Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Agoralaan, Diepenbeek BE 3590, Belgium; Universiteit Antwerpen, Centre for Proteomics, Groenenborgerlaan 171, Antwerpen BE 2020, Belgium.
2. Van Houtven J, Hooyberghs J, Laukens K, Valkenborg D. CONSTANd: An Efficient Normalization Method for Relative Quantification in Small- and Large-Scale Omics Experiments in R BioConductor and Python. J Proteome Res 2021; 20:2151-2156. PMID: 33703904. DOI: 10.1021/acs.jproteome.0c00977.
Abstract
For differential expression studies in all omics disciplines, data normalization is a crucial step that is often subject to a balance between speed and effectiveness. To keep up with the data produced by high-throughput instruments, researchers require fast and easy-to-use yet effective methods that fit into automated analysis pipelines. The CONSTANd normalization method meets these criteria, so we have made its source code available for R/BioConductor and Python. We briefly review the method and demonstrate how it can be used in different omics contexts for experiments of any scale. Widespread adoption across omics disciplines would ease data integration in multiomics experiments.
Affiliation(s)
- Joris Van Houtven
- Flemish Institute for Technological Research (VITO), Boeretang 200, B-2400 Mol, Belgium; Data Science Institute (DSI), Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Universiteit Hasselt, Agoralaan, Diepenbeek 3590, Belgium; Adrem Data Lab, Department of Computer Sciences, Universiteit Antwerpen, Middelheimlaan 1, Antwerpen 2020, Belgium
- Jef Hooyberghs
- Flemish Institute for Technological Research (VITO), Boeretang 200, B-2400 Mol, Belgium; Data Science Institute (DSI), Theoretical Physics, Universiteit Hasselt, Agoralaan, Diepenbeek 3590, Belgium
- Kris Laukens
- Biomedical Informatics Network Antwerp (Biomina), Universiteit Antwerpen, Middelheimlaan 1, Antwerpen 2020, Belgium; Adrem Data Lab, Department of Computer Sciences, Universiteit Antwerpen, Middelheimlaan 1, Antwerpen 2020, Belgium
- Dirk Valkenborg
- Data Science Institute (DSI), Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Universiteit Hasselt, Agoralaan, Diepenbeek 3590, Belgium; Centre for Proteomics, Universiteit Antwerpen, Groenenborgerlaan 171, Antwerpen 2020, Belgium
3. Van Houtven J, Boonen K, Baggerman G, Askenazi M, Laukens K, Hooyberghs J, Valkenborg D. PRiSM: A prototype for exhaustive, restriction-free database searching for mass spectrometry-based identification. Rapid Commun Mass Spectrom 2020:e8962. PMID: 33009686. DOI: 10.1002/rcm.8962. Received: 07/31/2020; Revised: 09/28/2020; Accepted: 09/30/2020.
Abstract
RATIONALE: The current methods for identifying peptides in mass spectral product ion data still struggle to do so for the majority of spectra. Based on the experimental setup and other assumptions, such methods restrict the search space to speed up computations, but at the cost of creating blind spots. The proteomics community would greatly benefit from a method that is capable of covering the entire search space without using any restrictions, thus establishing a baseline for identification.
METHODS: We conceived the "mass pattern paradigm" (MPP) that enables the creation of such an identification method, and we implemented it into a prototype database search engine "PRiSM" (PRotein-Spectrum Matching). We then assessed its operational characteristics by applying it to publicly available high-precision mass spectra of low and high identification difficulty. We used those characteristics to gain theoretical insights into trade-offs between sensitivity and speed when trying to establish a baseline for identification.
RESULTS: Of 100 low difficulty spectra, PRiSM and SEQUEST agree on 84 identifications (of which 75 are statistically significant). Of 15 of 100 spectra not identified in a previous study (using SEQUEST), 13 are considered reliable after visual inspection and represent 3 proteins (out of 9 in total) not detected previously.
CONCLUSIONS: Despite leaving noise intact, the simple PRiSM prototype can make statistically reliable identifications, while controlling the false discovery rate by fitting a null distribution. It also identifies some spectra previously unidentifiable in an "extremely open" SEQUEST search, paving the way to establishing a baseline for identification in proteomics.
Affiliation(s)
- Joris Van Houtven
- Flemish Institute for Technological Research (VITO), Boeretang 200, Mol, Belgium
- Kurt Boonen
- Universiteit Hasselt, Data Science Institute (DSI), Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Diepenbeek, Belgium
- Geert Baggerman
- Universiteit Antwerpen, Centre for Proteomics, Antwerp, Belgium
- Kris Laukens
- Universiteit Antwerpen, Biomedical Informatics Network Antwerp (Biomina), Antwerp, Belgium
- Jef Hooyberghs
- ADReM Data Lab, Department of Computer Sciences, Universiteit Antwerpen, Antwerp, Belgium
- Dirk Valkenborg
- Universiteit Hasselt, Data Science Institute (DSI), Theoretical Physics, Diepenbeek, Belgium
4. Agten A, Van Houtven J, Askenazi M, Burzykowski T, Laukens K, Valkenborg D. Visualizing the agreement of peptide assignments between different search engines. J Mass Spectrom 2020; 55:e4471. PMID: 31713933. DOI: 10.1002/jms.4471. Received: 07/04/2019; Revised: 10/23/2019; Accepted: 10/28/2019.
Abstract
There is a trend in the analysis of shotgun proteomics data that aims to combine information from multiple search engines to increase the number of peptide annotations in an experiment. Typically, the degree of search engine complementarity and search engine agreement is visually illustrated by means of Venn diagrams that present the findings of a database search on the level of the nonredundant peptide annotations. We argue that this practice is not fit for purpose, since the diagrams do not take into account, and often conceal, the information on complementarity and agreement at the level of the spectrum identification. We promote a new type of visualization that provides insight into the peptide sequence agreement at the level of the peptide-spectrum match (PSM) as a measure of consensus between two search engines with nominal outcomes. We applied the visualizations and percentage sequence agreement to an in-house data set of our benchmark organism, Caenorhabditis elegans, and illustrated that when assessing the agreement between search engines, one should disentangle the notion of PSM confidence and PSM identity. The visualizations presented in this manuscript provide a more informative assessment of pairs of search engines and are made available as an R function in the Supporting Information.
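The difference between peptide-level overlap and PSM-level agreement can be made concrete with a small Python sketch. This is not the R function from the Supporting Information: the function names, the dictionary input (spectrum identifier mapped to assigned peptide sequence), and the definition of agreement as the share of jointly annotated spectra with identical sequences are all assumptions made for illustration.

```python
def peptide_level_overlap(engine_a, engine_b):
    """Number of unique peptide sequences reported by both engines (the Venn-diagram view)."""
    return len(set(engine_a.values()) & set(engine_b.values()))


def psm_sequence_agreement(engine_a, engine_b):
    """Percentage of jointly annotated spectra that received the same sequence.

    Returns (percentage, number of shared spectra); hypothetical definition.
    """
    shared = engine_a.keys() & engine_b.keys()
    if not shared:
        return 0.0, 0
    same = sum(1 for spectrum in shared if engine_a[spectrum] == engine_b[spectrum])
    return 100.0 * same / len(shared), len(shared)


# Toy assignments with made-up spectrum identifiers and sequences.
a = {"s1": "PEPTIDEK", "s2": "ELVISK", "s3": "LIVESK"}
b = {"s1": "PEPTIDEK", "s2": "LIVESK", "s3": "ELVISK", "s4": "AAAK"}

print(peptide_level_overlap(a, b))   # all three peptides of engine A also appear for engine B
print(psm_sequence_agreement(a, b))  # yet only one of the three shared spectra agrees
```

In this toy case the peptide-level view suggests full overlap, while the spectrum-level view shows that two of three shared spectra were assigned different sequences.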
Affiliation(s)
- Annelies Agten
- Interuniversity Institute of Biostatistics and Statistical Bioinformatics, Hasselt University, Hasselt, Belgium
- Joris Van Houtven
- Interuniversity Institute of Biostatistics and Statistical Bioinformatics, Hasselt University, Hasselt, Belgium
- UA-VITO Center for Proteomics, University of Antwerp, Antwerp, Belgium
- Applied Bio and Molecular Systems, Flemish Institute for Technological Research (VITO), Mol, Belgium
- Tomasz Burzykowski
- Interuniversity Institute of Biostatistics and Statistical Bioinformatics, Hasselt University, Hasselt, Belgium
- Kris Laukens
- Adrem Data Lab, Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
- Biomedical Informatics Network Antwerp (biomina), University of Antwerp, Antwerp, Belgium
- Dirk Valkenborg
- Interuniversity Institute of Biostatistics and Statistical Bioinformatics, Hasselt University, Hasselt, Belgium
- UA-VITO Center for Proteomics, University of Antwerp, Antwerp, Belgium
- Applied Bio and Molecular Systems, Flemish Institute for Technological Research (VITO), Mol, Belgium
5. Van Houtven J, Agten A, Boonen K, Baggerman G, Hooyberghs J, Laukens K, Valkenborg D. QCQuan: A Web Tool for the Automated Assessment of Protein Expression and Data Quality of Labeled Mass Spectrometry Experiments. J Proteome Res 2019; 18:2221-2227. PMID: 30942071. DOI: 10.1021/acs.jproteome.9b00072.
Abstract
In the context of omics disciplines and especially proteomics and biomarker discovery, the analysis of a clinical sample using label-based tandem mass spectrometry (MS) can be affected by sample preparation effects or by the measurement process itself, resulting in an incorrect outcome. Detection and correction of these mistakes using state-of-the-art methods based on mixed models can use large amounts of (computing) time. MS-based proteomics laboratories are high-throughput and need to avoid a bottleneck in their quantitative pipeline by quickly discriminating between high- and low-quality data. To this end we developed an easy-to-use web tool called QCQuan (available at qcquan.net), which is built around the CONSTANd normalization algorithm. It automatically provides the user with exploratory and quality control information as well as a differential expression analysis based on conservative, simple statistics. In this document we describe in detail the scientifically relevant steps that constitute the workflow and assess its qualitative and quantitative performance on three reference data sets. We find that QCQuan provides clear and accurate indications about the scientific value of both a high- and a low-quality data set. Moreover, it performed quantitatively better on a third data set than a comparable workflow assembled using established, reliable software.
Affiliation(s)
- Joris Van Houtven
- VITO NV, Applied Bio & molecular Systems, Boeretang 200, Mol 2400, Belgium; Universiteit Hasselt, Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Agoralaan, Diepenbeek 3590, Belgium; Universiteit Antwerpen, Centre for Proteomics, Groenenborgerlaan 171, Antwerpen 2020, Belgium
- Annelies Agten
- Universiteit Hasselt, Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Agoralaan, Diepenbeek 3590, Belgium
- Kurt Boonen
- VITO NV, Applied Bio & molecular Systems, Boeretang 200, Mol 2400, Belgium; Universiteit Antwerpen, Centre for Proteomics, Groenenborgerlaan 171, Antwerpen 2020, Belgium
- Geert Baggerman
- VITO NV, Applied Bio & molecular Systems, Boeretang 200, Mol 2400, Belgium; Universiteit Antwerpen, Centre for Proteomics, Groenenborgerlaan 171, Antwerpen 2020, Belgium
- Jef Hooyberghs
- VITO NV, Applied Bio & molecular Systems, Boeretang 200, Mol 2400, Belgium; Universiteit Hasselt, Theoretical Physics, Agoralaan, Diepenbeek 3590, Belgium
- Kris Laukens
- Universiteit Antwerpen, Biomedical Informatics Research Center Antwerp (Biomina), Middelheimlaan 1, Antwerpen 2020, Belgium; Universiteit Antwerpen, Advanced Database Research and Modelling (ADReM), Department of Mathematics & Computer Sciences, Middelheimlaan 1, Antwerpen 2020, Belgium
- Dirk Valkenborg
- VITO NV, Applied Bio & molecular Systems, Boeretang 200, Mol 2400, Belgium; Universiteit Hasselt, Interuniversity Institute for Biostatistics and Statistical Bioinformatics (I-BioStat), Agoralaan, Diepenbeek 3590, Belgium; Universiteit Antwerpen, Centre for Proteomics, Groenenborgerlaan 171, Antwerpen 2020, Belgium
6.
Abstract
In differential peptidomics, peptide profiles are compared between biological samples and the resulting expression levels are correlated to a phenotype of interest. This, in turn, allows us insight into how peptides may affect the phenotype of interest. In quantitative differential peptidomics, both label-based and label-free techniques are often employed. Label-based techniques have several advantages over label-free methods, primarily that labels allow for various samples to be pooled prior to liquid chromatography-mass spectrometry (LC-MS) analysis, reducing between-run variation. Here, we detail a method for performing quantitative peptidomics using stable amine-binding isotopic and isobaric tags.
Affiliation(s)
- Kurt Boonen
- Research Group of Functional Genomics and Proteomics, Department of Biology, KU Leuven, Leuven, Belgium
- Wouter De Haes
- Research Group of Functional Genomics and Proteomics, Department of Biology, KU Leuven, Leuven, Belgium
- Research Group of Molecular and Functional Neurobiology, Department of Biology, KU Leuven, Leuven, Belgium
- Joris Van Houtven
- Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics Department, KU Leuven, Leuven, Belgium
- Rik Verdonck
- Research Group of Molecular Developmental Physiology and Signal Transduction, Department of Biology, KU Leuven, Leuven, Belgium
- Geert Baggerman
- Center for Proteomics, University of Antwerp, Antwerp, Belgium
- Dirk Valkenborg
- Center for Proteomics, University of Antwerp, Antwerp, Belgium
- Interuniversity Institute for Biostatistics and Statistical Bioinformatics, Hasselt University, Diepenbeek, Belgium
- Liliane Schoofs
- Research Group of Functional Genomics and Proteomics, Department of Biology, KU Leuven, Leuven, Belgium.