1
Sakthivel K, Lal SB, Srivastava S, Chaturvedi KK, Khan YJ, Mishra DC, Madival SD, Vaidhyanathan R, Jha GK. A Statistical Approach for Identifying the Best Combination of Normalization and Imputation Methods for Label-Free Proteomics Expression Data. J Proteome Res 2025; 24:158-170. [PMID: 39659155 DOI: 10.1021/acs.jproteome.4c00552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2024]
Abstract
Label-free proteomics expression data sets often exhibit data heterogeneity and missing values, necessitating the development of effective normalization and imputation methods. The selection of appropriate normalization and imputation methods is inherently data-specific, and choosing the optimal approach from the available options is critical for ensuring robust downstream analysis. This study aimed to identify the most suitable combination of these methods for quality control and accurate identification of differentially expressed proteins. In this study, we developed nine combinations by integrating three normalization methods, locally weighted linear regression (LOESS), variance stabilization normalization (VSN), and robust linear regression (RLR) with three imputation methods: k-nearest neighbors (k-NN), local least-squares (LLS), and singular value decomposition (SVD). We utilized statistical measures, including the pooled coefficient of variation (PCV), pooled estimate of variance (PEV), and pooled median absolute deviation (PMAD), to assess intragroup and intergroup variation. The combinations yielding the lowest values corresponding to each statistical measure were chosen as the data set's suitable normalization and imputation methods. The performance of this approach was tested using two spiked-in standard label-free proteomics benchmark data sets. The identified combinations returned a low NRMSE and showed better performance in identifying spiked-in proteins. The developed approach can be accessed through the R package named 'lfproQC' and a user-friendly Shiny web application (https://dabiniasri.shinyapps.io/lfproQC and http://omics.icar.gov.in/lfproQC), making it a valuable resource for researchers looking to apply this method to their data sets.
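The pooled measures named in the abstract can be sketched as follows. This is a minimal illustration, not the lfproQC implementation: it assumes a features × samples matrix with no missing values and nonzero feature means, takes PCV and PMAD as the average over groups of the mean per-feature coefficient of variation and median absolute deviation, and takes PEV as the degrees-of-freedom-weighted average of the mean per-feature variance. The function name `pooled_measures` and its arguments are hypothetical.

```python
import numpy as np

def pooled_measures(x, groups):
    """Pooled intragroup variation measures (assumed definitions, see lead-in).

    x      : features x samples matrix of (already imputed) intensities
    groups : one group label per sample (column)
    """
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    cvs, variances, mads, dofs = [], [], [], []
    for g in np.unique(groups):
        block = x[:, groups == g]                      # samples of one group
        mean = block.mean(axis=1)
        sd = block.std(axis=1, ddof=1)
        median = np.median(block, axis=1, keepdims=True)
        mad = np.median(np.abs(block - median), axis=1)
        cvs.append(np.mean(sd / mean))                 # mean per-feature CV
        variances.append(np.mean(sd ** 2))             # mean per-feature variance
        mads.append(np.mean(mad))                      # mean per-feature MAD
        dofs.append(block.shape[1] - 1)
    dofs = np.asarray(dofs, dtype=float)
    return {
        "PCV": float(np.mean(cvs)),
        "PEV": float(np.sum(dofs * np.asarray(variances)) / dofs.sum()),
        "PMAD": float(np.mean(mads)),
    }
```

Under these assumptions, each normalization and imputation combination would be scored with `pooled_measures`, and the combination with the lowest value of a given measure would be retained for that measure.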
Affiliation(s)
- Kabilan Sakthivel
  - The Graduate School, ICAR-Indian Agricultural Research Institute, New Delhi 110012, India
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
- Shashi Bhushan Lal
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
- Sudhir Srivastava
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
- Krishna Kumar Chaturvedi
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
- Yasin Jeshima Khan
  - Division of Genomic Resources, ICAR-National Bureau of Plant Genetic Resources, New Delhi 110012, India
- Dwijesh Chandra Mishra
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
- Sharanbasappa D Madival
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
- Girish Kumar Jha
  - Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India
2
Schumann Y, Gocke A, Neumann JE. Computational Methods for Data Integration and Imputation of Missing Values in Omics Datasets. Proteomics 2025; 25:e202400100. [PMID: 39740174 DOI: 10.1002/pmic.202400100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2024] [Revised: 11/08/2024] [Accepted: 11/26/2024] [Indexed: 01/02/2025]
Abstract
Molecular profiling of different omic-modalities (e.g., DNA methylomics, transcriptomics, proteomics) in biological systems represents the basis for research and clinical decision-making. Measurement-specific biases, so-called batch effects, often hinder the integration of independently acquired datasets, and missing values further hamper the applicability of typical data processing algorithms. In addition to careful experimental design, well-defined standards in data acquisition and data exchange, the alleviation of these phenomena particularly requires a dedicated data integration and preprocessing pipeline. This review aims to give a comprehensive overview of computational methods for data integration and missing value imputation for omic data analyses. We provide formal definitions for missing value mechanisms and propose a novel statistical taxonomy for batch effects, especially in the presence of missing data. Based on an automated document search and systematic literature review, we describe 32 distinct data integration methods from five main methodological categories, as well as 37 algorithms for missing value imputation from five separate categories. Additionally, this review highlights multiple quantitative evaluation methods to aid researchers in selecting a suitable set of methods for their work. Finally, this work provides an integrated discussion of the relevance of batch effects and missing values in omics with corresponding method recommendations. We then propose a comprehensive three-step workflow from the study conception to final data analysis and deduce perspectives for future research. Eventually, we present a comprehensive flow chart as well as exemplary decision trees to aid practitioners in the selection of specific approaches for imputation and data integration in their studies.
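For orientation, the three missing value mechanisms are usually formalized as below. This is the standard textbook formulation, stated here as an assumption; the review's own notation may differ. The symbols are illustrative: $X = (X_{\mathrm{obs}}, X_{\mathrm{mis}})$ is the data split into observed and missing parts, $M$ is the missingness indicator, and $\phi$ denotes the parameters of the missingness process.

```latex
\begin{align*}
\text{MCAR:}\quad & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}, \phi) = P(M \mid \phi) \\
\text{MAR:}\quad  & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}, \phi) = P(M \mid X_{\mathrm{obs}}, \phi) \\
\text{MNAR:}\quad & P(M \mid X_{\mathrm{obs}}, X_{\mathrm{mis}}, \phi) \text{ depends on } X_{\mathrm{mis}}
\end{align*}
```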
Affiliation(s)
- Yannis Schumann
  - IT-Department, Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany
- Antonia Gocke
  - Center for Molecular Neurobiology (ZMNH), University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
  - Core Facility Mass Spectrometric Proteomics, University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
- Julia E Neumann
  - Center for Molecular Neurobiology (ZMNH), University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
  - Institute of Neuropathology, University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
3
Etourneau L, Fancello L, Wieczorek S, Varoquaux N, Burger T. Penalized likelihood optimization for censored missing value imputation in proteomics. Biostatistics 2024; 26:kxaf006. [PMID: 40120089 DOI: 10.1093/biostatistics/kxaf006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2024] [Revised: 01/31/2025] [Accepted: 02/03/2025] [Indexed: 03/25/2025]
Abstract
Label-free bottom-up proteomics using mass spectrometry and liquid chromatography has long been established as one of the most popular high-throughput analysis workflows for proteome characterization. However, it produces data hindered by complex and heterogeneous missing values, whose imputation has long remained problematic. To cope with this, we introduce Pirat, an algorithm that addresses this challenge using an original likelihood maximization strategy. Notably, it models the instrument limit by learning a global censoring mechanism from the data available. Moreover, it estimates the covariance matrix between enzymatic cleavage products (i.e., peptides or precursor ions), while offering a natural way to integrate complementary transcriptomic information when multi-omic assays are available. Our benchmarking on several datasets covering a variety of experimental designs (number of samples, acquisition mode, missingness patterns, etc.) and using a variety of metrics (differential analysis ground truth or imputation errors) shows that Pirat outperforms all pre-existing imputation methods. Beyond the interest of Pirat as an imputation tool, these results pinpoint the need for a paradigm change in proteomics imputation, as most pre-existing strategies could be boosted by incorporating similar models to account for the instrument censorship or for the correlation structures, either grounded to the analytical pipeline or arising from a multi-omic approach.
Affiliation(s)
- Lucas Etourneau
  - Univ. Grenoble Alpes, CNRS, CEA, INSERM, BGE UA13, ProFI FR2048, EDyP, Bâtiment 42b, CEA de Grenoble, 17 avenue des Martyrs, 38054 Grenoble Cedex 9, France
  - TIMC, Univ. Grenoble Alpes, CNRS, Grenoble INP, Laboratoire TIMC, Rond-Point de la Croix de Vie, 38700 La Tronche, France
- Laura Fancello
  - Univ. Grenoble Alpes, CNRS, CEA, INSERM, BGE UA13, ProFI FR2048, EDyP, Bâtiment 42b, CEA de Grenoble, 17 avenue des Martyrs, 38054 Grenoble Cedex 9, France
- Samuel Wieczorek
  - Univ. Grenoble Alpes, CNRS, CEA, INSERM, BGE UA13, ProFI FR2048, EDyP, Bâtiment 42b, CEA de Grenoble, 17 avenue des Martyrs, 38054 Grenoble Cedex 9, France
- Nelle Varoquaux
  - TIMC, Univ. Grenoble Alpes, CNRS, Grenoble INP, Laboratoire TIMC, Rond-Point de la Croix de Vie, 38700 La Tronche, France
- Thomas Burger
  - Univ. Grenoble Alpes, CNRS, CEA, INSERM, BGE UA13, ProFI FR2048, EDyP, Bâtiment 42b, CEA de Grenoble, 17 avenue des Martyrs, 38054 Grenoble Cedex 9, France
4
Ryan-Despraz J, Wissler A. Imputation methods for mixed datasets in bioarchaeology. Archaeological and Anthropological Sciences 2024; 16:187. [PMID: 39450370 PMCID: PMC11496361 DOI: 10.1007/s12520-024-02078-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Accepted: 09/16/2024] [Indexed: 10/26/2024]
Abstract
Missing data is a prevalent problem in bioarchaeological research and imputation could provide a promising solution. This work simulated missingness on a control dataset (481 samples × 41 variables) in order to explore imputation methods for mixed data (qualitative and quantitative data). The tested methods included Random Forest (RF), PCA/MCA, factorial analysis for mixed data (FAMD), hotdeck, predictive mean matching (PMM), random samples from observed values (RSOV), and a multi-method (MM) approach for the three missingness mechanisms (MCAR, MAR, and MNAR) at levels of 5%, 10%, 20%, 30%, and 40% missingness. This study also compared single imputation with an adapted multiple imputation method derived from the R package "mice". The results showed that the adapted multiple imputation technique always outperformed single imputation for the same method. The best performing methods were most often RF and MM, and other commonly successful methods were PCA/MCA and PMM multiple imputation. Across all criteria, the amount of missingness was the most important parameter for imputation accuracy. While this study found that some imputation methods performed better than others for the control dataset, each imputation method has advantages and disadvantages. Imputation remains a promising solution for datasets containing missingness; however, when making a decision, it is essential to consider dataset structure and research goals. Supplementary Information: The online version contains supplementary material available at 10.1007/s12520-024-02078-2.
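Simulating missingness under the three mechanisms can be sketched as below. This is a simplified illustration for a purely numeric matrix, not the study's procedure (which handled mixed qualitative and quantitative data). The function name `simulate_missing`, the choice of column 0 as the always-observed driver for MAR, and the left-censoring rule for MNAR are all assumptions made for the example.

```python
import numpy as np

def simulate_missing(data, rate, mechanism, rng):
    """Return a copy of `data` with NaNs inserted under the given mechanism."""
    x = np.array(data, dtype=float)  # copy, so the input is left untouched
    if mechanism == "MCAR":
        # every cell is dropped with the same probability
        mask = rng.random(x.shape) < rate
    elif mechanism == "MNAR":
        # the lowest values of each column are dropped (depends on the value itself)
        mask = x < np.quantile(x, rate, axis=0)
    elif mechanism == "MAR":
        # dropping depends on an observed variable: rows with a low value in
        # column 0 lose cells more often; column 0 itself is never masked
        driver = x[:, 0]
        low = driver <= np.median(driver)
        prob = np.where(low, 1.5 * rate, 0.5 * rate)
        mask = rng.random(x.shape) < prob[:, None]
        mask[:, 0] = False
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    x[mask] = np.nan
    return x
```

A benchmark in this spirit would call `simulate_missing` at each missingness level (for example 0.05 to 0.40), impute the result, and compare the imputed cells with the held-back true values.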
Affiliation(s)
- Amanda Wissler
  - Department of Anthropology, McMaster University, Hamilton, Canada
5
Chungnoy K, Tanantong T, Songmuang P. Missing value imputation on gene expression data using bee-based algorithm to improve classification performance. PLoS One 2024; 19:e0305492. [PMID: 39208345 PMCID: PMC11361674 DOI: 10.1371/journal.pone.0305492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2023] [Accepted: 05/28/2024] [Indexed: 09/04/2024]
Abstract
Existing missing value imputation methods focus on recovering the actual values so that the completed datasets can serve as input for machine learning tasks. This work instead proposes imputing missing values to improve classification accuracy. The proposed method is based on the bee algorithm and uses k-nearest neighbors with linear regression to guide the search toward an appropriate solution and limit randomness. Within this process, the Gini importance score was used to select values for imputation. The imputed values therefore aim to improve discriminative power in classification tasks rather than to replicate the actual values of the original dataset. We evaluated the proposed method against frequently used imputation methods, such as k-nearest neighbors, principal component analysis, and nonlinear principal component analysis, comparing root mean square error and the accuracy obtained when the imputed datasets are used in a classification task. The experimental results indicated that the proposed method obtained the best accuracy on all datasets compared with the other methods. Compared with the original dataset, classification models built from the imputed datasets yielded 15-25% higher accuracy in class prediction. Further analysis showed that the feature ranking used in the classification process was affected, leading to a noticeable change in informativeness, as the data imputed by the proposed method boosted discriminating power.
Affiliation(s)
- Kritanat Chungnoy
  - Department of Computer Science, Faculty of Science and Technology, Thammasat University (Rangsit Campus), Pathum Thani, Thailand
- Tanatorn Tanantong
  - Department of Computer Science, Faculty of Science and Technology, Thammasat University (Rangsit Campus), Pathum Thani, Thailand
  - Thammasat University Research Unit in Data Innovation and Artificial Intelligence, Thammasat University (Rangsit Campus), Pathum Thani, Thailand
- Pokpong Songmuang
  - Department of Computer Science, Faculty of Science and Technology, Thammasat University (Rangsit Campus), Pathum Thani, Thailand
  - Thammasat University Research Unit in Data Innovation and Artificial Intelligence, Thammasat University (Rangsit Campus), Pathum Thani, Thailand
6
Manis G, Platakis D, Sassi R. Sample Entropy Computation on Signals with Missing Values. Entropy (Basel, Switzerland) 2024; 26:704. [PMID: 39202174 PMCID: PMC11353543 DOI: 10.3390/e26080704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2024] [Revised: 08/03/2024] [Accepted: 08/14/2024] [Indexed: 09/03/2024]
Abstract
Sample entropy embeds time series into m-dimensional spaces and estimates entropy based on the distances between points in these spaces. However, when samples can be considered as missing or invalid, defining distance in the embedding space becomes problematic. Preprocessing techniques, such as deletion or interpolation, can be employed as a solution, producing time series without missing or invalid values. While deletion ignores missing values, interpolation replaces them using approximations based on neighboring points. This paper proposes a novel approach for the computation of sample entropy when values are considered as missing or invalid. The proposed algorithm accommodates points in the m-dimensional space and handles them there. A theoretical and experimental comparison of the proposed algorithm with deletion and interpolation demonstrates several advantages over these other two approaches. Notably, the deviation from the expected sample entropy value for the proposed methodology consistently proves to be the lowest.
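For reference, the baseline quantity can be sketched as follows. This is the standard sample entropy on a complete signal, not the paper's algorithm for missing values. It assumes Chebyshev distance, a tolerance of `r` times the signal's standard deviation, and the common convention of using the same N − m template start positions for lengths m and m + 1. The function name and defaults are illustrative.

```python
import numpy as np

def sample_entropy(signal, m=2, r=0.2):
    """Standard SampEn(m, r) of a complete 1-D signal (no missing values)."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    tol = r * x.std()

    def count_matches(length):
        # embed the series: one template per start position 0 .. n - m - 1
        templates = np.array([x[i:i + length] for i in range(n - m)])
        # Chebyshev distance between every pair of templates
        dist = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=2)
        # count each pair once and exclude self-matches
        return int(np.sum(np.triu(dist <= tol, k=1)))

    b = count_matches(m)       # matching pairs of length m
    a = count_matches(m + 1)   # matching pairs of length m + 1
    if a == 0 or b == 0:
        return float("inf")    # undefined when no matches are found
    return float(-np.log(a / b))
```

With missing samples, some templates would contain invalid coordinates, which is exactly where the distance above stops being well defined and where deletion, interpolation, or the approach proposed in the paper come in.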
Affiliation(s)
- George Manis
  - Department of Computer Science and Engineering, University of Ioannina, 45500 Ioannina, Greece
- Dimitrios Platakis
  - Department of Computer Science and Engineering, University of Ioannina, 45500 Ioannina, Greece
- Roberto Sassi
  - Dipartimento di Informatica, Università degli Studi di Milano, 20133 Milano, Italy
7
Lane RE, Korbie D, Khanna KK, Mohamed A, Hill MM, Trau M. Defining the relationship between cellular and extracellular vesicle (EV) content in breast cancer via an integrative multi-omic analysis. Proteomics 2024; 24:e2300089. [PMID: 38168906 DOI: 10.1002/pmic.202300089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2023] [Revised: 11/16/2023] [Accepted: 11/17/2023] [Indexed: 01/05/2024]
Abstract
Much recent research has been dedicated to exploring the utility of extracellular vesicles (EVs) as circulating disease biomarkers. Underpinning this work is the assumption that the molecular cargo of EVs directly reflects the originating cell. Few attempts have been made, however, to empirically validate this on the -omic level. To this end, we have performed an integrative multi-omic analysis of a panel of breast cancer cell lines and corresponding EVs. Whole transcriptome analysis validated that the cellular transcriptome remained stable when cultured cells were transitioned to low serum or serum-free medium for EV collection. Transcriptomic profiling of the isolated EVs indicated a positive correlation between transcript levels in cells and EVs, including disease-associated transcripts. Analysis of the EV proteome verified that HER2 protein is present in EVs; however, neither the estrogen (ER) nor the progesterone (PR) receptor protein was detected, regardless of cellular expression. Using multivariate analysis, we derived an EV protein signature to infer cellular patterns of ER and HER2 expression, though the ER protein could not be directly detected. Integrative analyses affirmed that the EV proteome and transcriptome captured key phenotypic hallmarks of the originating cells, supporting the potential of EVs for non-invasive monitoring of breast cancers.
Affiliation(s)
- Rebecca E Lane
  - Australian Institute for Bioengineering and Nanotechnology, Centre for Personalised Nanomedicine, The University of Queensland, St Lucia, Queensland, Australia
- Darren Korbie
  - Australian Institute for Bioengineering and Nanotechnology, Centre for Personalised Nanomedicine, The University of Queensland, St Lucia, Queensland, Australia
- Kum Kum Khanna
  - QIMR Berghofer Medical Research Institute, Herston, Queensland, Australia
- Ahmed Mohamed
  - QIMR Berghofer Medical Research Institute, Herston, Queensland, Australia
- Michelle M Hill
  - QIMR Berghofer Medical Research Institute, Herston, Queensland, Australia
- Matt Trau
  - Australian Institute for Bioengineering and Nanotechnology, Centre for Personalised Nanomedicine, The University of Queensland, St Lucia, Queensland, Australia
  - School of Chemistry and Molecular Biosciences, The University of Queensland, St Lucia, Queensland, Australia
8
Li Q, Button-Simons KA, Sievert MAC, Chahoud E, Foster GF, Meis K, Ferdig MT, Milenković T. Enhancing Gene Co-Expression Network Inference for the Malaria Parasite Plasmodium falciparum. Genes (Basel) 2024; 15:685. [PMID: 38927622 PMCID: PMC11202799 DOI: 10.3390/genes15060685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Revised: 05/22/2024] [Accepted: 05/22/2024] [Indexed: 06/28/2024]
Abstract
BACKGROUND Malaria results in more than 550,000 deaths each year due to drug resistance in the most lethal Plasmodium (P.) species P. falciparum. A full P. falciparum genome was published in 2002, yet 44.6% of its genes have unknown functions. Improving the functional annotation of genes is important for identifying drug targets and understanding the evolution of drug resistance. RESULTS Genes function by interacting with one another. So, analyzing gene co-expression networks can enhance functional annotations and prioritize genes for wet lab validation. Earlier efforts to build gene co-expression networks in P. falciparum have been limited to a single network inference method or gaining biological understanding for only a single gene and its interacting partners. Here, we explore multiple inference methods and aim to systematically predict functional annotations for all P. falciparum genes. We evaluate each inferred network based on how well it predicts existing gene-Gene Ontology (GO) term annotations using network clustering and leave-one-out crossvalidation. We assess overlaps of the different networks' edges (gene co-expression relationships), as well as predicted functional knowledge. The networks' edges are overall complementary: 47-85% of all edges are unique to each network. In terms of the accuracy of predicting gene functional annotations, all networks yielded relatively high precision (as high as 87% for the network inferred using mutual information), but the highest recall reached was below 15%. All networks having low recall means that none of them capture a large amount of all existing gene-GO term annotations. In fact, their annotation predictions are highly complementary, with the largest pairwise overlap of only 27%. We provide ranked lists of inferred gene-gene interactions and predicted gene-GO term annotations for future use and wet lab validation by the malaria community. 
CONCLUSIONS The different networks seem to capture different aspects of the P. falciparum biology in terms of both inferred interactions and predicted gene functional annotations. Thus, relying on a single network inference method should be avoided when possible. SUPPLEMENTARY DATA Attached.
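The edge-complementarity figure quoted in the abstract (the share of a network's edges found in no other network) can be illustrated with plain set operations. This is a sketch under assumptions, not the authors' code: networks are represented as sets of undirected edges, and the names `edge` and `unique_edge_fraction` and the toy gene labels are hypothetical.

```python
def edge(gene_a, gene_b):
    """Undirected edge: the order of the two genes does not matter."""
    return frozenset((gene_a, gene_b))

def unique_edge_fraction(networks):
    """For each network, the fraction of its edges absent from all others.

    networks: dict mapping an inference-method name to a set of edges
    """
    result = {}
    for name, edges in networks.items():
        # pool the edges of every other network
        others = set().union(*(e for n, e in networks.items() if n != name))
        result[name] = len(edges - others) / len(edges) if edges else 0.0
    return result

# toy example with made-up gene names
networks = {
    "correlation": {edge("g1", "g2"), edge("g2", "g3")},
    "mutual_information": {edge("g3", "g2"), edge("g3", "g4"), edge("g1", "g4")},
}
fractions = unique_edge_fraction(networks)
```

In the toy example, one of the two correlation edges and two of the three mutual-information edges are unique, so a high unique fraction across methods is what the abstract describes as complementary networks.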
Affiliation(s)
- Qi Li
  - Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
  - Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, USA
  - Lucy Family Institute for Data & Society, University of Notre Dame, Notre Dame, IN 46556, USA
- Katrina A. Button-Simons
  - Lucy Family Institute for Data & Society, University of Notre Dame, Notre Dame, IN 46556, USA
  - Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46556, USA
- Mackenzie A. C. Sievert
  - Lucy Family Institute for Data & Society, University of Notre Dame, Notre Dame, IN 46556, USA
  - Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46556, USA
- Elias Chahoud
  - Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
  - Department of Preprofessional Studies, University of Notre Dame, Notre Dame, IN 46556, USA
- Gabriel F. Foster
  - Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46556, USA
- Kaitlynn Meis
  - Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46556, USA
- Michael T. Ferdig
  - Lucy Family Institute for Data & Society, University of Notre Dame, Notre Dame, IN 46556, USA
  - Department of Biological Sciences, University of Notre Dame, Notre Dame, IN 46556, USA
- Tijana Milenković
  - Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
  - Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, USA
  - Lucy Family Institute for Data & Society, University of Notre Dame, Notre Dame, IN 46556, USA
9
Gong Y, Ding W, Wang P, Wu Q, Yao X, Yang Q. Evaluating Machine Learning Methods of Analyzing Multiclass Metabolomics. J Chem Inf Model 2023; 63:7628-7641. [PMID: 38079572 DOI: 10.1021/acs.jcim.3c01525] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 12/26/2023]
Abstract
Multiclass metabolomic studies have become popular for revealing the differences in multiple stages of complex diseases, various lifestyles, or the effects of specific treatments. In multiclass metabolomics, there are multiple data manipulation steps for analyzing raw data, which consist of data filtering, the imputation of missing values, data normalization, marker identification, sample separation, classification, and so on. In each step, several to dozens of machine learning methods can be chosen for the given data set, with potentially hundreds or thousands of method combinations in the whole data processing chain. Therefore, a clear understanding of these machine learning methods is helpful for selecting an appropriate method combination for obtaining stable and reliable analytical results of specific data. However, there has rarely been an overall introduction or evaluation of these methods based on multiclass metabolomic data. Herein, detailed descriptions of these machine learning methods in multiple data manipulation steps are reviewed. Moreover, an assessment of these methods was performed using a benchmark data set for multiclass metabolomics. First, 12 imputation methods for imputing missing values were evaluated based on the PSS (Procrustes statistical shape analysis) and NRMSE (normalized root-mean-square error) values. Second, 17 normalization methods for processing multiclass metabolomic data were evaluated by applying the PMAD (pooled median absolute deviation) value. Third, different methods of identifying markers of multiclass metabolomics were evaluated based on the CWrel (relative weighted consistency) value. Fourth, nine classification methods for constructing multiclass models were assessed using the AUC (area under the curve) value. Performance evaluations of machine learning methods are highly recommended to select the most appropriate method combination before performing the final analysis of the given data. 
Overall, detailed descriptions and evaluation of various machine learning methods are expected to improve analyses of multiclass metabolomic data.
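The NRMSE used to score imputation methods can be sketched as below. Several normalizations are in use; this example assumes one common convention, dividing the mean squared error on the artificially masked cells by the variance of their true values, which may differ from the convention used in the paper. The function name `nrmse` and its arguments are illustrative.

```python
import numpy as np

def nrmse(true, imputed, mask):
    """Normalized RMSE over the cells that were masked before imputation.

    true, imputed : arrays of the same shape
    mask          : boolean array, True where a value was removed and imputed
    """
    true = np.asarray(true, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    t, i = true[mask], imputed[mask]
    # squared error relative to the spread of the true values
    return float(np.sqrt(np.mean((t - i) ** 2) / np.var(t)))
```

Under this convention a perfect imputation scores 0, and replacing every masked cell with the mean of the true masked values scores exactly 1, which gives a convenient reference point when ranking methods.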
Affiliation(s)
- Yaguo Gong
  - State Key Laboratory of Quality Research in Chinese Medicine, School of Pharmacy, Macau University of Science and Technology, Macau 999078, China
- Wei Ding
  - State Key Laboratory of Quality Research in Chinese Medicine, School of Pharmacy, Macau University of Science and Technology, Macau 999078, China
- Panpan Wang
  - College of Chemistry and Pharmaceutical Engineering, Huanghuai University, Zhumadian 463000, China
- Qibiao Wu
  - State Key Laboratory of Quality Research in Chinese Medicine, School of Pharmacy, Macau University of Science and Technology, Macau 999078, China
- Xiaojun Yao
  - Centre for Artificial Intelligence Driven Drug Discovery, Faculty of Applied Sciences, Macao Polytechnic University, Macao 999078, China
- Qingxia Yang
  - Zhejiang Provincial Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
  - Department of Bioinformatics, School of Geographic and Biologic Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
10
Jung M, Zimmermann R. Quantitative Mass Spectrometry Characterizes Client Spectra of Components for Targeting of Membrane Proteins to and Their Insertion into the Membrane of the Human ER. Int J Mol Sci 2023; 24:14166. [PMID: 37762469 PMCID: PMC10532041 DOI: 10.3390/ijms241814166] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/09/2023] [Revised: 09/07/2023] [Accepted: 09/12/2023] [Indexed: 09/29/2023]
Abstract
To elucidate the redundancy in the components for the targeting of membrane proteins to the endoplasmic reticulum (ER) and/or their insertion into the ER membrane under physiological conditions, we previously analyzed different human cells by label-free quantitative mass spectrometry. The HeLa and HEK293 cells had been depleted of a certain component by siRNA or CRISPR/Cas9 treatment or were deficient patient fibroblasts and compared to the respective control cells by differential protein abundance analysis. In addition to clients of the SRP and Sec61 complex, we identified membrane protein clients of components of the TRC/GET, SND, and PEX3 pathways for ER targeting, and Sec62, Sec63, TRAM1, and TRAP as putative auxiliary components of the Sec61 complex. Here, a comprehensive evaluation of these previously described differential protein abundance analyses, as well as similar analyses on the Sec61-co-operating EMC and the characteristics of the topogenic sequences of the various membrane protein clients, i.e., the client spectra of the components, are reported. As expected, the analysis characterized membrane protein precursors with cleavable amino-terminal signal peptides or amino-terminal transmembrane helices as predominant clients of SRP, as well as the Sec61 complex, while precursors with more central or even carboxy-terminal ones were found to dominate the client spectra of the SND and TRC/GET pathways for membrane targeting. For membrane protein insertion, the auxiliary Sec61 channel components indeed share the client spectra of the Sec61 complex to a large extent. However, we also detected some unexpected differences, particularly related to EMC, TRAP, and TRAM1. The possible mechanistic implications for membrane protein biogenesis at the human ER are discussed and can be expected to eventually advance our understanding of the mechanisms that are involved in the so-called Sec61-channelopathies, resulting from deficient ER protein import.
Affiliation(s)
- Richard Zimmermann
  - Medical Biochemistry and Molecular Biology, Saarland University, 66421 Homburg, Germany
11
Kong W, Wong BJH, Hui HWH, Lim KP, Wang Y, Wong L, Goh WWB. ProJect: a powerful mixed-model missing value imputation method. Brief Bioinform 2023:bbad233. [PMID: 37419612 DOI: 10.1093/bib/bbad233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2023] [Revised: 05/24/2023] [Accepted: 06/05/2023] [Indexed: 07/09/2023]
Abstract
Missing values (MVs) can adversely impact data analysis and machine-learning model development. We propose a novel mixed-model method for missing value imputation (MVI). This method, ProJect (short for Protein inJection), is a powerful and meaningful improvement over existing MVI methods such as Bayesian principal component analysis (PCA), probabilistic PCA, local least squares and quantile regression imputation of left-censored data. We rigorously tested ProJect on various high-throughput data types, including genomics and mass spectrometry (MS)-based proteomics. Specifically, we utilized renal cancer (RC) data acquired using DIA-SWATH, ovarian cancer (OC) data acquired using DIA-MS, and bladder (BladderBatch) and glioblastoma (GBM) microarray gene expression datasets. Our results demonstrate that ProJect consistently performs better than other referenced MVI methods. It achieves the lowest normalized root mean square error (on average, scoring 45.92% less error in RC_C, 27.37% in RC_full, 29.22% in OC, 23.65% in BladderBatch and 20.20% in GBM relative to the closest competing method) and the lowest Procrustes sum of squared error (Procrustes SS) (exhibiting 79.71% less error in RC_C, 38.36% in RC_full, 18.13% in OC, 74.74% in BladderBatch and 30.79% in GBM compared to the next best method). ProJect also leads with the highest correlation coefficient among all types of MV combinations (0.64% higher in RC_C, 0.24% in RC_full, 0.55% in OC, 0.39% in BladderBatch and 0.27% in GBM versus the second-best performing method). ProJect's key strength is its ability to handle different types of MVs commonly found in real-world data. Unlike most MVI methods that are designed to handle only one type of MV, ProJect employs a decision-making algorithm that first determines if an MV is missing at random or missing not at random. It then employs targeted imputation strategies for each MV type, resulting in more accurate and reliable imputation outcomes.
An R implementation of ProJect is available at https://github.com/miaomiao6606/ProJect.
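The normalized root mean square error used as a benchmark in the abstract above can be illustrated with a minimal sketch. NRMSE has several normalizations in the literature; the version below divides the mean squared error by the population variance of the true values, which is an assumption and not necessarily the definition used by the ProJect authors. The function name `nrmse` and the toy data are hypothetical.

```python
import math


def nrmse(true_values, imputed_values):
    """NRMSE over the masked entries.

    Assumed definition: sqrt(mean squared error / population variance of
    the true values). Other normalizations (range, mean) also exist.
    """
    if len(true_values) != len(imputed_values) or len(true_values) < 2:
        raise ValueError("need two equally sized sequences with at least two values")
    n = len(true_values)
    mse = sum((t - p) ** 2 for t, p in zip(true_values, imputed_values)) / n
    mean_true = sum(true_values) / n
    var_true = sum((t - mean_true) ** 2 for t in true_values) / n
    if var_true == 0:
        raise ValueError("true values have zero variance")
    return math.sqrt(mse / var_true)


# Toy example: values that were masked on purpose, then imputed.
truth = [1.0, 2.0, 3.0, 4.0]
perfect = nrmse(truth, truth)          # perfect recovery
mean_fill = nrmse(truth, [2.5] * 4)    # every entry replaced by the mean
```

Under this normalization, perfect recovery scores 0 and filling every entry with the mean of the true values scores 1, so a value below 1 indicates an imputation that beats mean filling.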
Affiliation(s)
- Weijia Kong
- School of Biological Sciences, Nanyang Technological University, Singapore
- Department of Computer Science, National University of Singapore, Singapore
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
- Kai Peng Lim
- School of Biological Sciences, Nanyang Technological University, Singapore
- Yulan Wang
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
- Limsoon Wong
- Department of Computer Science, National University of Singapore, Singapore
- Wilson Wen Bin Goh
- School of Biological Sciences, Nanyang Technological University, Singapore
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore
- Center for Biomedical Informatics, Nanyang Technological University, Singapore
|
12
|
Dutt M, Hartel G, Richards RS, Shah AK, Mohamed A, Apostolidou S, Gentry‐Maharaj A, Australian Ovarian Cancer Study Group, Hooper JD, Perrin LC, Menon U, Hill MM. Discovery and validation of serum glycoprotein biomarkers for high grade serous ovarian cancer. Proteomics Clin Appl 2023; 17:e2200114. [PMID: 37147936 PMCID: PMC7615076 DOI: 10.1002/prca.202200114] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/16/2022] [Revised: 04/06/2023] [Accepted: 04/27/2023] [Indexed: 05/07/2023]
Abstract
PURPOSE This study aimed to identify serum glycoprotein biomarkers for early detection of high-grade serous ovarian cancer (HGSOC), the most common and aggressive histotype of ovarian cancer. EXPERIMENTAL DESIGN The glycoproteomics pipeline lectin magnetic bead array (LeMBA)-mass spectrometry (MS) was used in age-matched case-control serum samples. Clinical samples collected at diagnosis were divided into discovery (n = 30) and validation (n = 98) sets. We also analysed a set of preclinical sera (n = 30) collected prior to HGSOC diagnosis in the UK Collaborative Trial of Ovarian Cancer Screening. RESULTS A 7-lectin LeMBA-MS/MS discovery screen shortlisted 59 candidate proteins and three lectins. Validation analysis using 3-lectin LeMBA-multiple reaction monitoring (MRM) confirmed elevated A1AT, AACT, CO9, HPT and ITIH3 and reduced A2MG, ALS, IBP3 and PON1 glycoforms in HGSOC. The best performing multimarker signature had 87.7% area under the receiver operating curve, 90.7% specificity and 70.4% sensitivity for distinguishing HGSOC from benign and healthy groups. In the preclinical set, CO9, ITIH3 and A2MG glycoforms were altered in samples collected 11.1 ± 5.1 months prior to HGSOC diagnosis, suggesting potential for early detection. CONCLUSIONS AND CLINICAL RELEVANCE Our findings provide evidence of candidate early HGSOC serum glycoprotein biomarkers, laying the foundation for further study in larger cohorts.
Affiliation(s)
- Mriga Dutt
- QIMR Berghofer Medical Research Institute, Brisbane, QLD, Australia
- Gunter Hartel
- QIMR Berghofer Medical Research Institute, Brisbane, QLD, Australia
- Alok K. Shah
- QIMR Berghofer Medical Research Institute, Brisbane, QLD, Australia
- Ahmed Mohamed
- QIMR Berghofer Medical Research Institute, Brisbane, QLD, Australia
- Sophia Apostolidou
- MRC Clinical Trials Unit, Institute of Clinical Trials and Methodology, University College London, London, UK
- Aleksandra Gentry‐Maharaj
- MRC Clinical Trials Unit, Institute of Clinical Trials and Methodology, University College London, London, UK
- John D. Hooper
- Mater Research Institute – The University of Queensland, Translational Research Institute, Woolloongabba, QLD, Australia
- Lewis C. Perrin
- Mater Research Institute – The University of Queensland, Translational Research Institute, Woolloongabba, QLD, Australia
- Mater Adult Hospital, South Brisbane, QLD, Australia
- Usha Menon
- MRC Clinical Trials Unit, Institute of Clinical Trials and Methodology, University College London, London, UK
- Michelle M. Hill
- QIMR Berghofer Medical Research Institute, Brisbane, QLD, Australia
- UQ Centre for Clinical Research, Faculty of Medicine, The University of Queensland, Brisbane, Australia
|
13
|
Sun W, He Q, Liu J, Xiao X, Wu Y, Zhou S, Ma S, Wang R. Dynamic monitoring of maize grain quality based on remote sensing data. Frontiers in Plant Science 2023; 14:1177477. [PMID: 37426960 PMCID: PMC10325687 DOI: 10.3389/fpls.2023.1177477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2023] [Accepted: 05/31/2023] [Indexed: 07/11/2023]
Abstract
Remote sensing data have been widely used to monitor crop development, grain yield, and quality, while precise monitoring of quality traits, especially grain starch and oil contents considering meteorological elements, still needs to be improved. In this study, a field experiment with four sowing dates (8 June, 18 June, 28 June, and 8 July) was conducted in 2018-2020. A scalable annual and inter-annual quality prediction model for summer maize in different growth periods was established using hierarchical linear modeling (HLM), which combined hyperspectral and meteorological data. Compared with multiple linear regression (MLR) using vegetation indices (VIs), the prediction accuracy of HLM was clearly improved, with the best R², root mean square error (RMSE), and mean absolute error (MAE) values of 0.90, 0.10, and 0.08, respectively, for grain starch content (GSC); 0.87, 0.10, and 0.08 for grain protein content (GPC); and 0.74, 0.13, and 0.10 for grain oil content (GOC). In addition, combining the tasseling, grain-filling, and maturity stages further improved the predictive power for GSC (R² = 0.96), combining the grain-filling and maturity stages further improved it for GPC (R² = 0.90), and combining the jointing and tasseling stages improved it for GOC (R² = 0.85). The results also showed that meteorological factors, especially precipitation, had a great influence on grain quality monitoring. Our study provides a new approach to crop quality monitoring by remote sensing.
Affiliation(s)
- Weiwei Sun
- College of Resources and Environmental Sciences, China Agricultural University, Beijing, China
- Qijin He
- College of Resources and Environmental Sciences, China Agricultural University, Beijing, China
- Collaborative Innovation Center on Forecast and Evaluation of Meteorological Disasters (CIC-FEMD), Nanjing University of Information Science and Technology, Nanjing, China
- Jiahong Liu
- College of Resources and Environmental Sciences, China Agricultural University, Beijing, China
- Xiao Xiao
- College of Resources and Environmental Sciences, China Agricultural University, Beijing, China
- Yaxin Wu
- College of Resources and Environmental Sciences, China Agricultural University, Beijing, China
- Sijia Zhou
- College of Resources and Environmental Sciences, China Agricultural University, Beijing, China
- Selimai Ma
- College of Resources and Environmental Sciences, China Agricultural University, Beijing, China
- Rongwan Wang
- College of Resources and Environmental Sciences, China Agricultural University, Beijing, China
|
14
|
Wu E, Trevino AE, Wu Z, Swanson K, Kim HJ, D’Angio HB, Preska R, Chiou AE, Charville GW, Dalerba P, Duvvuri U, Colevas AD, Levi J, Bedi N, Chang S, Sunwoo J, Egloff AM, Uppaluri R, Mayer AT, Zou J. 7-UP: Generating in silico CODEX from a small set of immunofluorescence markers. PNAS Nexus 2023; 2:pgad171. [PMID: 37275261 PMCID: PMC10236358 DOI: 10.1093/pnasnexus/pgad171] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 08/08/2022] [Accepted: 05/15/2023] [Indexed: 06/07/2023]
Abstract
Multiplex immunofluorescence (mIF) assays multiple protein biomarkers on a single tissue section. Recently, high-plex CODEX (co-detection by indexing) systems enable simultaneous imaging of 40+ protein biomarkers, unlocking more detailed molecular phenotyping, leading to richer insights into cellular interactions and disease. However, high-plex data can be slower and more costly to collect, limiting its applications, especially in clinical settings. We propose a machine learning framework, 7-UP, that can computationally generate in silico 40-plex CODEX at single-cell resolution from a standard 7-plex mIF panel by leveraging cellular morphology. We demonstrate the usefulness of the imputed biomarkers in accurately classifying cell types and predicting patient survival outcomes. Furthermore, 7-UP's imputations generalize well across samples from different clinical sites and cancer types. 7-UP opens the possibility of in silico CODEX, making insights from high-plex mIF more widely available.
Affiliation(s)
- Zhenqin Wu
- Enable Medicine, Menlo Park, CA 94025, USA
- Department of Chemistry, Stanford University, Stanford, CA 94305, USA
- Kyle Swanson
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Piero Dalerba
- Department of Pathology and Cell Biology, Columbia University, New York, NY 10027, USA
- Umamaheswar Duvvuri
- Department of Otolaryngology, University of Pittsburgh, Pittsburgh, PA 15213, USA
- Jelena Levi
- CellSight Technologies, San Francisco, CA 94107, USA
- Nikita Bedi
- Department of Otolaryngology-Head and Neck Surgery, Stanford University, Stanford, CA 94305, USA
- Serena Chang
- Department of Otolaryngology-Head and Neck Surgery, Stanford University, Stanford, CA 94305, USA
- John Sunwoo
- Department of Otolaryngology-Head and Neck Surgery, Stanford University, Stanford, CA 94305, USA
- Ann Marie Egloff
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Ravindra Uppaluri
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Aaron T Mayer
- To whom correspondence should be addressed: (A.E.T.); (A.T.M.); (J.Z.)
- James Zou
- To whom correspondence should be addressed: (A.E.T.); (A.T.M.); (J.Z.)
|
15
|
Sapashnik D, Newman R, Pietras CM, Zhou D, Devkota K, Qu F, Kofman L, Boudreau S, Fried I, Slonim DK. Cell-specific imputation of drug connectivity mapping with incomplete data. PLoS One 2023; 18:e0278289. [PMID: 36795645 PMCID: PMC9934325 DOI: 10.1371/journal.pone.0278289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2021] [Accepted: 11/15/2022] [Indexed: 02/17/2023]
Abstract
Drug repositioning allows expedited discovery of new applications for existing compounds, but re-screening vast compound libraries is often prohibitively expensive. "Connectivity mapping" is a process that links drugs to diseases by identifying compounds whose impact on expression in a collection of cells reverses the disease's impact on expression in disease-relevant tissues. The LINCS project has expanded the universe of compounds and cells for which data are available, but even with this effort, many clinically useful combinations are missing. To evaluate the possibility of repurposing drugs despite missing data, we compared collaborative filtering using either neighborhood-based or SVD imputation methods to two naive approaches via cross-validation. Methods were evaluated for their ability to predict drug connectivity despite missing data. Predictions improved when cell type was taken into account. Neighborhood collaborative filtering was the most successful method, with the best improvements in non-immortalized primary cells. We also explored which classes of compounds are most and least reliant on cell type for accurate imputation. We conclude that even for cells in which drug responses have not been fully characterized, it is possible to identify unassayed drugs that reverse in those cells the expression signatures observed in disease.
Affiliation(s)
- Diana Sapashnik
- Department of Computer Science, Tufts University, Medford, MA, United States of America
- Rebecca Newman
- Department of Computer Science, Tufts University, Medford, MA, United States of America
- Di Zhou
- Department of Computer Science, Tufts University, Medford, MA, United States of America
- Kapil Devkota
- Department of Computer Science, Tufts University, Medford, MA, United States of America
- Fangfang Qu
- Department of Computer Science, Tufts University, Medford, MA, United States of America
- Lior Kofman
- Department of Computer Science, Tufts University, Medford, MA, United States of America
- Sean Boudreau
- Department of Computer Science, Tufts University, Medford, MA, United States of America
- Inbar Fried
- Department of Medicine, University of North Carolina School of Medicine, Chapel Hill, NC, United States of America
- Donna K. Slonim
- Department of Computer Science, Tufts University, Medford, MA, United States of America
- Department of Immunology, Tufts University School of Medicine, Boston, MA, United States of America
|
16
|
Fan S, Wilson CM, Fridley BL, Li Q. Statistics and Machine Learning in Mass Spectrometry-Based Metabolomics Analysis. Methods Mol Biol 2023; 2629:247-269. [PMID: 36929081 DOI: 10.1007/978-1-0716-2986-4_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/18/2023]
Abstract
In this chapter, we review cutting-edge statistical and machine learning methods for missing value imputation, normalization, and downstream analyses in mass spectrometry metabolomics studies, illustrated with example datasets. Methods for missing peak recovery include simple imputation by zero or the limit of detection, regression-based or distribution-based imputation, and prediction by random forest. Batch effects can be removed by data-driven, internal standard-based, or quality control sample-based normalization. We also summarize different types of statistical analysis for metabolomics and clinical outcomes, such as inference on metabolic biomarkers, clustering of metabolomic profiles, metabolite module building, and integrative analysis with the transcriptome.
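The simplest imputation strategies named in the abstract above, filling with zero or with a limit-of-detection value, can be sketched as follows. This is an illustrative sketch and not code from the chapter: missing peaks are encoded as `None`, the function names are hypothetical, and half of the feature's minimum observed intensity is used as a stand-in for the limit of detection, which is a common convention but an assumption here.

```python
def impute_constant(row, fill):
    """Replace every missing peak (None) in one feature's intensities with `fill`."""
    return [fill if v is None else v for v in row]


def impute_half_min(row):
    """Fill missing peaks with half of the smallest observed intensity.

    Half-minimum is an assumed proxy for the limit of detection.
    """
    observed = [v for v in row if v is not None]
    if not observed:
        return list(row)  # nothing observed, so nothing to estimate from
    return impute_constant(row, min(observed) / 2.0)


# One metabolite measured across four samples; two peaks were not detected.
peaks = [8.0, None, 4.0, None]
zero_filled = impute_constant(peaks, 0.0)   # [8.0, 0.0, 4.0, 0.0]
lod_filled = impute_half_min(peaks)         # [8.0, 2.0, 4.0, 2.0]
```

Both strategies assume the peak is missing because the signal fell below what the instrument could detect, so they are a poor fit for values that are missing at random.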
Affiliation(s)
- Sili Fan
- Graduate Group of Biostatistics, University of California, Davis, CA, USA
- Christopher M Wilson
- Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
- Brooke L Fridley
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center & Research Institute, Tampa, FL, USA
- Qian Li
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN, USA
|
17
|
Qureshi R, Zou B, Alam T, Wu J, Lee VHF, Yan H. Computational Methods for the Analysis and Prediction of EGFR-Mutated Lung Cancer Drug Resistance: Recent Advances in Drug Design, Challenges and Future Prospects. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:238-255. [PMID: 35007197 DOI: 10.1109/tcbb.2022.3141697] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 05/04/2023]
Abstract
Lung cancer is a major cause of cancer deaths worldwide and has a very low survival rate. Non-small cell lung cancer (NSCLC) is the largest subset of lung cancers, accounting for about 85% of all cases. It has been well established that a mutation in the epidermal growth factor receptor (EGFR) can lead to lung cancer. EGFR tyrosine kinase inhibitors (TKIs) are developed to target the kinase domain of EGFR. These TKIs produce promising results at the initial stage of therapy, but their efficacy becomes limited due to the development of drug resistance. In this paper, we provide a comprehensive overview of computational methods for understanding drug resistance mechanisms. The important EGFR mutants and the different generations of EGFR-TKIs, with their survival and response rates, are discussed. Next, we evaluate the role of important EGFR parameters in the drug resistance mechanism, including structural dynamics, hydrogen bonds, stability, dimerization, binding free energies, and signaling pathways. Personalized drug resistance prediction models, drug response curves, drug synergy, and other data-driven methods are also discussed. Recent advancements in deep learning, such as AlphaFold2 and deep generative models, as well as big data analytics and the applications of statistics and permutation, are also highlighted. We explore limitations in the current methodologies and discuss strategies to overcome them. We believe this review will serve as a reference for researchers applying computational techniques to precision medicine, analyzing structures of protein-drug complexes, drug discovery, and understanding drug response and resistance mechanisms in lung cancer patients.
|
18
|
Kong W, Hui HWH, Peng H, Goh WWB. Dealing with missing values in proteomics data. Proteomics 2022; 22:e2200092. [PMID: 36349819 DOI: 10.1002/pmic.202200092] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 07/14/2022] [Revised: 09/15/2022] [Accepted: 10/11/2022] [Indexed: 11/10/2022]
Abstract
Proteomics data are often plagued with missingness issues. These missing values (MVs) threaten the integrity of subsequent statistical analyses by reducing statistical power, introducing bias, and failing to represent the true sample. Over the years, several categories of missing value imputation (MVI) methods have been developed and adapted for proteomics data. These MVI methods perform their tasks based on different prior assumptions (e.g., data are normally or independently distributed) and operating principles (e.g., the algorithm is built to address random missingness only), resulting in varying levels of performance even when dealing with the same dataset. Thus, to achieve a satisfactory outcome, a suitable MVI method must be selected. To guide the choice of a suitable MVI method, we provide a decision chart that facilitates strategic considerations for datasets presenting different characteristics. We also bring attention to other issues that can affect MVI performance, such as the presence of confounders (e.g., batch effects). These, too, should be considered during or before MVI.
Affiliation(s)
- Weijia Kong
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Harvard Wai Hann Hui
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Hui Peng
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Wilson Wen Bin Goh
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Centre for Biomedical Informatics, Nanyang Technological University, Singapore, Singapore
|
19
|
Li H, Cao Q, Bai Q, Li Z, Hu H. Multistate time series imputation using generative adversarial network with applications to traffic data. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07961-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
20
|
Dubey A, Rasool A. Usage of Clustering and Weighted Nearest Neighbors for Efficient Missing Data Imputation of Microarray Gene Expression Dataset. Advanced Theory and Simulations 2022. [DOI: 10.1002/adts.202200460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Affiliation(s)
- Aditya Dubey
- Department of Computer Science and Engineering, Maulana Azad National Institute of Technology, Bhopal 462003, India
- Akhtar Rasool
- Department of Computer Science and Engineering, Maulana Azad National Institute of Technology, Bhopal 462003, India
|
21
|
Soemartojo SM, Siswantining T, Fernando Y, Sarwinda D, Al-Ash HS, Syarofina S, Saputra N. Iterative bicluster-based Bayesian principal component analysis and least squares for missing-value imputation in microarray and RNA-sequencing data. Mathematical Biosciences and Engineering: MBE 2022; 19:8741-8759. [PMID: 35942733 DOI: 10.3934/mbe.2022405] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Microarray and RNA-sequencing (RNA-seq) techniques each produce gene expression data that can be expressed as a matrix, which often contains missing values. Thus, a process of missing-value imputation that uses coherence information of the dataset is necessary. Existing imputation methods, such as iterative bicluster-based least squares (bi-iLS), use biclustering to estimate the missing values because genes are only similar under correlative experimental conditions. They also use the row average to obtain a temporary complete matrix, which is considered a flaw: the row average cannot reflect the real structure of the dataset because it only uses the information of an individual row. Therefore, we propose the use of Bayesian principal component analysis (BPCA) to obtain the temporary complete matrix instead of using the row average in bi-iLS. This alteration produces a new missing-value imputation method called iterative bicluster-based Bayesian principal component analysis and least squares (bi-BPCA-iLS). Several experiments were conducted on two-dimensional independent gene expression datasets: a microarray dataset (the cell-cycle expression dataset of the yeast Saccharomyces cerevisiae) and an RNA-seq dataset (gene expression data from Schizosaccharomyces pombe). For the microarray dataset, the proposed bi-BPCA-iLS method showed a significant overall improvement in normalized root mean square error (NRMSE) of 10.6% over local least squares (LLS) and 0.6% over bi-iLS. For the RNA-seq dataset, it showed an overall NRMSE improvement of 8.2% over LLS and 3.1% over bi-iLS. The additional computational time of bi-BPCA-iLS is not significant compared to bi-iLS.
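The row-average step that the abstract above identifies as a weakness can be made concrete with a short sketch. This is an illustration only, not the bi-iLS or bi-BPCA-iLS implementation: missing entries are encoded as `None`, the function name `row_average_fill` is hypothetical, and falling back to 0.0 for a row with no observed values is an assumption.

```python
def row_average_fill(matrix):
    """Build a temporary complete matrix by filling each gap with its row mean.

    Each row (gene) is treated in isolation, which is why the estimate
    ignores any structure shared across rows.
    """
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # Assumed fallback: a row with no observed values is filled with 0.0.
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


# Two genes across three conditions, with one missing value each.
expr = [
    [1.0, None, 3.0],
    [None, 4.0, 4.0],
]
temporary = row_average_fill(expr)  # [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]
```

Because every gap in a row receives the same value regardless of what other genes do in that condition, a method that models the whole matrix, such as the BPCA step proposed in the abstract, has more information to work with.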
Affiliation(s)
- Saskya Mary Soemartojo
- Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
- Titin Siswantining
- Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
- Yoel Fernando
- Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
- Devvi Sarwinda
- Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
- Herley Shaori Al-Ash
- Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
- Sarah Syarofina
- Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
- Noval Saputra
- Department of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Indonesia
|
22
|
A joint optimization framework integrated with biological knowledge for clustering incomplete gene expression data. Soft Comput 2022. [DOI: 10.1007/s00500-022-07180-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
23
|
Pham TH, Qiu Y, Liu J, Zimmer S, O’Neill E, Xie L, Zhang P. Chemical-induced gene expression ranking and its application to pancreatic cancer drug repurposing. Patterns 2022; 3:100441. [PMID: 35465231 PMCID: PMC9023899 DOI: 10.1016/j.patter.2022.100441] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/02/2021] [Revised: 09/13/2021] [Accepted: 01/12/2022] [Indexed: 12/18/2022]
Abstract
Chemical-induced gene expression profiles provide critical information about chemicals in a biological system, thus offering new opportunities for drug discovery. Despite their success, large-scale analysis leveraging gene expression is limited by time and cost. Although several methods for predicting gene expression have been proposed, they focus only on imputation and classification settings, which have limited application to real-world drug discovery scenarios. Therefore, a chemical-induced gene expression ranking (CIGER) framework is proposed to target a more realistic but more challenging setting in which overall rankings in gene expression profiles induced by de novo chemicals are predicted. The experimental results show that CIGER significantly outperforms existing methods in both ranking and classification metrics. Furthermore, a drug screening pipeline based on CIGER is proposed to identify potential treatments for drug-resistant pancreatic cancer. Our predictions have been validated by experiments, thereby showing the effectiveness of CIGER for phenotypic compound screening in precision medicine.
- A new deep-learning method (CIGER) for chemical-induced gene expression ranking
- CIGER can predict gene expression for de novo chemicals from chemical structures
- We discovered drugs for the treatment of drug-resistant pancreatic cancer
In recent years, a phenotype-based drug discovery approach using chemical-induced gene expression has been shown to be effective in drug discovery and precision medicine. However, it is not feasible to experimentally determine chemical-induced gene expression for all available chemicals of interest, which hinders the application of gene expression-based compound screening on a large scale. Thus, it is crucial to design a computational approach that can generate gene expression information for any chemical. We proposed a new deep-learning framework named chemical-induced gene expression ranking (CIGER) to predict a landmark gene expression profile (i.e., gene ranking) induced by de novo chemicals based on their chemical structures. Leveraging CIGER, we predicted and experimentally validated that several existing drugs can increase the therapeutic response in drug-resistant pancreatic cancer. Our results demonstrate the effectiveness of CIGER for precision drug discovery in practice.
Affiliation(s)
- Thai-Hoang Pham
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA
- Yue Qiu
- Ph.D. Program in Biology, The Graduate Center, The City University of New York, New York, NY 10016, USA
- Jiahui Liu
- Department of Oncology, University of Oxford, Oxford OX3 7DQ, UK
- Eric O’Neill
- Department of Oncology, University of Oxford, Oxford OX3 7DQ, UK
- EpiCombi.AI Therapeutics, Oxford OX7 3SB, UK
- Lei Xie
- Ph.D. Program in Biology, The Graduate Center, The City University of New York, New York, NY 10016, USA
- Department of Computer Science, Hunter College, The City University of New York, New York, NY 10065, USA
- Ph.D. Program in Computer Science and Biochemistry, The Graduate Center, The City University of New York, New York, NY 10016, USA
- Helen and Robert Appel Alzheimer’s Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University, New York, NY 10021, USA
- Ping Zhang
- Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, USA
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH 43210, USA
- Translational Data Analytics Institute, The Ohio State University, Columbus, OH 43210, USA
- Corresponding author
|
24
|
Shi Q, Miao T, Liu Y, Hu L, Yang H, Shen H, Piao M, Huang Z, Zhang Z. Fabrication and Decryption of a Microarray of Digital Dithiosuccinimide Oligomers. Macromol Rapid Commun 2022; 43:e2200029. [PMID: 35322486 DOI: 10.1002/marc.202200029] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/15/2022] [Revised: 03/11/2022] [Indexed: 11/11/2022]
Abstract
Digital polymers with precisely arranged binary units provide an important option for information storage. This is especially true if the digital polymers are assembled in a device, as this would greatly benefit data writing and reading in practice. Herein, inspired by the DNA microarray technique, programmable information storing and reading on a mass spectrometry target plate is proposed. First, an array of 4-bit sequence-coded dithiosuccinimide oligomers was efficiently built through sequential thiol-maleimide Michael couplings, with good sequence readability by tandem mass spectrometry (MS/MS). Then, to engineer a microarray for information storage, a programmed robotic arm was specifically designed for precisely loading sequence-coded oligomers onto the target plate, and decoding software was developed for efficient readout of the data from MS/MS sequencing. Notably, short sequence-coded oligomer chains can be used to write long strings of information, and the extra error-correction codes that are usually required are unnecessary because of the inherent concomitant fragmentation signals. Not only text but also bitimages can be automatically stored and decoded with excellent accuracy. This work provides a promising platform of digital polymers for programmable information storing and reading.
Affiliation(s)
- Qiunan Shi
- Q. Shi, T. Miao, Y. Liu, Prof. H. Shen, Prof. Z. Huang, College of Chemistry, Chemical Engineering and Materials Science, Soochow University, Suzhou, 215123, China
| | - Tengfei Miao
- Q. Shi, T. Miao, Y. Liu, Prof. H. Shen, Prof. Z. Huang, College of Chemistry, Chemical Engineering and Materials Science, Soochow University, Suzhou, 215123, China
| | - Yuxin Liu
- Q. Shi, T. Miao, Y. Liu, Prof. H. Shen, Prof. Z. Huang, College of Chemistry, Chemical Engineering and Materials Science, Soochow University, Suzhou, 215123, China
| | - Lihua Hu
- Dr. L. Hu, Analysis and Testing Center, Soochow University, Suzhou, 215123, China
| | - Hai Yang
- H. Yang, Eurosmart Intelligent Technology Research Institute, Nanjing, 211106, China
| | - Hang Shen
- Q. Shi, T. Miao, Y. Liu, Prof. H. Shen, Prof. Z. Huang, College of Chemistry, Chemical Engineering and Materials Science, Soochow University, Suzhou, 215123, China
| | - Minghao Piao
- Prof. M. Piao, Collaborative Innovation Center of Novel Software Technology and Industrialization, School of Computer Science and Technology, Soochow University, Suzhou, 215123, China
| | - Zhihao Huang
- Q. Shi, T. Miao, Y. Liu, Prof. H. Shen, Prof. Z. Huang, College of Chemistry, Chemical Engineering and Materials Science, Soochow University, Suzhou, 215123, China
| | - Zhengbiao Zhang
- Prof. Z. Zhang, College of Chemistry, Chemical Engineering and Materials Science, State Key Laboratory of Radiation Medicine and Protection, Soochow University, Suzhou, 215123, China
25
Mohammad Mirzaei N, Changizi N, Asadpoure A, Su S, Sofia D, Tatarova Z, Zervantonakis IK, Chang YH, Shahriyari L. Investigating key cell types and molecules dynamics in PyMT mice model of breast cancer through a mathematical model. PLoS Comput Biol 2022; 18:e1009953. [PMID: 35294447 PMCID: PMC8959189 DOI: 10.1371/journal.pcbi.1009953] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/08/2021] [Revised: 03/28/2022] [Accepted: 02/22/2022] [Indexed: 02/07/2023]
Abstract
The most common kind of cancer among women is breast cancer. Understanding the tumor microenvironment and the interactions between individual cells and cytokines assists us in arriving at more effective treatments. Here, we develop a data-driven mathematical model to investigate the dynamics of key cell types and cytokines involved in breast cancer development. We use time-course gene expression profiles of a mouse model to estimate the relative abundance of cells and cytokines. We then employ a least-squares optimization method to evaluate the model's parameters based on the mice data. The resulting dynamics of the cells and cytokines obtained from the optimal set of parameters exhibit a decent agreement between the data and predictions. We perform a sensitivity analysis to identify the crucial parameters of the model and then perform a local bifurcation on them. The results reveal a strong connection between adipocytes, IL6, and the cancer population, suggesting them as potential targets for therapies.
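The least-squares fitting step can be sketched on a toy problem. The example below is not the authors' model: it assumes a single logistic growth equation, forward-Euler integration, and a plain grid search over candidate growth rates in place of a real optimizer. All names and parameter values are hypothetical.

```python
def simulate_logistic(rate, x0, capacity, n_steps, dt):
    """Integrate dx/dt = rate * x * (1 - x / capacity) with forward Euler."""
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        xs.append(x + dt * rate * x * (1.0 - x / capacity))
    return xs


def sum_squared_error(predicted, observed):
    """Least-squares objective: sum of squared residuals."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def fit_rate(observed, x0, capacity, dt, candidates):
    """Return the candidate rate whose simulation best matches the data."""
    n_steps = len(observed) - 1
    return min(
        candidates,
        key=lambda r: sum_squared_error(
            simulate_logistic(r, x0, capacity, n_steps, dt), observed
        ),
    )


# Synthetic "time-course data" generated with a known rate of 0.5.
observed = simulate_logistic(0.5, 0.1, 1.0, 20, 0.5)
candidates = [i / 100 for i in range(1, 101)]
estimated = fit_rate(observed, 0.1, 1.0, 0.5, candidates)
print(estimated)
```

A real multi-equation model would replace the grid search with a numerical optimizer and a more accurate ODE solver, but the objective being minimized has the same form.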
Affiliation(s)
- Navid Mohammad Mirzaei
- Department of Mathematics and Statistics, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
| | - Navid Changizi
- Department of Civil and Environmental Engineering, University of Massachusetts, Dartmouth, Massachusetts, United States of America
| | - Alireza Asadpoure
- Department of Civil and Environmental Engineering, University of Massachusetts, Dartmouth, Massachusetts, United States of America
| | - Sumeyye Su
- Department of Mathematics and Statistics, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
| | - Dilruba Sofia
- Department of Mathematics and Statistics, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
| | - Zuzana Tatarova
- Department of Biomedical Engineering and OHSU Center for Spatial Systems Biomedicine (OCSSB), Oregon Health and Science University, Portland, Oregon, United States of America
| | - Ioannis K. Zervantonakis
- Department of Bioengineering, UPMC Hillman Cancer Center, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Young Hwan Chang
- Department of Biomedical Engineering and OHSU Center for Spatial Systems Biomedicine (OCSSB), Oregon Health and Science University, Portland, Oregon, United States of America
| | - Leili Shahriyari
- Department of Mathematics and Statistics, University of Massachusetts Amherst, Amherst, Massachusetts, United States of America
26
Ni Z, Zheng X, Zheng X, Zou X. scLRTD: A Novel Low Rank Tensor Decomposition Method for Imputing Missing Values in Single-Cell Multi-Omics Sequencing Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1144-1153. [PMID: 32960767 DOI: 10.1109/tcbb.2020.3025804] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/11/2023]
Abstract
With the successful application of single-cell sequencing technology, a large amount of single-cell multi-omics sequencing (scMO-seq) data has been generated, which enables researchers to study heterogeneity between individual cells. One prominent problem in single-cell data analysis is the prevalence of dropouts, caused by failures in amplification during the experiments. It is necessary to develop effective approaches for imputing the missing values. Unlike general methods that impute a single type of single-cell data, we propose an imputation method called scLRTD, which uses low-rank tensor decomposition based on the nuclear norm to impute scMO-seq data and single-cell RNA-sequencing (scRNA-seq) data from different stages, tissues or conditions. Furthermore, four sets of simulated and two sets of real scRNA-seq data, from mouse embryonic stem cells and hepatocellular carcinoma, respectively, are used to carry out numerical experiments and to compare scLRTD with six other published methods. Error accuracy and clustering results demonstrate the effectiveness of the proposed method. Moreover, we clearly identify two cell subpopulations after imputing the real scMO-seq data from hepatocellular carcinoma. Further, Gene Ontology analysis identifies 7 genes in the bile secretion pathway, which is related to metabolism in hepatocellular carcinoma. Survival analysis using the TCGA database also shows that the two cell subpopulations identified after imputation have distinct survival rates.
27
Baruzzo G, Patuzzi I, Di Camillo B. Beware to ignore the rare: how imputing zero-values can improve the quality of 16S rRNA gene studies results. BMC Bioinformatics 2022; 22:618. [PMID: 35130833 PMCID: PMC8822630 DOI: 10.1186/s12859-022-04587-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2022] [Accepted: 01/27/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND 16S rRNA-gene sequencing is a valuable approach to characterize the taxonomic content of the whole bacterial population inhabiting a metabolic and spatial niche, providing an important opportunity to study bacteria and their role in many health and environmental mechanisms. The analysis of data produced by amplicon sequencing, however, brings very specific methodological issues that need to be properly addressed to obtain reliable biological conclusions. Among these, 16S count data tend to be very sparse, with many null values reflecting species that are present but went unobserved due to multiplexing constraints. However, current data workflows do not include a step in which the information about unobserved species is recovered. RESULTS In this work, we evaluate for the first time the effects of introducing a new preprocessing step, zero-imputation, into the 16S data workflow to recover this lost information. Due to the lack of published zero-imputation methods specifically designed for 16S count data, we considered a set of zero-imputation strategies available for other frameworks, and benchmarked them using in silico 16S count data reflecting different experimental designs. Additionally, we assessed the effect of combining zero-imputation and normalization, i.e. the only preprocessing step in the current 16S workflow. Overall, we benchmarked 35 16S preprocessing pipelines, assessing their ability to handle data sparsity, identify species presence/absence, recover sample proportional abundance distributions, and improve typical downstream analyses such as computation of alpha and beta diversity indices and differential abundance analysis. CONCLUSIONS The results clearly show that 16S data analysis greatly benefits from a properly performed zero-imputation step, although the choice of the right zero-imputation method plays a pivotal role. In addition, we identify a set of best-performing pipelines that could be a valuable indication for data analysts.
Affiliation(s)
- Giacomo Baruzzo
- Department of Information Engineering, University of Padova, Padua, Italy
| | - Ilaria Patuzzi
- Department of Information Engineering, University of Padova, Padua, Italy
- Microbial Ecology Unit, Istituto Zooprofilattico Sperimentale Delle Venezie, Padua, Italy
- Research & Development Division, EuBiome S.R.L., Padua, Italy
| | - Barbara Di Camillo
- Department of Information Engineering, University of Padova, Padua, Italy.
- CRIBI Biotechnology Centre, University of Padova, Padua, Italy.
- Department of Comparative Biomedicine and Food Science, University of Padova, Padua, Italy.
28
Dubey A, Rasool A. Efficient technique of microarray missing data imputation using clustering and weighted nearest neighbour. Sci Rep 2021; 11:24297. [PMID: 34934107 PMCID: PMC8692342 DOI: 10.1038/s41598-021-03438-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/20/2021] [Accepted: 11/22/2021] [Indexed: 02/03/2023]
Abstract
For most bioinformatics statistical methods, particularly for gene expression data classification, prognosis, and prediction, a complete dataset is required. A gene sample value can be missing due to hardware failure, software failure, or manual mistakes. Missing data in gene expression research dramatically affect the analysis of the collected data. Consequently, this has become a critical problem that requires an efficient imputation algorithm. This paper proposes a technique that considers the local similarity structure and predicts the missing data using clustering and top K nearest neighbor approaches. A similarity-based spectral clustering approach is used in combination with K-means. The spectral clustering parameters, cluster size, and weighting factors are optimized, and the missing values are then predicted. For imputing each cluster's missing values, the top K nearest neighbor approach uses a weighted distance. The evaluation is carried out on numerous datasets from a variety of biological areas, with experimentally inserted missing values varying from 5 to 25%. Experimental results show that the proposed imputation technique makes accurate predictions compared with other imputation procedures. The imputation experiments use microarray gene expression datasets containing information on different cancers and tumors. The main contribution of this research is to show that local similarity-based techniques can be used for imputation even when the dataset has varying dimensionality and characteristics.
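The weighted nearest-neighbour step can be sketched as follows. This is a simplified illustration, not the authors' method: it skips the spectral clustering and parameter optimization, assumes missing values are marked with `None`, uses a root-mean-square distance over the columns two rows share, and weights neighbours by inverse distance. The function name and the small epsilon are assumptions.

```python
import math


def weighted_knn_impute(rows, k=2):
    """Fill None entries using the k nearest rows, weighted by 1 / distance."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # a neighbour must have column j observed
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            if not candidates:
                continue  # nothing to borrow from; leave the gap
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            # Small epsilon avoids division by zero for identical rows.
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            filled[i][j] = (
                sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
            )
    return filled


expression = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [5.0, 6.0, 7.0],
]
print(weighted_knn_impute(expression, k=1))
```

With `k=1` the missing entry is copied from the single closest row; larger `k` blends several neighbours, with closer rows contributing more.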
Affiliation(s)
- Aditya Dubey
- Department of Computer Science & Engineering, Maulana Azad National Institute of Technology, Bhopal, 462003, India.
| | - Akhtar Rasool
- Department of Computer Science & Engineering, Maulana Azad National Institute of Technology, Bhopal, 462003, India
29
Gwark S, Ahn HS, Yeom J, Yu J, Oh Y, Jeong JH, Ahn JH, Jung KH, Kim SB, Lee HJ, Gong G, Lee SB, Chung IY, Kim HJ, Ko BS, Lee JW, Son BH, Ahn SH, Kim K, Kim J. Plasma Proteome Signature to Predict the Outcome of Breast Cancer Patients Receiving Neoadjuvant Chemotherapy. Cancers (Basel) 2021; 13:6267. [PMID: 34944885 PMCID: PMC8699627 DOI: 10.3390/cancers13246267] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/05/2021] [Revised: 12/07/2021] [Accepted: 12/10/2021] [Indexed: 12/31/2022]
Abstract
The plasma proteome of 51 non-metastatic breast cancer patients receiving neoadjuvant chemotherapy (NCT) was prospectively analyzed by high-resolution mass spectrometry coupled with nano-flow liquid chromatography using blood drawn at the time of diagnosis. Plasma proteins were identified as potential biomarkers, and their correlation with clinicopathological variables and survival outcomes was analyzed. Of 51 patients, 20 (39.2%) were HR+/HER2-, five (9.8%) were HR+/HER2+, five (9.8%) were HER2+, and 21 (41.2%) were triple-negative subtype. During a median follow-up of 52.0 months, there were 15 relapses (29.4%) and eight deaths (15.7%). Four potential biomarkers were identified among differentially expressed proteins: APOC3 had higher plasma concentrations in the pathological complete response (pCR) group, whereas MBL2, ENG, and P4HB were higher in the non-pCR group. Proteins statistically significantly associated with survival and capable of differentiating low- and high-risk groups were MBL2 and P4HB for disease-free survival, P4HB for overall survival, and MBL2 for distant metastasis-free survival (DMFS). In the multivariate analysis, only MBL2 was a consistent risk factor for DMFS (HR: 9.65, 95% CI 2.10-44.31). The results demonstrate that the proteomes from non-invasive sampling correlate with pCR and survival in breast cancer patients receiving NCT. Further investigation may clarify the role of these proteins in predicting prognosis and thus their therapeutic potential for the prevention of recurrence.
Affiliation(s)
- Sungchan Gwark
- Department of Surgery, Ewha Womans University Mokdong Hospital, Ewha Womans University College of Medicine, Seoul 07985, Korea;
| | - Hee-Sung Ahn
- Asan Institute for Life Sciences, Asan Medical Center, Seoul 05505, Korea; (H.-S.A.); (J.Y.); (Y.O.)
- Convergence Medicine Research Center, Asan Institute for Life Sciences, Asan Medical Center, Seoul 05505, Korea;
| | - Jeonghun Yeom
- Convergence Medicine Research Center, Asan Institute for Life Sciences, Asan Medical Center, Seoul 05505, Korea;
| | - Jiyoung Yu
- Asan Institute for Life Sciences, Asan Medical Center, Seoul 05505, Korea; (H.-S.A.); (J.Y.); (Y.O.)
| | - Yumi Oh
- Asan Institute for Life Sciences, Asan Medical Center, Seoul 05505, Korea; (H.-S.A.); (J.Y.); (Y.O.)
- Department of Biomedical Sciences, University of Ulsan College of Medicine, Seoul 05505, Korea
| | - Jae Ho Jeong
- Department of Oncology, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (J.H.J.); (J.-H.A.); (K.H.J.); (S.-B.K.)
| | - Jin-Hee Ahn
- Department of Oncology, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (J.H.J.); (J.-H.A.); (K.H.J.); (S.-B.K.)
| | - Kyung Hae Jung
- Department of Oncology, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (J.H.J.); (J.-H.A.); (K.H.J.); (S.-B.K.)
| | - Sung-Bae Kim
- Department of Oncology, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (J.H.J.); (J.-H.A.); (K.H.J.); (S.-B.K.)
| | - Hee Jin Lee
- Department of Pathology, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (H.J.L.); (G.G.)
| | - Gyungyub Gong
- Department of Pathology, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (H.J.L.); (G.G.)
| | - Sae Byul Lee
- Department of Surgery, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (S.B.L.); (I.Y.C.); (H.J.K.); (B.S.K.); (J.W.L.); (B.H.S.); (S.H.A.)
| | - Il Yong Chung
- Department of Surgery, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (S.B.L.); (I.Y.C.); (H.J.K.); (B.S.K.); (J.W.L.); (B.H.S.); (S.H.A.)
| | - Hee Jeong Kim
- Department of Surgery, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (S.B.L.); (I.Y.C.); (H.J.K.); (B.S.K.); (J.W.L.); (B.H.S.); (S.H.A.)
| | - Beom Seok Ko
- Department of Surgery, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (S.B.L.); (I.Y.C.); (H.J.K.); (B.S.K.); (J.W.L.); (B.H.S.); (S.H.A.)
| | - Jong Won Lee
- Department of Surgery, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (S.B.L.); (I.Y.C.); (H.J.K.); (B.S.K.); (J.W.L.); (B.H.S.); (S.H.A.)
| | - Byung Ho Son
- Department of Surgery, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (S.B.L.); (I.Y.C.); (H.J.K.); (B.S.K.); (J.W.L.); (B.H.S.); (S.H.A.)
| | - Sei Hyun Ahn
- Department of Surgery, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (S.B.L.); (I.Y.C.); (H.J.K.); (B.S.K.); (J.W.L.); (B.H.S.); (S.H.A.)
| | - Kyunggon Kim
- Asan Institute for Life Sciences, Asan Medical Center, Seoul 05505, Korea; (H.-S.A.); (J.Y.); (Y.O.)
- Convergence Medicine Research Center, Asan Institute for Life Sciences, Asan Medical Center, Seoul 05505, Korea;
- Department of Biomedical Sciences, University of Ulsan College of Medicine, Seoul 05505, Korea
- Clinical Proteomics Core Laboratory, Convergence Medicine Research Center, Asan Medical Center, Seoul 05505, Korea
- Bio-Medical Institute of Technology, Asan Medical Center, Seoul 05505, Korea
| | - Jisun Kim
- Department of Surgery, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Korea; (S.B.L.); (I.Y.C.); (H.J.K.); (B.S.K.); (J.W.L.); (B.H.S.); (S.H.A.)
30
Abstract
Network modeling transforms data into a structure of nodes and edges such that edges represent relationships between pairs of objects, then extracts clusters of densely connected nodes in order to capture high-dimensional relationships hidden in the data. This efficient and flexible strategy holds potential for unveiling complex patterns concealed within massive datasets, but standard implementations overlook several key issues that can undermine research efforts. These issues range from data imputation and discretization to correlation metrics, clustering methods, and validation of results. Here, we enumerate these pitfalls and provide practical strategies for alleviating their negative effects. These guidelines increase prospects for future research endeavors as they reduce type I and type II (false-positive and false-negative) errors and are generally applicable for network modeling applications across diverse domains.
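The node-and-edge construction described above can be sketched in a few lines. The choices here are assumptions made for illustration: edges are drawn where the absolute Pearson correlation meets a fixed threshold, and connected components stand in for the "clusters of densely connected nodes", which is a much cruder grouping than the clustering methods the article discusses. All names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_network(profiles, threshold):
    """Connect two objects when |correlation| >= threshold."""
    names = sorted(profiles)
    edges = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(profiles[a], profiles[b])) >= threshold:
                edges[a].add(b)
                edges[b].add(a)
    return edges


def connected_components(edges):
    """Group nodes that are reachable from one another."""
    seen, components = set(), []
    for start in sorted(edges):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(edges[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


profiles = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 1.0, 3.0, 2.0],
}
network = correlation_network(profiles, threshold=0.9)
print(connected_components(network))
```

The threshold and the correlation metric are exactly the kind of choices the article flags as pitfalls, so in practice they would need justification and validation rather than fixed defaults.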
Affiliation(s)
- Sharlee Climer
- Department of Computer Science, University of Missouri – St. Louis, St. Louis, MO, USA
31
Zimmermann R, Lang S, Lerner M, Förster F, Nguyen D, Helms V, Schrul B. Quantitative Proteomics and Differential Protein Abundance Analysis after the Depletion of PEX3 from Human Cells Identifies Additional Aspects of Protein Targeting to the ER. Int J Mol Sci 2021; 22:ijms222313028. [PMID: 34884833 PMCID: PMC8658024 DOI: 10.3390/ijms222313028] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 09/29/2021] [Revised: 11/19/2021] [Accepted: 11/29/2021] [Indexed: 12/12/2022]
Abstract
Protein import into the endoplasmic reticulum (ER) is the first step in the biogenesis of around 10,000 different soluble and membrane proteins in humans. It involves the co- or post-translational targeting of precursor polypeptides to the ER, and their subsequent membrane insertion or translocation. So far, three pathways for the ER targeting of precursor polypeptides and four pathways for the ER targeting of mRNAs have been described. Typically, these pathways deliver their substrates to the Sec61 polypeptide-conducting channel in the ER membrane. Next, the precursor polypeptides are inserted into the ER membrane or translocated into the ER lumen, which may involve auxiliary translocation components, such as the TRAP and Sec62/Sec63 complexes, or auxiliary membrane protein insertases, such as EMC and the TMCO1 complex. Recently, the PEX19/PEX3-dependent pathway, which has a well-known function in targeting and inserting various peroxisomal membrane proteins into pre-existent peroxisomal membranes, was also found to act in the targeting and, putatively, insertion of monotopic hairpin proteins into the ER. These either remain in the ER as resident ER membrane proteins, or are pinched off from the ER as components of new lipid droplets. Therefore, the question arose as to whether this pathway may play a more general role in ER protein targeting, i.e., whether it represents a fourth pathway for the ER targeting of precursor polypeptides. Thus, we addressed the client spectrum of the PEX19/PEX3-dependent pathway in both PEX3-depleted HeLa cells and PEX3-deficient Zellweger patient fibroblasts by an established approach which involved the label-free quantitative mass spectrometry of the total proteome of depleted or deficient cells, as well as differential protein abundance analysis. 
The negatively affected proteins included twelve peroxisomal proteins and two hairpin proteins of the ER, thus confirming two previously identified classes of putative PEX19/PEX3 clients in human cells. Interestingly, fourteen collagen-related proteins with signal peptides or N-terminal transmembrane helices belonging to the secretory pathway were also negatively affected by PEX3 deficiency, which may suggest compromised collagen biogenesis as a hitherto-unknown contributor to organ failures in the respective Zellweger patients.
Affiliation(s)
- Richard Zimmermann
- Medical Biochemistry and Molecular Biology, Saarland University, 66421 Homburg, Germany; (S.L.); (M.L.)
- Correspondence: (R.Z.); (B.S.)
| | - Sven Lang
- Medical Biochemistry and Molecular Biology, Saarland University, 66421 Homburg, Germany; (S.L.); (M.L.)
| | - Monika Lerner
- Medical Biochemistry and Molecular Biology, Saarland University, 66421 Homburg, Germany; (S.L.); (M.L.)
| | - Friedrich Förster
- Bijvoet Center for Biomolecular Research, Utrecht University, 3584 CH Utrecht, The Netherlands;
| | - Duy Nguyen
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66041 Saarbrücken, Germany; (D.N.); (V.H.)
| | - Volkhard Helms
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66041 Saarbrücken, Germany; (D.N.); (V.H.)
| | - Bianca Schrul
- Medical Biochemistry and Molecular Biology, Saarland University, 66421 Homburg, Germany; (S.L.); (M.L.)
- Correspondence: (R.Z.); (B.S.)
32
Wang J, Zou Q, Lin C. A comparison of deep learning-based pre-processing and clustering approaches for single-cell RNA sequencing data. Brief Bioinform 2021; 23:6361043. [PMID: 34472590 DOI: 10.1093/bib/bbab345] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 06/26/2021] [Revised: 07/22/2021] [Accepted: 08/04/2021] [Indexed: 11/13/2022]
Abstract
The emergence of single-cell RNA sequencing has facilitated the study of genomes, transcriptomes and proteomes. As single-cell RNA-seq datasets are released continuously, one of the major challenges facing traditional RNA analysis tools is the high-dimensional, high-sparsity, high-noise and large-scale nature of single-cell RNA-seq data. Deep learning technologies match the characteristics of single-cell RNA-seq data perfectly and offer unprecedented promise. Here, we give a systematic review of the most popular single-cell RNA-seq analysis methods and tools based on deep learning models, covering data preprocessing (quality control, normalization, data correction, dimensionality reduction and data visualization) and clustering tasks for downstream analysis. We further quantitatively evaluate the deep model-based methods for data correction and clustering on 11 gold standard datasets. Moreover, we discuss the data preferences of these methods and their limitations, and give some suggestions and guidance for users to select appropriate methods and tools.
Affiliation(s)
- Jiacheng Wang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Quan Zou
- School of Informatics, Xiamen University, Xiamen, China
| | - Chen Lin
- School of Informatics, Xiamen University, Xiamen, China
33
Wu Y, Qie R, Cheng M, Zeng Y, Huang S, Guo C, Zhou Q, Li Q, Tian G, Han M, Zhang Y, Wu X, Li Y, Zhao Y, Yang X, Feng Y, Liu D, Qin P, Hu D, Hu F, Xu L, Zhang M. Air pollution and DNA methylation in adults: A systematic review and meta-analysis of observational studies. ENVIRONMENTAL POLLUTION (BARKING, ESSEX : 1987) 2021; 284:117152. [PMID: 33895575 DOI: 10.1016/j.envpol.2021.117152] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 09/11/2020] [Revised: 04/04/2021] [Accepted: 04/05/2021] [Indexed: 05/24/2023]
Abstract
This systematic review and meta-analysis aimed to investigate the association between air pollution and DNA methylation in adults from published observational studies. PubMed, Web of Science and Embase databases were systematically searched for available studies on the association between air pollution and DNA methylation published up to March 9, 2021. Three DNA methylation approaches were considered: global methylation, candidate-gene, and epigenome-wide association studies (EWAS). Meta-analysis was used to summarize the combined estimates for the association between air pollutants and global DNA methylation levels. Heterogeneity was assessed with the Cochran Q test and quantified with the I² statistic. In total, 38 articles were included in this study: 16 using global methylation, 18 using candidate genes, and 11 using EWAS, with 7 studies using more than one approach. Meta-analysis revealed an imprecise but inverse association between exposure to PM2.5 and global DNA methylation (for each 10-μg/m³ PM2.5, combined estimate: −0.39; 95% confidence interval: −0.97 to 0.19). The candidate-gene results were consistent for the ERCC3 and SOX2 genes, suggesting hypermethylation in ERCC3 associated with benzene and that in SOX2 associated with PM2.5 exposure. EWAS identified 201 CpG sites and 148 differentially methylated regions that showed differential methylation associated with air pollution. Among the 307 genes investigated in 11 EWAS, a locus in the nucleoredoxin gene was found to be positively associated with PM2.5 in two studies. The current meta-analysis indicates that PM2.5 is imprecisely and inversely associated with DNA methylation. The candidate-gene results consistently suggest hypermethylation in ERCC3 associated with benzene exposure and that in SOX2 associated with PM2.5 exposure. 
The Kyoto Encyclopedia of Genes and Genomes (KEGG) network analyses revealed that these genes were associated with African trypanosomiasis, Malaria, Antifolate resistance, Graft-versus-host disease, and so on. More evidence is needed to clarify the association between air pollution and DNA methylation.
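The heterogeneity measures named in the abstract, the Cochran Q test statistic and the I² statistic, follow standard formulas that can be sketched as below. The inputs are made-up study estimates and standard errors, not data from the review, and the fixed-effect inverse-variance pooling shown here is an assumption about the weighting; the function name is hypothetical.

```python
def cochran_q_i2(estimates, std_errors):
    """Return (pooled estimate, Cochran Q, I-squared in percent).

    Uses inverse-variance weights w_i = 1 / se_i**2,
    Q = sum(w_i * (y_i - pooled)**2), and
    I2 = max(0, (Q - df) / Q) * 100 with df = number of studies - 1.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i2


# Two hypothetical studies with equal precision.
pooled, q, i2 = cochran_q_i2([0.0, 1.0], [0.5, 0.5])
print(pooled, q, i2)
```

I² expresses the share of total variation across studies that exceeds what chance alone would produce, so a value near zero means the studies are about as consistent as their standard errors allow.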
Affiliation(s)
- Yuying Wu
- Department of Biostatistics and Epidemiology, School of Public Health, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China; Guangdong Provincial Key Laboratory of Regional Immunity and Diseases, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China
| | - Ranran Qie
- Department of Epidemiology and Biostatistics, College of Public Health, Zhengzhou University, Zhengzhou, Henan, People's Republic of China
| | - Min Cheng
- Department of Cardiology, Shenzhen Second People's Hospital, The First Affiliated Hospital of Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China
| | - Yunhong Zeng
- Center for Health Management, The Affiliated Shenzhen Hospital of University of Chinese Academy of Sciences, Shenzhen, Guangdong, People's Republic of China
| | - Shengbing Huang
- Department of Epidemiology and Biostatistics, College of Public Health, Zhengzhou University, Zhengzhou, Henan, People's Republic of China
| | - Chunmei Guo
- Department of Epidemiology and Biostatistics, College of Public Health, Zhengzhou University, Zhengzhou, Henan, People's Republic of China
| | - Qionggui Zhou
- Department of Biostatistics and Epidemiology, School of Public Health, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China; Guangdong Provincial Key Laboratory of Regional Immunity and Diseases, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China
| | - Quanman Li
- Department of Epidemiology and Biostatistics, College of Public Health, Zhengzhou University, Zhengzhou, Henan, People's Republic of China
| | - Gang Tian
- Department of Epidemiology and Biostatistics, College of Public Health, Zhengzhou University, Zhengzhou, Henan, People's Republic of China
| | - Minghui Han
- Department of Epidemiology and Biostatistics, College of Public Health, Zhengzhou University, Zhengzhou, Henan, People's Republic of China
| | - Yanyan Zhang
- Department of Biostatistics and Epidemiology, School of Public Health, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China; Guangdong Provincial Key Laboratory of Regional Immunity and Diseases, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China
| | - Xiaoyan Wu
- Department of Biostatistics and Epidemiology, School of Public Health, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China; Guangdong Provincial Key Laboratory of Regional Immunity and Diseases, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China
| | - Yang Li
- Department of Biostatistics and Epidemiology, School of Public Health, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China; Guangdong Provincial Key Laboratory of Regional Immunity and Diseases, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China
| | - Yang Zhao
- Department of Epidemiology and Biostatistics, College of Public Health, Zhengzhou University, Zhengzhou, Henan, People's Republic of China
| | - Xingjin Yang
- Department of Epidemiology and Biostatistics, College of Public Health, Zhengzhou University, Zhengzhou, Henan, People's Republic of China
| | - Yifei Feng
- Department of Epidemiology and Biostatistics, College of Public Health, Zhengzhou University, Zhengzhou, Henan, People's Republic of China
| | - Dechen Liu
- Department of Epidemiology and Biostatistics, College of Public Health, Zhengzhou University, Zhengzhou, Henan, People's Republic of China
| | - Pei Qin
- Department of Biostatistics and Epidemiology, School of Public Health, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China; Guangdong Provincial Key Laboratory of Regional Immunity and Diseases, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China
| | - Dongsheng Hu
- Department of Biostatistics and Epidemiology, School of Public Health, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China; Guangdong Provincial Key Laboratory of Regional Immunity and Diseases, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China; Department of Epidemiology and Biostatistics, College of Public Health, Zhengzhou University, Zhengzhou, Henan, People's Republic of China
| | - Fulan Hu
- Department of Biostatistics and Epidemiology, School of Public Health, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China; Guangdong Provincial Key Laboratory of Regional Immunity and Diseases, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China
| | - Lidan Xu
- Department of Nutrition, The Second Affiliated Hospital, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China
| | - Ming Zhang
- Department of Biostatistics and Epidemiology, School of Public Health, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China; Guangdong Provincial Key Laboratory of Regional Immunity and Diseases, Shenzhen University Health Science Center, Shenzhen, Guangdong, People's Republic of China.
|
34
|
Nguyen T, Nguyen DH, Nguyen H, Nguyen BT, Wade BA. EPEM: Efficient Parameter Estimation for Multiple Class Monotone Missing Data. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.02.077] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/30/2022]
|
35
|
Rahmatbakhsh M, Gagarinova A, Babu M. Bioinformatic Analysis of Temporal and Spatial Proteome Alternations During Infections. Front Genet 2021; 12:667936. [PMID: 34276775 PMCID: PMC8283032 DOI: 10.3389/fgene.2021.667936] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 02/15/2021] [Accepted: 06/08/2021] [Indexed: 12/13/2022] Open
Abstract
Microbial pathogens have evolved numerous mechanisms to hijack host systems, thus causing disease. This is mediated by alterations in the combined host-pathogen proteome in time and space. Mass spectrometry-based proteomics approaches have been developed and tailored to map disease progression. The result is complex multidimensional data that pose numerous analytic challenges for downstream interpretation. However, a systematic review of approaches for the downstream analysis of such data has been lacking in the field. In this review, we detail the steps of a typical temporal and spatial analysis, including data pre-processing steps (i.e., quality control, data normalization, the imputation of missing values, and dimensionality reduction), different statistical and machine learning approaches, validation, interpretation, and the extraction of biological information from mass spectrometry data. We also discuss current best practices for these steps based on a collection of independent studies to guide users in selecting the most suitable strategies for their dataset and analysis objectives. We also compiled a list of commonly used R software packages for each step of the analysis, which can be easily integrated into one's analysis pipeline. Furthermore, we guide readers through various analysis steps by applying these workflows to mock and host-pathogen interaction data from public datasets. The workflows presented in this review will serve as an introduction for data analysis novices, while also helping established users update their data analysis pipelines. We conclude the review by discussing future directions and developments in temporal and spatial proteomics and data analysis approaches. The data analysis code prepared for this review is available from https://github.com/BabuLab-UofR/TempSpac, where guidelines and sample datasets are also offered for testing purposes.
Affiliation(s)
| | - Alla Gagarinova
- Department of Biochemistry, Microbiology, & Immunology, University of Saskatchewan, Saskatoon, SK, Canada
| | - Mohan Babu
- Department of Biochemistry, University of Regina, Regina, SK, Canada
|
36
|
Faisal S, Tutz G. Imputation methods for high-dimensional mixed-type datasets by nearest neighbors. Comput Biol Med 2021; 135:104577. [PMID: 34216892 DOI: 10.1016/j.compbiomed.2021.104577] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 02/22/2021] [Revised: 06/10/2021] [Accepted: 06/11/2021] [Indexed: 11/18/2022]
Abstract
In modern biomedical research, the data often contain a large number of variables of mixed data types (continuous, multi-categorical, or binary) but on some variables observations are missing. Imputation is a common solution when the downstream analyses require a complete data matrix. Several imputation methods are available that work under specific distributional assumptions. We propose an improvement over the popular non-parametric nearest neighbor imputation method which requires no particular assumptions. The proposed method makes practical and effective use of the information on the association among the variables. In particular, we propose a weighted version of the Lq distance for mixed-type data, which uses the information from a subset of important variables only. The performance of the proposed method is investigated using a variety of simulated and real data from different areas of application. The results show that the proposed methods yield smaller imputation error and better performance when compared to other approaches. It is also shown that the proposed imputation method works efficiently even when the number of samples is smaller than the number of variables.
Affiliation(s)
- Shahla Faisal
- Government College University Faisalabad, Pakistan; Ludwig-Maximilians-Universität München, Germany.
|
37
|
Bhadra P, Schorr S, Lerner M, Nguyen D, Dudek J, Förster F, Helms V, Lang S, Zimmermann R. Quantitative Proteomics and Differential Protein Abundance Analysis after Depletion of Putative mRNA Receptors in the ER Membrane of Human Cells Identifies Novel Aspects of mRNA Targeting to the ER. Molecules 2021; 26:3591. [PMID: 34208277 PMCID: PMC8230838 DOI: 10.3390/molecules26123591] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 05/12/2021] [Revised: 06/07/2021] [Accepted: 06/09/2021] [Indexed: 11/28/2022] Open
Abstract
In human cells, one-third of all polypeptides enter the secretory pathway at the endoplasmic reticulum (ER). The specificity and efficiency of this process are guaranteed by targeting of mRNAs and/or polypeptides to the ER membrane. Cytosolic SRP and its receptor in the ER membrane facilitate the cotranslational targeting of most ribosome-nascent precursor polypeptide chain (RNC) complexes together with the respective mRNAs to the Sec61 complex in the ER membrane. Alternatively, fully synthesized precursor polypeptides are targeted to the ER membrane post-translationally by either the TRC, SND, or PEX19/3 pathway. Furthermore, there is targeting of mRNAs to the ER membrane, which does not involve SRP but involves mRNA- or RNC-binding proteins on the ER surface, such as RRBP1 or KTN1. Traditionally, the targeting reactions were studied in cell-free or cellular assays, which focus on a single precursor polypeptide and allow the conclusion of whether a certain precursor can use a certain pathway. Recently, cellular approaches such as proximity-based ribosome profiling or quantitative proteomics were employed to address the question of which precursors use certain pathways under physiological conditions. Here, we combined siRNA-mediated depletion of putative mRNA receptors in HeLa cells with label-free quantitative proteomics and differential protein abundance analysis to characterize RRBP1- or KTN1-involving precursors and to identify possible genetic interactions between the various targeting pathways. Furthermore, we discuss the possible implications on the so-called TIGER domains and critically discuss the pros and cons of this experimental approach.
Affiliation(s)
- Pratiti Bhadra
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66041 Saarbrücken, Germany; (P.B.); (D.N.); (V.H.)
| | - Stefan Schorr
- Medical Biochemistry and Molecular Biology, Saarland University, 66421 Homburg, Germany; (S.S.); (M.L.); (J.D.); (S.L.)
| | - Monika Lerner
- Medical Biochemistry and Molecular Biology, Saarland University, 66421 Homburg, Germany; (S.S.); (M.L.); (J.D.); (S.L.)
| | - Duy Nguyen
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66041 Saarbrücken, Germany; (P.B.); (D.N.); (V.H.)
| | - Johanna Dudek
- Medical Biochemistry and Molecular Biology, Saarland University, 66421 Homburg, Germany; (S.S.); (M.L.); (J.D.); (S.L.)
| | - Friedrich Förster
- Bijvoet Center for Biomolecular Research, Utrecht University, 3584 CH Utrecht, The Netherlands;
| | - Volkhard Helms
- Center for Bioinformatics, Saarland Informatics Campus, Saarland University, 66041 Saarbrücken, Germany; (P.B.); (D.N.); (V.H.)
| | - Sven Lang
- Medical Biochemistry and Molecular Biology, Saarland University, 66421 Homburg, Germany; (S.S.); (M.L.); (J.D.); (S.L.)
| | - Richard Zimmermann
- Medical Biochemistry and Molecular Biology, Saarland University, 66421 Homburg, Germany; (S.S.); (M.L.); (J.D.); (S.L.)
|
38
|
Dabke K, Kreimer S, Jones MR, Parker SJ. A Simple Optimization Workflow to Enable Precise and Accurate Imputation of Missing Values in Proteomic Data Sets. J Proteome Res 2021; 20:3214-3229. [PMID: 33939434 DOI: 10.1021/acs.jproteome.1c00070] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 01/07/2023]
Abstract
Missing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, no single imputation method is best suited for a diverse range of data sets, and no clear strategy exists for evaluating imputation methods for clinical DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a strategy to assess imputation methods on clinical label-free DIA-MS data sets. We used three DIA-MS data sets with real missing values to evaluate eight imputation methods with multiple parameters at different levels of protein quantification: a dilution series data set, a small pilot data set, and a clinical proteomic data set comparing paired tumor and stroma tissue. We found that imputation methods based on local structures within the data, like local least-squares (LLS) and random forest (RF), worked well in our dilution series data set, whereas imputation methods based on global structures within the data, like BPCA, performed well in the other two data sets. We also found that imputation at the most basic protein quantification level-fragment level-improved accuracy and the number of proteins quantified. With this analytical framework, we quickly and cost-effectively evaluated different imputation methods using two smaller complementary data sets to narrow down to the larger proteomic data set's most accurate methods. This acquisition strategy allowed us to provide reproducible evidence of the accuracy of the imputation method, even in the absence of a ground truth. Overall, this study indicates that the most suitable imputation method relies on the overall structure of the data set and provides an example of an analytic framework that may assist in identifying the most appropriate imputation strategies for the differential analysis of proteins.
Affiliation(s)
- Kruttika Dabke
- Center for Bioinformatics and Functional Genomics, Department of Biomedical Science, Cedars-Sinai Medical Center, Los Angeles, California 90048, United States; Graduate Program in Biomedical Sciences, Department of Biomedical Science, Cedars-Sinai Medical Center, Los Angeles, California 90048, United States
| | - Simion Kreimer
- Advanced Clinical Biosystems Research Institute, Smidt Heart Institute, Departments of Cardiology and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, California 90048, United States
| | - Michelle R Jones
- Center for Bioinformatics and Functional Genomics, Department of Biomedical Science, Cedars-Sinai Medical Center, Los Angeles, California 90048, United States
| | - Sarah J Parker
- Advanced Clinical Biosystems Research Institute, Smidt Heart Institute, Departments of Cardiology and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, California 90048, United States
|
39
|
Zhu X, Wang J, Sun B, Ren C, Yang T, Ding J. An efficient ensemble method for missing value imputation in microarray gene expression data. BMC Bioinformatics 2021; 22:188. [PMID: 33849444 PMCID: PMC8045198 DOI: 10.1186/s12859-021-04109-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/30/2020] [Accepted: 03/29/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Genomics data analysis has been widely used to study disease genes and drug targets. However, the existence of missing values in genomics datasets poses a significant problem, which severely hinders the use of genomics data. Current imputation methods based on a single learner often explore less of the known genomic data information for imputation and thus lose imputation performance. RESULTS: In this study, multiple single imputation methods are combined into one imputation method by ensemble learning. In the ensemble method, bootstrap sampling is applied for predictions of missing values by each component method, and these predictions are weighted and summed to produce the final prediction. The optimal weights are learned from known gene data by minimizing a cost function of the imputation error, and the expression for the optimal weights is derived in closed form. Additionally, the performance of the ensemble method is analytically investigated in terms of the sum of squared regression errors. The proposed method is simulated on several typical genomic datasets and compared with state-of-the-art imputation methods at different noise levels, sample sizes and data missing rates. Experimental results show that the proposed method achieves improved imputation performance in terms of imputation accuracy, robustness and generalization. CONCLUSION: The ensemble method achieves superior imputation performance since it can make use of known data information more efficiently for missing data imputation by integrating diverse imputation methods and learning the integration weights in a data-driven way.
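The weighting step described in this abstract can be illustrated with a minimal sketch. It assumes only two component imputers, unconstrained least-squares weights, and no bootstrap sampling; the function name `fit_two_weights` and the toy numbers are hypothetical and are not taken from the paper, whose cost function may differ.

```python
def fit_two_weights(pred_a, pred_b, truth):
    """Closed-form least-squares weights for two component imputers.

    Solves the 2x2 normal equations so that w_a*pred_a + w_b*pred_b
    is as close as possible (in squared error) to the known values.
    """
    saa = sum(a * a for a in pred_a)
    sbb = sum(b * b for b in pred_b)
    sab = sum(a * b for a, b in zip(pred_a, pred_b))
    say = sum(a * y for a, y in zip(pred_a, truth))
    sby = sum(b * y for b, y in zip(pred_b, truth))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("component predictions are collinear")
    w_a = (say * sbb - sby * sab) / det
    w_b = (sby * saa - say * sab) / det
    return w_a, w_b


def sse(pred, truth):
    """Sum of squared errors between predictions and known values."""
    return sum((p - y) ** 2 for p, y in zip(pred, truth))


# Hypothetical known entries that were hidden and then predicted
# by two different imputers.
truth = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.2, 1.8, 3.3, 3.9]
pred_b = [0.7, 2.4, 2.6, 4.3]

w_a, w_b = fit_two_weights(pred_a, pred_b, truth)
ensemble = [w_a * a + w_b * b for a, b in zip(pred_a, pred_b)]
```

On the entries used for fitting, the weighted combination cannot have a larger squared error than either component alone, because the weight pairs (1, 0) and (0, 1) are among the candidates the least-squares fit considers.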
Affiliation(s)
- Xinshan Zhu
- School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China; State Key Laboratory of Digital Publishing Technology, Beijing, 100871, China
| | - Jiayu Wang
- School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China
| | - Biao Sun
- School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China.
| | - Chao Ren
- School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China
| | - Ting Yang
- School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China
| | - Jie Ding
- China Institute of FTZ Supply Chain, Shanghai Maritime University, Shanghai, 201306, China
|
40
|
Pham TH, Qiu Y, Zeng J, Xie L, Zhang P. A deep learning framework for high-throughput mechanism-driven phenotype compound screening and its application to COVID-19 drug repurposing. Nat Mach Intell 2021; 3:247-257. [PMID: 33796820 PMCID: PMC8009091 DOI: 10.1038/s42256-020-00285-9] [Citation(s) in RCA: 102] [Impact Index Per Article: 25.5] [Received: 05/20/2020] [Accepted: 12/15/2020] [Indexed: 12/15/2022]
Abstract
Phenotype-based compound screening has advantages over target-based drug discovery, but is unscalable and lacks understanding of mechanism. Chemical-induced gene expression profiles provide a mechanistic signature of phenotypic response. However, the use of such data is limited by their sparseness, unreliability, and relatively low throughput. Few methods can perform phenotype-based de novo chemical compound screening. Here, we propose a mechanism-driven neural network-based method, DeepCE, which utilizes a graph neural network and a multi-head attention mechanism to model chemical substructure-gene and gene-gene associations, for predicting the differential gene expression profile perturbed by de novo chemicals. Moreover, we propose a novel data augmentation method which extracts useful information from unreliable experiments in the L1000 dataset. The experimental results show that DeepCE achieves superior performance to state-of-the-art methods. The effectiveness of gene expression profiles generated from DeepCE is further supported by comparing them with observed data for downstream classification tasks. To demonstrate the value of DeepCE, we apply it to drug repurposing of COVID-19, and generate novel lead compounds consistent with clinical evidence. Thus, DeepCE provides a potentially powerful framework for robust predictive modeling by utilizing noisy omics data and screening novel chemicals for the modulation of a systemic response to disease.
Affiliation(s)
- Thai-Hoang Pham
- Department of Computer Science and Engineering, The Ohio State University, Columbus, 43210, USA
| | - Yue Qiu
- Ph.D. Program in Biology, The Graduate Center, The City University of New York, New York, 10016, USA
| | - Jucheng Zeng
- Department of Biomedical Informatics, The Ohio State University, Columbus, 43210, USA
| | - Lei Xie
- Ph.D. Program in Biology, The Graduate Center, The City University of New York, New York, 10016, USA
- Department of Computer Science, Hunter College, The City University of New York, New York, 10065, USA
- Ph.D. Program in Computer Science and Biochemistry, The Graduate Center, The City University of New York, New York, 10016, USA
- Helen and Robert Appel Alzheimer’s Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, Cornell University, New York, 10021, USA
| | - Ping Zhang
- Department of Computer Science and Engineering, The Ohio State University, Columbus, 43210, USA
- Department of Biomedical Informatics, The Ohio State University, Columbus, 43210, USA
- Translational Data Analytics Institute, The Ohio State University, Columbus, 43210, USA
|
41
|
A comparative study of evaluating missing value imputation methods in label-free proteomics. Sci Rep 2021; 11:1760. [PMID: 33469060 PMCID: PMC7815892 DOI: 10.1038/s41598-021-81279-4] [Citation(s) in RCA: 67] [Impact Index Per Article: 16.8] [Received: 09/03/2020] [Accepted: 12/31/2020] [Indexed: 12/29/2022] Open
Abstract
The presence of missing values (MVs) in label-free quantitative proteomics greatly reduces the completeness of data. Imputation has been widely utilized to handle MVs, and selection of the proper method is critical for the accuracy and reliability of imputation. Here we present a comparative study that evaluates the performance of seven popular imputation methods with a large-scale benchmark dataset and an immune cell dataset. Simulated MVs were incorporated into the complete part of each dataset with different combinations of MV rates and missing not at random (MNAR) rates. Normalized root mean square error (NRMSE) was applied to evaluate the accuracy of protein abundances and intergroup protein ratios after imputation. Detection of true positives (TPs) and false altered-protein discovery rate (FADR) between groups were also compared using the benchmark dataset. Furthermore, the accuracy of handling real MVs was assessed by comparing enriched pathways and signature genes of cell activation after imputing the immune cell dataset. We observed that the accuracy of imputation is primarily affected by the MNAR rate rather than the MV rate, and downstream analysis can be largely impacted by the selection of imputation methods. A random forest-based imputation method consistently outperformed other popular methods by achieving the lowest NRMSE, high amount of TPs with the average FADR < 5%, and the best detection of relevant pathways and signature genes, highlighting it as the most suitable method for label-free proteomics.
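The evaluation scheme in this abstract (hide known values, impute them, score with NRMSE) can be sketched as follows. This is an illustration under stated assumptions: NRMSE is taken as the square root of the mean squared error divided by the variance of the true values, which is one common definition and may not match the study's exact formula; values are masked completely at random rather than with a controlled MNAR rate; and mean imputation stands in for the seven methods the study compared. All function names and numbers are hypothetical.

```python
import math
import random


def nrmse(true_vals, imputed_vals):
    """sqrt(mean squared error / variance of the true values)."""
    n = len(true_vals)
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    mean_t = sum(true_vals) / n
    var_t = sum((t - mean_t) ** 2 for t in true_vals) / n
    return math.sqrt(mse / var_t)


def mask_at_random(values, mv_rate, rng):
    """Copy `values`, set about mv_rate of the entries to None,
    and return the masked copy with the masked indices."""
    k = max(1, round(mv_rate * len(values)))
    idx = sorted(rng.sample(range(len(values)), k))
    masked = list(values)
    for j in idx:
        masked[j] = None
    return masked, idx


def mean_impute(values):
    """Placeholder imputer: fill every None with the observed mean."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


rng = random.Random(0)  # fixed seed so the masking is reproducible
complete = [20.1, 21.4, 19.8, 22.3, 20.9, 21.7, 19.5, 22.0, 20.4, 21.1]
masked, idx = mask_at_random(complete, 0.3, rng)
imputed = mean_impute(masked)

# Score only the entries that were hidden, since those are the ones imputed.
score = nrmse([complete[j] for j in idx], [imputed[j] for j in idx])
```

Swapping `mean_impute` for another imputer and repeating over several masking rates gives a simple comparison loop; a lower NRMSE means the imputed values are closer to the hidden truth.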
|
42
|
Mancuso CA, Canfield JL, Singla D, Krishnan A. A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes. Nucleic Acids Res 2020; 48:e125. [PMID: 33074331 PMCID: PMC7708069 DOI: 10.1093/nar/gkaa881] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/09/2020] [Revised: 08/24/2020] [Accepted: 09/28/2020] [Indexed: 12/15/2022] Open
Abstract
While there are >2 million publicly-available human microarray gene-expression profiles, these profiles were measured using a variety of platforms that each cover a pre-defined, limited set of genes. Therefore, key to reanalyzing and integrating this massive data collection are methods that can computationally reconstitute the complete transcriptome in partially-measured microarray samples by imputing the expression of unmeasured genes. Current state-of-the-art imputation methods are tailored to samples from a specific platform and rely on gene-gene relationships regardless of the biological context of the target sample. We show that sparse regression models that capture sample-sample relationships (termed SampleLASSO), built on-the-fly for each new target sample to be imputed, outperform models based on fixed gene relationships. Extensive evaluation involving three machine learning algorithms (LASSO, k-nearest-neighbors, and deep-neural-networks), two gene subsets (GPL96–570 and LINCS), and multiple imputation tasks (within and across microarray/RNA-seq datasets) establishes that SampleLASSO is the most accurate model. Additionally, we demonstrate the biological interpretability of this method by showing that, for imputing a target sample from a certain tissue, SampleLASSO automatically leverages training samples from the same tissue. Thus, SampleLASSO is a simple, yet powerful and flexible approach for harmonizing large-scale gene-expression data.
Affiliation(s)
- Christopher A Mancuso
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
| | - Jacob L Canfield
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA; Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
| | - Deepak Singla
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA; Indian Institute of Technology, Delhi, India
| | - Arjun Krishnan
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA; Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
|
43
|
An Exploratory Pilot Study with Plasma Protein Signatures Associated with Response of Patients with Depression to Antidepressant Treatment for 10 Weeks. Biomedicines 2020; 8:455. [PMID: 33126421 PMCID: PMC7692261 DOI: 10.3390/biomedicines8110455] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 10/21/2020] [Revised: 10/26/2020] [Accepted: 10/26/2020] [Indexed: 12/11/2022] Open
Abstract
Major depressive disorder (MDD) is a leading cause of global disability with a chronic and recurrent course. Recognition of biological markers that could predict and monitor response to drug treatment could personalize clinical decision-making, minimize unnecessary drug exposure, and achieve better outcomes. Four longitudinal plasma samples were collected from each of ten patients with MDD treated with antidepressants for 10 weeks. Plasma proteins were analyzed qualitatively and quantitatively with a nanoflow LC−MS/MS technique. Of 1153 proteins identified in the 40 longitudinal plasma samples, 37 proteins were significantly associated with response/time and were clustered into six groups according to time and response by the linear mixed model. Among them, three early-drug response markers (PHOX2B, SH3BGRL3, and YWHAE) detectable within one week were verified by liquid chromatography-multiple reaction monitoring/mass spectrometry (LC-MRM/MS) in 24 well-controlled patients. In addition, 11 proteins correlated significantly with two or more psychiatric measurement indices. This pilot study might be useful in finding protein marker candidates that can monitor response to antidepressant treatment during follow-up visits within 10 weeks after the baseline visit.
|
44
|
A deep learning-based, unsupervised method to impute missing values in electronic health records for improved patient management. J Biomed Inform 2020; 111:103576. [PMID: 33010424 DOI: 10.1016/j.jbi.2020.103576] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 01/29/2020] [Revised: 09/13/2020] [Accepted: 09/19/2020] [Indexed: 01/23/2023]
Abstract
Electronic health records (EHRs) often suffer missing values, for which recent advances in deep learning offer a promising remedy. We develop a deep learning-based, unsupervised method to impute missing values in patient records, then examine its imputation effectiveness and predictive efficacy for peritonitis patient management. Our method builds on a deep autoencoder framework, incorporates missing patterns, accounts for essential relationships in patient data, considers temporal patterns common to patient records, and employs a novel loss function for error calculation and regularization. Using a data set of 27,327 patient records, we perform a comparative evaluation of the proposed method and several prevalent benchmark techniques. The results indicate the greater imputation performance of our method relative to all the benchmark techniques, recording 5.3-15.5% lower imputation errors. Furthermore, the data imputed by the proposed method better predict readmission, length of stay, and mortality than those obtained from any benchmark techniques, achieving 2.7-11.5% improvements in predictive efficacy. The illustrated evaluation indicates the proposed method's viability, imputation effectiveness, and clinical decision support utilities. Overall, our method can reduce imputation biases and be applied to various missing value scenarios clinically, thereby empowering physicians and researchers to better analyze and utilize EHRs for improved patient management.
|
45
|
Wang S, Li W, Hu L, Cheng J, Yang H, Liu Y. NAguideR: performing and prioritizing missing value imputations for consistent bottom-up proteomic analyses. Nucleic Acids Res 2020; 48:e83. [PMID: 32526036 PMCID: PMC7641313 DOI: 10.1093/nar/gkaa498] [Citation(s) in RCA: 93] [Impact Index Per Article: 18.6] [Received: 02/18/2020] [Revised: 04/20/2020] [Accepted: 06/08/2020] [Indexed: 02/05/2023] Open
Abstract
Mass spectrometry (MS)-based quantitative proteomics experiments frequently generate data with missing values, which may profoundly affect downstream analyses. A wide variety of imputation methods have been established to deal with the missing-value issue. To date, however, there is a scarcity of efficient, systematic, and easy-to-handle tools that are tailored for the proteomics community. Herein, we developed a user-friendly and powerful stand-alone software tool, NAguideR, to enable implementation and evaluation of 23 widely used missing-value imputation algorithms. NAguideR further evaluates data imputation results through classic computational criteria and, unprecedentedly, proteomic empirical criteria, such as quantitative consistency between different charge-states of the same peptide, different peptides belonging to the same proteins, and individual proteins participating in protein complexes and functional interactions. We applied NAguideR to three label-free proteomic datasets featuring peptide-level, protein-level, and phosphoproteomic variables respectively, all generated by data independent acquisition mass spectrometry (DIA-MS) with substantial biological replicates. The results indicate that NAguideR is able to discriminate the optimal imputation methods that facilitate DIA-MS experiments from sub-optimal and low-performance algorithms. NAguideR further provides downloadable tables and figures supporting flexible data analysis and interpretation. NAguideR is freely available at http://www.omicsolution.org/wukong/NAguideR/ and the source code is available at https://github.com/wangshisheng/NAguideR/.
Affiliation(s)
- Shisheng Wang
- West China-Washington Mitochondria and Metabolism Research Center; Key Lab of Transplant Engineering and Immunology, MOH, Regenerative Medicine Research Center, West China Hospital, Sichuan University, Chengdu 610041, China
| | - Wenxue Li
- Yale Cancer Biology Institute, Yale University, West Haven, CT 06516, USA
| | - Liqiang Hu
- West China-Washington Mitochondria and Metabolism Research Center; Key Lab of Transplant Engineering and Immunology, MOH, Regenerative Medicine Research Center, West China Hospital, Sichuan University, Chengdu 610041, China
| | - Jingqiu Cheng
- West China-Washington Mitochondria and Metabolism Research Center; Key Lab of Transplant Engineering and Immunology, MOH, Regenerative Medicine Research Center, West China Hospital, Sichuan University, Chengdu 610041, China
| | - Hao Yang
- West China-Washington Mitochondria and Metabolism Research Center; Key Lab of Transplant Engineering and Immunology, MOH, Regenerative Medicine Research Center, West China Hospital, Sichuan University, Chengdu 610041, China
| | - Yansheng Liu
- Yale Cancer Biology Institute, Yale University, West Haven, CT 06516, USA; Department of Pharmacology, Yale University School of Medicine, New Haven, CT 06520, USA
|
46
|
Ma Q, Lee WC, Fu TY, Gu Y, Yu G. MIDIA: exploring denoising autoencoders for missing data imputation. Data Min Knowl Discov 2020. [DOI: 10.1007/s10618-020-00706-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 10/23/2022]
|
47
|
Pham TH, Qiu Y, Zeng J, Xie L, Zhang P. A deep learning framework for high-throughput mechanism-driven phenotype compound screening. bioRxiv 2020. [PMID: 32743586 PMCID: PMC7386506 DOI: 10.1101/2020.07.19.211235] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/24/2022]
Abstract
Target-based high-throughput compound screening dominates the conventional one-drug-one-gene drug discovery process. However, the readout from the chemical modulation of a single protein is poorly correlated with the phenotypic response of the organism, leading to a high failure rate in drug development. Chemical-induced gene expression profiles provide an attractive solution to phenotype-based screening. However, the use of such data is currently limited by their sparseness, unreliability, and relatively low throughput. Several methods have been proposed to impute missing values for gene expression datasets. However, few existing methods can perform de novo chemical compound screening. In this study, we propose a mechanism-driven neural network-based method named DeepCE (Deep Chemical Expression), which utilizes a graph convolutional neural network to learn chemical representation and a multi-head attention mechanism to model chemical substructure-gene and gene-gene feature associations. In addition, we propose a novel data augmentation method which extracts useful information from unreliable experiments in the L1000 dataset. The experimental results show that DeepCE achieves superior performance not only in the de novo chemical setting but also in the traditional imputation setting compared to state-of-the-art baselines for the prediction of chemical-induced gene expression. We further verify the effectiveness of gene expression profiles generated from DeepCE by comparing them with gene expression profiles in the L1000 dataset for downstream classification tasks including drug-target and disease predictions. To demonstrate the value of DeepCE, we apply it to patient-specific drug repurposing of COVID-19 for the first time, and generate novel lead compounds consistent with clinical evidence. Thus, DeepCE provides a potentially powerful framework for robust predictive modeling by utilizing noisy omics data as well as screening novel chemicals for the modulation of systemic response to disease.
Affiliation(s)
- Thai-Hoang Pham: The Ohio State University, Department of Computer Science and Engineering, Columbus, 43210, USA
- Yue Qiu: The City University of New York, Ph.D. Program in Biology, The Graduate Center, New York, 10016, USA
- Jucheng Zeng: The Ohio State University, Department of Biomedical Informatics, Columbus, 43210, USA
- Lei Xie: The City University of New York, Ph.D. Program in Biology, The Graduate Center, New York, 10016, USA; Hunter College, The City University of New York, Department of Computer Science, New York, 10065, USA; The City University of New York, Ph.D. Program in Computer Science and Biochemistry, The Graduate Center, New York, 10016, USA; Weill Cornell Medicine, Cornell University, Helen and Robert Appel Alzheimer's Disease Research Institute, Feil Family Brain Mind Research Institute, New York, 10021, USA
- Ping Zhang: The Ohio State University, Department of Computer Science and Engineering, Columbus, 43210, USA; The Ohio State University, Department of Biomedical Informatics, Columbus, 43210, USA
|
48
|
Missing Data Imputation for Geolocation-based Price Prediction Using KNN–MCF Method. ISPRS INTERNATIONAL JOURNAL OF GEO-INFORMATION 2020. [DOI: 10.3390/ijgi9040227] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Indexed: 11/16/2022]
Abstract
Accurate house price forecasts are very important for formulating national economic policies. In this paper, we offer an effective method to predict the sale prices of houses. Our algorithm includes one-hot encoding to convert text data into numeric data, feature correlation to select only the most correlated variables, and a technique for handling missing data. Our approach handles missing data in large datasets with the K-nearest neighbor algorithm based on the most correlated features (KNN–MCF). As far as we know, no previous research has focused on the most important features when dealing with missing observations. The proposed method achieves a prediction accuracy of 92.01% with the random forest algorithm, which compares favorably with typical machine learning prediction algorithms.
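The idea of imputing from the most correlated features can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `knn_mcf_impute`, its parameters `k` and `n_features`, the use of absolute Pearson correlation, and the plain mean over Euclidean nearest neighbors are all assumptions made for illustration. The sketch also assumes that categorical columns have already been one-hot encoded and that only the target column contains missing values.

```python
import numpy as np


def knn_mcf_impute(X, target_col, k=3, n_features=2):
    """Fill NaNs in one column using k-NN over its most correlated columns.

    Hypothetical helper: assumes all columns other than `target_col`
    are numeric and complete (no NaN).
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X[:, target_col])
    others = [j for j in range(X.shape[1]) if j != target_col]

    # Rank the other columns by absolute Pearson correlation with the
    # target, using only rows where the target is observed.
    corr = [
        abs(np.corrcoef(X[observed, j], X[observed, target_col])[0, 1])
        for j in others
    ]
    top = [others[i] for i in np.argsort(corr)[::-1][:n_features]]

    # Donor rows: those with an observed target value.
    donors = X[observed][:, top]
    donor_vals = X[observed, target_col]

    out = X.copy()
    for i in np.where(~observed)[0]:
        # Euclidean distance measured only on the selected features.
        dist = np.linalg.norm(donors - X[i, top], axis=1)
        nearest = np.argsort(dist)[:k]
        out[i, target_col] = donor_vals[nearest].mean()
    return out, top
```

Calling `knn_mcf_impute(X, target_col=2, k=2, n_features=1)` returns the completed matrix together with the indices of the selected features. In practice the features would usually be scaled before computing distances, a step omitted here for brevity.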
|
49
|
Schorr S, Nguyen D, Haßdenteufel S, Nagaraj N, Cavalié A, Greiner M, Weissgerber P, Loi M, Paton AW, Paton JC, Molinari M, Förster F, Dudek J, Lang S, Helms V, Zimmermann R. Identification of signal peptide features for substrate specificity in human Sec62/Sec63-dependent ER protein import. FEBS J 2020; 287:4612-4640. [PMID: 32133789 DOI: 10.1111/febs.15274] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 12/06/2019] [Revised: 01/22/2020] [Accepted: 03/02/2020] [Indexed: 02/06/2023]
Abstract
In mammalian cells, one-third of all polypeptides are integrated into the membrane or translocated into the lumen of the endoplasmic reticulum (ER) via the Sec61 channel. While the Sec61 complex facilitates ER import of most precursor polypeptides, the Sec61-associated Sec62/Sec63 complex supports ER import in a substrate-specific manner. So far, mainly posttranslationally imported precursors and the two cotranslationally imported precursors of ERj3 and prion protein were found to depend on the Sec62/Sec63 complex in vitro. Therefore, we determined the rules for engagement of Sec62/Sec63 in ER import in intact human cells using a recently established unbiased proteomics approach. In addition to confirming ERj3, we identified 22 novel Sec62/Sec63 substrates under these in vivo-like conditions. As a common feature, those previously unknown substrates share signal peptides (SPs) whose hydrophobic region is comparatively longer but less hydrophobic and whose carboxy-terminal region (C-region) has lower polarity. Further analyses with four substrates, and ERj3 in particular, revealed the combination of a slowly gating SP and a downstream translocation-disruptive positively charged cluster of amino acid residues as decisive for the Sec62/Sec63 requirement. In the case of ERj3, these features were found to be responsible for an additional immunoglobulin heavy-chain binding protein (BiP) requirement and to correlate with sensitivity toward the Sec61-channel inhibitor CAM741. Thus, the human Sec62/Sec63 complex may support Sec61-channel opening for precursor polypeptides with slowly gating SPs by direct interaction with the cytosolic amino-terminal peptide of Sec61α or via recruitment of BiP and its interaction with the ER-lumenal loop 7 of Sec61α. These novel insights into the mechanism of human ER protein import contribute to our understanding of the etiology of SEC63-linked polycystic liver disease.
DATABASES: The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository (http://www.ebi.ac.uk/pride/archive/projects/Identifiers) with the dataset identifiers: PXD008178, PXD011993, and PXD012078. Supplementary information was deposited at Mendeley Data (https://data.mendeley.com/datasets/6s5hn73jcv/2).
Affiliation(s)
- Stefan Schorr: Medical Biochemistry and Molecular Biology, Saarland University, Homburg, Germany
- Duy Nguyen: Center for Bioinformatics, Saarland University, Saarbrücken, Germany
- Sarah Haßdenteufel: Medical Biochemistry and Molecular Biology, Saarland University, Homburg, Germany
- Nagarjuna Nagaraj: Core Facility, Max Planck Institute of Biochemistry, Martinsried, Germany
- Adolfo Cavalié: Experimental and Clinical Pharmacology and Toxicology, Saarland University, Homburg, Germany
- Markus Greiner: Medical Biochemistry and Molecular Biology, Saarland University, Homburg, Germany
- Petra Weissgerber: Experimental and Clinical Pharmacology and Toxicology, Saarland University, Homburg, Germany
- Marisa Loi: Faculty of Biomedical Sciences, Institute for Research in Biomedicine, Università della Svizzera italiana, Bellinzona, Switzerland
- Adrienne W Paton: Research Centre for Infectious Diseases, University of Adelaide, SA, Australia
- James C Paton: Research Centre for Infectious Diseases, University of Adelaide, SA, Australia
- Maurizio Molinari: Faculty of Biomedical Sciences, Institute for Research in Biomedicine, Università della Svizzera italiana, Bellinzona, Switzerland
- Friedrich Förster: Bijvoet Center for Biomolecular Research, Utrecht University, The Netherlands
- Johanna Dudek: Medical Biochemistry and Molecular Biology, Saarland University, Homburg, Germany
- Sven Lang: Medical Biochemistry and Molecular Biology, Saarland University, Homburg, Germany
- Volkhard Helms: Center for Bioinformatics, Saarland University, Saarbrücken, Germany
- Richard Zimmermann: Medical Biochemistry and Molecular Biology, Saarland University, Homburg, Germany
|
50
|
Zhang L, Zhang S. Comparison of Computational Methods for Imputing Single-Cell RNA-Sequencing Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:376-389. [PMID: 29994128 DOI: 10.1109/tcbb.2018.2848633] [Citation(s) in RCA: 52] [Impact Index Per Article: 10.4] [Indexed: 06/08/2023]
Abstract
Single-cell RNA-sequencing (scRNA-seq) is a recent breakthrough technology that paves the way for measuring RNA levels at single-cell resolution to study precise biological functions. One of the main challenges when analyzing scRNA-seq data is the presence of zeros or dropout events, which may mislead downstream analyses. To compensate for the dropout effect, several methods have been developed to impute gene expression since the first Bayesian-based method was proposed in 2016. However, these methods differ widely in their model hypotheses and imputation performance. Thus, a large-scale comparison and evaluation of these methods is urgently needed. To this end, we compared eight imputation methods, evaluated their power in recovering original real data, and performed broad analyses to explore their effects on clustering cell types, detecting differentially expressed genes, and reconstructing lineage trajectories in the context of both simulated and real data. Simulated datasets and case studies highlight that no single method performs best in all situations. Some defects of these methods, such as scalability, robustness, and unavailability in some situations, need to be addressed in future studies.
|