51
|
Organick L, Chen YJ, Dumas Ang S, Lopez R, Liu X, Strauss K, Ceze L. Probing the physical limits of reliable DNA data retrieval. Nat Commun 2020; 11:616. [PMID: 32001691 PMCID: PMC6992699 DOI: 10.1038/s41467-020-14319-8] [Citation(s) in RCA: 58] [Impact Index Per Article: 11.6] [Received: 05/28/2019] [Accepted: 12/16/2019] [Indexed: 12/31/2022] Open
Abstract
Synthetic DNA is gaining momentum as a potential storage medium for archival data storage. In this process, digital information is translated into sequences of nucleotides and the resulting synthetic DNA strands are then stored for later retrieval. Here, we demonstrate reliable file recovery with PCR-based random access when as few as ten copies per sequence are stored, on average. This results in density of about 17 exabytes/gram, nearly two orders of magnitude greater than prior work has shown. We successfully retrieve the same data in a complex pool of over 10^10 unique sequences per microliter with no evidence that we have begun to approach complexity limits. Finally, we also investigate the effects of file size and sequencing coverage on successful file retrieval and look for systematic DNA strand drop out. These findings substantiate the robustness and high data density of the process examined here.
Affiliation(s)
- Lee Organick
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, 98195, USA
- Randolph Lopez
  - Department of Bioengineering, University of Washington, Seattle, WA, 98195, USA
- Xiaomeng Liu
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, 98195, USA
- Luis Ceze
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, 98195, USA
|
52
|
|
53
|
Nikfalazar S, Yeh CH, Bedingfield S, Khorshidi HA. Missing data imputation using decision trees and fuzzy clustering with iterative learning. Knowl Inf Syst 2019. [DOI: 10.1007/s10115-019-01427-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 12/29/2022]
|
54
|
Du Y, Han G, Quan Y, Yu Z, Wong HS, Chen CLP, Zhang J. Exploiting Global Low-Rank Structure and Local Sparsity Nature for Tensor Completion. IEEE Transactions on Cybernetics 2019; 49:3898-3910. [PMID: 30047919 DOI: 10.1109/tcyb.2018.2853122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
In the era of data science, a huge amount of data has emerged in the form of tensors. In many applications, the collected tensor data are incomplete with missing entries, which affects the analysis process. In this paper, we investigate a new method for tensor completion, in which a low-rank tensor approximation is used to exploit the global structure of data, and sparse coding is used for elucidating the local patterns of data. Regarding the characterization of low-rank structures, a weighted nuclear norm for the tensor is introduced. Meanwhile, an orthogonal dictionary learning process is incorporated into sparse coding for more effective discovery of the local details of data. By simultaneously using the global patterns and local cues, the proposed method can effectively and efficiently recover the lost information of incomplete tensor data. The capability of the proposed method is demonstrated with several experiments on recovering MRI data and visual data, and the experimental results have shown the excellent performance of the proposed method in comparison with recent related methods.
|
55
|
Wang X, Shen S, Rasam SS, Qu J. MS1 ion current-based quantitative proteomics: A promising solution for reliable analysis of large biological cohorts. Mass Spectrometry Reviews 2019; 38:461-482. [PMID: 30920002 PMCID: PMC6849792 DOI: 10.1002/mas.21595] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 10/19/2018] [Accepted: 02/28/2019] [Indexed: 05/04/2023]
Abstract
The rapidly-advancing field of pharmaceutical and clinical research calls for systematic, molecular-level characterization of complex biological systems. To this end, quantitative proteomics represents a powerful tool but an optimal solution for reliable large-cohort proteomics analysis, as frequently involved in pharmaceutical/clinical investigations, is urgently needed. Large-cohort analysis remains challenging owing to the deteriorating quantitative quality and snowballing missing data and false-positive discovery of altered proteins when sample size increases. MS1 ion current-based methods, which have become an important class of label-free quantification techniques during the past decade, show considerable potential to achieve reproducible protein measurements in large cohorts with high quantitative accuracy/precision. Nonetheless, in order to fully unleash this potential, several critical prerequisites should be met. Here we provide an overview of the rationale of MS1-based strategies and then important considerations for experimental and data processing techniques, with the emphasis on (i) efficient and reproducible sample preparation and LC separation; (ii) sensitive, selective and high-resolution MS detection; (iii) accurate chromatographic alignment; (iv) sensitive and selective generation of quantitative features; and (v) optimal post-feature-generation data quality control. Prominent technical developments in these aspects are discussed. Finally, we reviewed applications of MS1-based strategy in disease mechanism studies, biomarker discovery, and pharmaceutical investigations.
Affiliation(s)
- Xue Wang
  - Department of Cell Stress Biology, Roswell Park Cancer Institute, Buffalo, New York
- Shichen Shen
  - Department of Pharmaceutical Sciences, University at Buffalo, State University of New York, New York, New York
- Sailee Suryakant Rasam
  - Department of Biochemistry, University at Buffalo, State University of New York, New York, New York
- Jun Qu
  - Department of Cell Stress Biology, Roswell Park Cancer Institute, Buffalo, New York
  - Department of Pharmaceutical Sciences, University at Buffalo, State University of New York, New York, New York
  - Department of Biochemistry, University at Buffalo, State University of New York, New York, New York
|
56
|
ElGendy K, Malcomson FC, Bradburn DM, Mathers JC. Effects of bariatric surgery on DNA methylation in adults: a systematic review and meta-analysis. Surg Obes Relat Dis 2019; 16:128-136. [PMID: 31708383 DOI: 10.1016/j.soard.2019.09.075] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 08/22/2019] [Revised: 09/24/2019] [Accepted: 09/27/2019] [Indexed: 01/06/2023]
Abstract
BACKGROUND: DNA methylation is an epigenetic mechanism through which environmental factors, including obesity, influence health. Obesity is a major modifiable risk factor for many common diseases, including cardiovascular diseases and cancer. Obesity-induced metabolic stress and inflammation are key mechanisms that affect disease risk and that may result from changes in methylation of metabolic and inflammatory genes. OBJECTIVES: This review aims to report the effects of weight loss induced by bariatric surgery (BS) on DNA methylation in adults with obesity focusing on changes in metabolic and inflammatory genes. METHODS: A systematic review was performed using MEDLINE, EMBASE, and Scopus, to identify studies in adult humans that reported DNA methylation after BS. RESULTS: Of 15,996 screened titles, 15 intervention studies were identified, all of which reported significantly lower body mass index postsurgery. DNA methylation was assessed in 5 different tissues (blood = 7 studies, adipose tissues = 4, skeletal muscle = 2, liver, and spermatozoa). Twelve studies reported significant changes in DNA methylation after BS. Meta-analysis showed that BS increased methylation of PDK4 loci in skeletal muscle and blood in 2 studies, while the effects of BS on IL6 methylation levels in blood were inconsistent. BS had no overall effect on LINE1 or PPARGC1 methylation. CONCLUSION: The current evidence supports the reversibility of DNA methylation at specific loci in response to BS-induced weight loss. These changes are consistent with improved metabolic and inflammatory profiles of patients after BS. However, the evidence regarding the effects of BS on DNA methylation in humans is limited and inconsistent, which makes it difficult to combine and compare data across studies.
Affiliation(s)
- Khalil ElGendy
  - Human Nutrition Research Centre, Institute of Cellular Medicine, Newcastle University, Newcastle upon Tyne, United Kingdom; Surgery Department, Northumbria NHS Foundation Trust, Newcastle upon Tyne, United Kingdom
- Fiona C Malcomson
  - Human Nutrition Research Centre, Institute of Cellular Medicine, Newcastle University, Newcastle upon Tyne, United Kingdom
- D Michael Bradburn
  - Surgery Department, Northumbria NHS Foundation Trust, Newcastle upon Tyne, United Kingdom
- John C Mathers
  - Human Nutrition Research Centre, Institute of Cellular Medicine, Newcastle University, Newcastle upon Tyne, United Kingdom
|
57
|
Tang J, Fu J, Wang Y, Luo Y, Yang Q, Li B, Tu G, Hong J, Cui X, Chen Y, Yao L, Xue W, Zhu F. Simultaneous Improvement in the Precision, Accuracy, and Robustness of Label-free Proteome Quantification by Optimizing Data Manipulation Chains. Mol Cell Proteomics 2019; 18:1683-1699. [PMID: 31097671 PMCID: PMC6682996 DOI: 10.1074/mcp.ra118.001169] [Citation(s) in RCA: 93] [Impact Index Per Article: 15.5] [Received: 10/29/2018] [Revised: 04/28/2019] [Indexed: 12/13/2022] Open
Abstract
Label-free proteome quantification (LFQ) is a multistep workflow collectively defined by quantification tools and subsequent data manipulation methods that has been extensively applied in current biomedical, agricultural, and environmental studies. Despite recent advances, in-depth and high-quality quantification remains extremely challenging and requires the optimization of LFQs by comparatively evaluating their performance. However, the evaluation results using different criteria (precision, accuracy, and robustness) vary greatly, and the huge number of potential LFQs becomes one of the bottlenecks in comprehensively optimizing proteome quantification. In this study, a novel strategy, enabling the discovery of the LFQs of simultaneously enhanced performance from thousands of workflows (integrating 18 quantification tools with 3,128 manipulation chains), was therefore proposed. First, the feasibility of achieving simultaneous improvement in the precision, accuracy, and robustness of LFQ was systematically assessed by collectively optimizing its multistep manipulation chains. Second, based on a variety of benchmark datasets acquired by various quantification measurements of different modes of acquisition, this novel strategy successfully identified a number of manipulation chains that simultaneously improved the performance across multiple criteria. Finally, to further enhance proteome quantification and discover the LFQs of optimal performance, an online tool (https://idrblab.org/anpela/) enabling collective performance assessment (from multiple perspectives) of the entire LFQ workflow was developed. This study confirmed the feasibility of achieving simultaneous improvement in precision, accuracy, and robustness. The novel strategy proposed and validated in this study together with the online tool might provide useful guidance for the research field requiring the mass-spectrometry-based LFQ technique.
Affiliation(s)
- Jing Tang
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China; School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China; Department of Bioinformatics, Chongqing Medical University, Chongqing 400016, China
- Jianbo Fu
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yunxia Wang
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yongchao Luo
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Qingxia Yang
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China; School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Bo Li
  - School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Gao Tu
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China; School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Jiajun Hong
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Xuejiao Cui
  - School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Yuzong Chen
  - Department of Pharmacy, National University of Singapore, Singapore 117543, Singapore
- Lixia Yao
  - Department of Health Sciences Research, Mayo Clinic, Rochester MN 55905, United States
- Weiwei Xue
  - School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Feng Zhu
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China; School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
|
58
|
Iwata M, Yuan L, Zhao Q, Tabei Y, Berenger F, Sawada R, Akiyoshi S, Hamano M, Yamanishi Y. Predicting drug-induced transcriptome responses of a wide range of human cell lines by a novel tensor-train decomposition algorithm. Bioinformatics 2019; 35:i191-i199. [PMID: 31510663 PMCID: PMC6612872 DOI: 10.1093/bioinformatics/btz313] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 12/12/2022] Open
Abstract
MOTIVATION: Genome-wide identification of the transcriptomic responses of human cell lines to drug treatments is a challenging issue in medical and pharmaceutical research. However, drug-induced gene expression profiles are largely unknown and unobserved for all combinations of drugs and human cell lines, which is a serious obstacle in practical applications. RESULTS: Here, we developed a novel computational method to predict unknown parts of drug-induced gene expression profiles for various human cell lines and predict new drug therapeutic indications for a wide range of diseases. We proposed a tensor-train weighted optimization (TT-WOPT) algorithm to predict the potential values for unknown parts in tensor-structured gene expression data. Our results revealed that the proposed TT-WOPT algorithm can accurately reconstruct drug-induced gene expression data for a range of human cell lines in the Library of Integrated Network-based Cellular Signatures. The results also revealed that in comparison with the use of original gene expression profiles, the use of imputed gene expression profiles improved the accuracy of drug repositioning. We also performed a comprehensive prediction of drug indications for diseases with gene expression profiles, which suggested many potential drug indications that were not predicted by previous approaches. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Michio Iwata
  - Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Iizuka, Fukuoka, Japan
- Longhao Yuan
  - Graduate School of Engineering, Saitama Institute of Technology, Fukaya, Saitama, Japan
  - RIKEN Center for Advanced Intelligence Project, Chuo-ku, Tokyo, Japan
- Qibin Zhao
  - RIKEN Center for Advanced Intelligence Project, Chuo-ku, Tokyo, Japan
  - School of Automation, Guangdong University of Technology, Guangzhou, Guangdong, China
- Yasuo Tabei
  - RIKEN Center for Advanced Intelligence Project, Chuo-ku, Tokyo, Japan
- Francois Berenger
  - Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Iizuka, Fukuoka, Japan
- Ryusuke Sawada
  - Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Iizuka, Fukuoka, Japan
- Sayaka Akiyoshi
  - Medical Institute of Bioregulation, Kyushu University, Higashi-ku, Fukuoka, Japan
- Momoko Hamano
  - Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Iizuka, Fukuoka, Japan
- Yoshihiro Yamanishi
  - Department of Bioscience and Bioinformatics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Iizuka, Fukuoka, Japan
  - PRESTO Japan Science and Technology Agency, Kawaguchi, Saitama, Japan
|
59
|
Välikangas T, Suomi T, Elo LL. A comprehensive evaluation of popular proteomics software workflows for label-free proteome quantification and imputation. Brief Bioinform 2019; 19:1344-1355. [PMID: 28575146 PMCID: PMC6291797 DOI: 10.1093/bib/bbx054] [Citation(s) in RCA: 63] [Impact Index Per Article: 10.5] [Received: 03/03/2017] [Indexed: 01/15/2023] Open
Abstract
Label-free mass spectrometry (MS) has developed into an important tool applied in various fields of biological and life sciences. Several software exist to process the raw MS data into quantified protein abundances, including open source and commercial solutions. Each software includes a set of unique algorithms for different tasks of the MS data processing workflow. While many of these algorithms have been compared separately, a thorough and systematic evaluation of their overall performance is missing. Moreover, systematic information is lacking about the amount of missing values produced by the different proteomics software and the capabilities of different data imputation methods to account for them. In this study, we evaluated the performance of five popular quantitative label-free proteomics software workflows using four different spike-in data sets. Our extensive testing included the number of proteins quantified and the number of missing values produced by each workflow, the accuracy of detecting differential expression and logarithmic fold change and the effect of different imputation and filtering methods on the differential expression results. We found that the Progenesis software performed consistently well in the differential expression analysis and produced few missing values. The missing values produced by the other software decreased their performance, but this difference could be mitigated using proper data filtering or imputation methods. Among the imputation methods, we found that the local least squares (lls) regression imputation consistently increased the performance of the software in the differential expression analysis, and a combination of both data filtering and local least squares imputation increased performance the most in the tested data sets.
Affiliation(s)
- Tommi Välikangas
  - Computational Biomedicine Group, Turku Centre for Biotechnology Finland
- Tomi Suomi
  - Computational Biomedicine research group at the Turku Centre for Biotechnology Finland
- Laura L Elo
  - Biomathematics, Research Director in Bioinformatics and Group Leader in Computational Biomedicine at Turku Centre for Biotechnology, University of Turku, Finland
|
60
|
Laishram A, Padmanabhan V. Discovery of user-item subgroups via genetic algorithm for effective prediction of ratings in collaborative filtering. Appl Intell 2019. [DOI: 10.1007/s10489-019-01495-4] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 01/05/2023]
|
61
|
de Campos LM, Cano A, Castellano JG, Moral S. Combining gene expression data and prior knowledge for inferring gene regulatory networks via Bayesian networks using structural restrictions. Stat Appl Genet Mol Biol 2019; 18:sagmb-2018-0042. [PMID: 31042646 DOI: 10.1515/sagmb-2018-0042] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/15/2022]
Abstract
Gene Regulatory Networks (GRNs) are known as the most adequate instrument to provide a clear insight and understanding of the cellular systems. One of the most successful techniques to reconstruct GRNs using gene expression data is Bayesian networks (BN), which have proven to be an ideal approach for heterogeneous data integration in the learning process. Nevertheless, the incorporation of prior knowledge has been achieved by using prior beliefs or by using networks as a starting point in the search process. In this work, the utilization of different kinds of structural restrictions within algorithms for learning BNs from gene expression data is considered. These restrictions will codify prior knowledge, in such a way that a BN should satisfy them. Therefore, one aim of this work is to make a detailed review on the use of prior knowledge and gene expression data to infer GRNs from BNs, but the major purpose in this paper is to research whether the structural learning algorithms for BNs from expression data can achieve better outcomes exploiting this prior knowledge with the use of structural restrictions. In the experimental study, it is shown that this new way to incorporate prior knowledge leads us to achieve better reverse-engineered networks.
Affiliation(s)
- Luis M de Campos
  - Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Andrés Cano
  - Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Javier G Castellano
  - Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Serafín Moral
  - Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
|
62
|
A nifty collaborative analysis to predicting a novel tool (DRFLLS) for missing values estimation. Soft Comput 2019. [DOI: 10.1007/s00500-019-03972-x] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Indexed: 10/27/2022]
|
63
|
Abstract
Analysis of genomic data is often complicated by the presence of missing values, which may arise due to cost or other reasons. The prevailing approach of single imputation is generally invalid if the imputation model is misspecified. In this paper, we propose a robust score statistic based on imputed data for testing the association between a phenotype and a genomic variable with (partially) missing values. We fit a semiparametric regression model for the genomic variable against an arbitrary function of the linear predictor in the phenotype model and impute each missing value by its estimated posterior expectation. We show that the score statistic with such imputed values is asymptotically unbiased under general missing-data mechanisms, even when the imputation model is misspecified. We develop a spline-based method to estimate the semiparametric imputation model and derive the asymptotic distribution of the corresponding score statistic with a consistent variance estimator using sieve approximation theory and empirical process theory. The proposed test is computationally feasible regardless of the number of independent variables in the imputation model. We demonstrate the advantages of the proposed method over existing methods through extensive simulation studies and provide an application to a major cancer genomics study.
Affiliation(s)
- Kin Yau Wong
  - Department of Applied Mathematics, The Hong Kong Polytechnic University, Hong Kong
- Donglin Zeng
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599, USA
- D Y Lin
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599, USA
|
64
|
Islam MS, Hoque MA, Islam MS, Ali M, Hossen MB, Binyamin M, Merican AF, Akazawa K, Kumar N, Sugimoto M. Mining Gene Expression Profile with Missing Values: An Integration of Kernel PCA and Robust Singular Values Decomposition. Curr Bioinform 2018. [DOI: 10.2174/1574893613666180413151654] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 01/25/2023]
Abstract
Background: Gene expression profiling and transcriptomics provide valuable information about the role of genes that are differentially expressed between two or more samples. It is always important and challenging to analyse high-throughput DNA microarray data with a number of missing values under various experimental conditions. Objectives: Graphical data visualizations of the expression of all genes in a particular cell provide holistic views of gene expression patterns, which improve our understanding of cellular systems under normal and pathological conditions. However, current visualization methods are sensitive to missing values, which are frequently observed in microarray-based gene expression profiling, potentially affecting the subsequent statistical analyses. Methods: We addressed in this study the problem of missing values with respect to different imputation methods using gene expression biplot (GE biplot), one of the most popular gene visualization techniques. The effects of missing values for mining differentially expressed genes in gene expression data were evaluated using four well-known imputation methods: Robust Singular Value Decomposition (Robust SVD), Column Average (CA), Column Median (CM), and K-nearest Neighbors (KNN). Frobenius norm and absolute distances were used to measure the accuracy of the methods. Results: Three numerical experiments were performed using simulated data (i) and publicly available colon cancer (ii) and leukemia data (iii) to analyze the performance of each method. The results showed that CM and KNN performed better than Robust SVD and CA for identifying the index gene profile in the biplot visualization in both the simulation study and the colon cancer and leukemia microarray datasets. Conclusion: The impact of missing values on the GE biplot was smaller when the data matrix was imputed by KNN than by CM. This study concluded that KNN performed satisfactorily in generating a GE biplot in the presence of missing values in microarray data.
Affiliation(s)
- Md. Saimul Islam
  - Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh
- Md. Aminul Hoque
  - Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh
- Md. Sahidul Islam
  - Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh
- Mohammad Ali
  - Statistics Discipline, Khulna University, Khulna-9208, Bangladesh
- Md. Bipul Hossen
  - Department of Statistics, Begum Rokeya University, Rangpur-5400, Bangladesh
- Md. Binyamin
  - Department of Statistics, Mawlana Bhashani Science and Technology University, Santosh, Tangail-1902, Bangladesh
- Amir Feisal Merican
  - Institute of Biological Sciences, Faculty of Science and Centre of Research for Computational Sciences & Informatics for Biology, Bioindustry, Environment, Agriculture, and Healthcare (CRYSTAL), University of Malaya, Kuala Lumpur-50603, Malaysia
- Kohei Akazawa
  - Department of Medical Informatics, Niigata University Medical and Dental Hospital, Asahimachidori 1-754, Niigata 951-8520, Japan
- Nishith Kumar
  - Department of Statistics, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh
- Masahiro Sugimoto
  - Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh
|
65
|
Lee JY, Styczynski MP. NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data. Metabolomics 2018; 14:153. [PMID: 30830437 PMCID: PMC6532628 DOI: 10.1007/s11306-018-1451-8] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Received: 08/06/2018] [Accepted: 11/15/2018] [Indexed: 01/28/2023]
Abstract
INTRODUCTION: A common problem in metabolomics data analysis is the existence of a substantial number of missing values, which can complicate, bias, or even prevent certain downstream analyses. One of the most widely-used solutions to this problem is imputation of missing values using a k-nearest neighbors (kNN) algorithm to estimate missing metabolite abundances. kNN implicitly assumes that missing values are uniformly distributed at random in the dataset, but this is typically not true in metabolomics, where many values are missing because they are below the limit of detection of the analytical instrumentation. OBJECTIVES: Here, we explore the impact of nonuniformly distributed missing values (missing not at random, or MNAR) on imputation performance. We present a new model for generating synthetic missing data and a new algorithm, No-Skip kNN (NS-kNN), that accounts for MNAR values to provide more accurate imputations. METHODS: We compare the imputation errors of the original kNN algorithm using two distance metrics, NS-kNN, and a recently developed algorithm KNN-TN, when applied to multiple experimental datasets with different types and levels of missing data. RESULTS: Our results show that NS-kNN typically outperforms kNN when at least 20-30% of missing values in a dataset are MNAR. NS-kNN also has lower imputation errors than KNN-TN on realistic datasets when at least 50% of missing values are MNAR. CONCLUSION: Accounting for the nonuniform distribution of missing values in metabolomics data can significantly improve the results of imputation algorithms. The NS-kNN method imputes missing metabolomics data more accurately than existing kNN-based approaches when used on realistic datasets.
Affiliation(s)
- Justin Y Lee
  - School of Chemical & Biomolecular Engineering, Georgia Institute of Technology, 311 Ferst Drive, Atlanta, GA, 30332-0100, USA
- Mark P Styczynski
  - School of Chemical & Biomolecular Engineering, Georgia Institute of Technology, 311 Ferst Drive, Atlanta, GA, 30332-0100, USA
|
66
|
Yan Y, Dai T, Yang M, Du X, Zhang Y, Zhang Y. Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique. Int J Mol Sci 2018; 19:ijms19113398. [PMID: 30380746 PMCID: PMC6274900 DOI: 10.3390/ijms19113398] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/11/2018] [Revised: 10/20/2018] [Accepted: 10/23/2018] [Indexed: 01/09/2023] Open
Abstract
(1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods focused on how to estimate MVs have been proposed in the past few years. Recent studies show that those imputation algorithms made little difference in classification. Thus, some scholars believe that how to select the informative genes for downstream classification is more important than how to impute MVs. However, most feature-selection (FS) algorithms need beforehand imputation, and the impact of beforehand MV imputation on downstream FS performance is seldom considered. (2) Method: A modified chi-square test-based FS is introduced for gene-expression data. To deal with the challenge of a small sample size of gene-expression data, a heuristic method called recursive element aggregation is proposed in this study. Our approach can directly handle incomplete data without any imputation methods or missing-data assumptions. The most informative genes can be selected through a threshold. After that, the best-first search strategy is utilized to find optimal feature subsets for classification. (3) Results: We compare our method with several FS algorithms. Evaluation is performed on twelve original incomplete cancer gene-expression datasets. We demonstrate that MV imputation on an incomplete dataset impacts subsequent FS in terms of classification tasks. Through directly conducting FS on incomplete data, our method can avoid potential disturbances on subsequent FS procedures caused by MV imputation. An experiment on small, round blue cell tumor (SRBCT) dataset showed that our method found additional genes besides many common genes with the two compared existing methods.
Affiliation(s)
- Yuanting Yan
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China.
- Tao Dai
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
- Meili Yang
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
- Xiuquan Du
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China.
- Yiwen Zhang
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China.
- Yanping Zhang
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China.
67
Effects of dietary interventions on DNA methylation in adult humans: systematic review and meta-analysis. Br J Nutr 2018; 120:961-976. [DOI: 10.1017/s000711451800243x] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Indexed: 12/22/2022]
Abstract
DNA methylation is a key component of the epigenetic machinery that is responsible for regulating gene expression and, therefore, cell function. Patterns of DNA methylation change during development and ageing, differ between cell types, are altered in multiple diseases and can be modulated by dietary factors. However, evidence about the effects of dietary factors on DNA methylation patterns in humans is fragmentary. This study was initiated to collate evidence for causal links between dietary factors and changes in DNA methylation patterns. We carried out a systematic review of dietary intervention studies in adult humans using Medline, EMBASE and Scopus. Out of 22 149 screened titles, sixty intervention studies were included, of which 65% were randomised (n 39). Most studies (53%) reported data from blood analyses, whereas 27% studied DNA methylation in colorectal mucosal biopsies. Folic acid was the most common intervention agent (33%). There was great heterogeneity in the methods used for assessing DNA methylation and in the genomic loci investigated. Meta-analysis of the effect of folic acid on global DNA methylation revealed strong evidence that supplementation caused hypermethylation in colorectal mucosa (P=0·009). Meta-regression analysis showed that the dose of supplementary folic acid was the only identified factor (P<0·001) showing a positive relationship. In summary, there is limited evidence from intervention studies of effects of dietary factors, other than folic acid, on DNA methylation patterns in humans. In addition, the application of multiple different assays and investigations of different genomic loci makes it difficult to compare, or to combine, data across studies.
68
Choi HS, Choe JY, Kim H, Han JW, Chi YK, Kim K, Hong J, Kim T, Kim TH, Yoon S, Kim KW. Deep learning based low-cost high-accuracy diagnostic framework for dementia using comprehensive neuropsychological assessment profiles. BMC Geriatr 2018; 18:234. [PMID: 30285646 PMCID: PMC6171238 DOI: 10.1186/s12877-018-0915-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 08/23/2017] [Accepted: 09/10/2018] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND The conventional scores of neuropsychological batteries are not fully optimized for diagnosing dementia despite their variety and abundance of information. To achieve low-cost, high-accuracy diagnostic performance for dementia using a neuropsychological battery, a novel framework is proposed using the response profiles of 2666 cognitively normal elderly individuals and 435 dementia patients who participated in the Korean Longitudinal Study on Cognitive Aging and Dementia (KLOSCAD). METHODS The key idea of the proposed framework is a cost-effective and precise two-stage classification procedure that employs the Mini Mental Status Examination (MMSE) as a screening test and the KLOSCAD Neuropsychological Assessment Battery as a diagnostic test using deep learning. In addition, an evaluation procedure for redundant variables is introduced to prevent performance degradation. A missing data imputation method is also presented to increase robustness by recovering lost information. The proposed deep neural network (DNN) architecture for the classification is validated through rigorous evaluation in comparison with various classifiers. RESULTS k-nearest-neighbor imputation was adopted according to the proposed framework, and the proposed DNNs for two-stage classification show the best accuracy compared to the other classifiers. Also, 49 redundant variables were removed, which improved diagnostic performance and suggested the potential for simplifying the assessment. Using this two-stage framework, we obtained 8.06% higher diagnostic accuracy for dementia than MMSE alone and 64.13% lower cost than KLOSCAD-N alone. CONCLUSION The proposed framework could be applied to general dementia early detection programs to improve robustness, precision, and cost-effectiveness.
Affiliation(s)
- Hyun-Soo Choi
  - Department of Electrical and Computer Engineering, Seoul National University, room 908 Bldg. 301, 1 Gwanak-ro, Gwanak-gu, Seoul, 08826, Korea
- Jin Yeong Choe
  - Department of Brain and Cognitive Sciences, Seoul National University College of Natural Sciences, Seoul, Korea
- Hanjoo Kim
  - Department of Electrical and Computer Engineering, Seoul National University, room 908 Bldg. 301, 1 Gwanak-ro, Gwanak-gu, Seoul, 08826, Korea
- Ji Won Han
  - Department of Neuropsychiatry, Seoul National University Bundang Hospital, 82 Gumi-ro 173beon-gil, Bundang-gu, Gyeonggi, 13620, Korea
- Yeon Kyung Chi
  - Department of Neuropsychiatry, Seoul National University Bundang Hospital, 82 Gumi-ro 173beon-gil, Bundang-gu, Gyeonggi, 13620, Korea
- Kayoung Kim
  - Department of Neuropsychiatry, Seoul National University Bundang Hospital, 82 Gumi-ro 173beon-gil, Bundang-gu, Gyeonggi, 13620, Korea
- Jongwoo Hong
  - Department of Neuropsychiatry, Seoul National University Bundang Hospital, 82 Gumi-ro 173beon-gil, Bundang-gu, Gyeonggi, 13620, Korea
- Taehyun Kim
  - Department of Neuropsychiatry, Seoul National University Bundang Hospital, 82 Gumi-ro 173beon-gil, Bundang-gu, Gyeonggi, 13620, Korea
- Tae Hui Kim
  - Department of Psychiatry, Yonsei University Wonju Severance Christian Hospital, Wonju, Korea
- Sungroh Yoon
  - Department of Electrical and Computer Engineering, Seoul National University, room 908 Bldg. 301, 1 Gwanak-ro, Gwanak-gu, Seoul, 08826, Korea.
- Ki Woong Kim
  - Department of Brain and Cognitive Sciences, Seoul National University College of Natural Sciences, Seoul, Korea.
  - Department of Neuropsychiatry, Seoul National University Bundang Hospital, 82 Gumi-ro 173beon-gil, Bundang-gu, Gyeonggi, 13620, Korea.
  - Department of Psychiatry, Seoul National University College of Medicine, Seoul, Korea.
69
Nguyen D, Stutz R, Schorr S, Lang S, Pfeffer S, Freeze HH, Förster F, Helms V, Dudek J, Zimmermann R. Proteomics reveals signal peptide features determining the client specificity in human TRAP-dependent ER protein import. Nat Commun 2018; 9:3765. [PMID: 30217974 PMCID: PMC6138672 DOI: 10.1038/s41467-018-06188-z] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Received: 09/01/2017] [Accepted: 08/23/2018] [Indexed: 12/22/2022] Open
Abstract
In mammalian cells, one-third of all polypeptides are transported into or across the ER membrane via the Sec61 channel. While the Sec61 complex facilitates translocation of all polypeptides with amino-terminal signal peptides (SP) or transmembrane helices, the Sec61-auxiliary translocon-associated protein (TRAP) complex supports translocation of only a subset of precursors. To characterize determinants of TRAP substrate specificity, we here systematically identify TRAP-dependent precursors by analyzing cellular protein abundance changes upon TRAP depletion using quantitative label-free proteomics. The results are validated in independent experiments by western blotting, quantitative RT-PCR, and complementation analysis. The SPs of TRAP clients exhibit above-average glycine-plus-proline content and below-average hydrophobicity as distinguishing features. Thus, TRAP may act as SP receptor on the ER membrane’s cytosolic face, recognizing precursor polypeptides with SPs of high glycine-plus-proline content and/or low hydrophobicity, and triggering substrate-specific opening of the Sec61 channel through interactions with the ER-lumenal hinge of Sec61α. While Sec61 enables ER import of all polypeptides with N-terminal signal peptides, only selected clients are accepted for TRAP-assisted ER import. Here, the authors use a proteomics approach to characterize TRAP-dependent clients, identifying signal peptide features that govern recognition by TRAP.
Affiliation(s)
- Duy Nguyen
  - Center for Bioinformatics, Saarland University, 66041, Saarbrücken, Germany
- Regine Stutz
  - Medical Biochemistry and Molecular Biology, Saarland University, 66421, Homburg, Germany
- Stefan Schorr
  - Medical Biochemistry and Molecular Biology, Saarland University, 66421, Homburg, Germany
- Sven Lang
  - Medical Biochemistry and Molecular Biology, Saarland University, 66421, Homburg, Germany
- Stefan Pfeffer
  - Max-Planck Institute of Biochemistry, Department of Molecular Structural Biology, 82152, Martinsried, Germany
- Hudson H Freeze
  - Sanford-Burnham-Prebys Medical Discovery Institute, La Jolla, CA, 92037, USA
- Friedrich Förster
  - Bijvoet Center for Biomolecular Research, Utrecht University, 3584, CH, Utrecht, The Netherlands
- Volkhard Helms
  - Center for Bioinformatics, Saarland University, 66041, Saarbrücken, Germany.
- Johanna Dudek
  - Medical Biochemistry and Molecular Biology, Saarland University, 66421, Homburg, Germany.
- Richard Zimmermann
  - Medical Biochemistry and Molecular Biology, Saarland University, 66421, Homburg, Germany.
70
Chen X, Chen C, Cai Y, Wang H, Ye Q. Kernel Sparse Representation with Hybrid Regularization for On-Road Traffic Sensor Data Imputation. SENSORS 2018; 18:s18092884. [PMID: 30200348 PMCID: PMC6163639 DOI: 10.3390/s18092884] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/01/2018] [Revised: 08/28/2018] [Accepted: 08/29/2018] [Indexed: 11/16/2022]
Abstract
The problem of missing values (MVs) in traffic sensor data analysis is universal in current intelligent transportation systems because of various reasons, such as sensor malfunction, transmission failure, etc. Accurate imputation of MVs is the foundation of subsequent data analysis tasks since most analysis algorithms need complete data as input. In this work, a novel MVs imputation approach termed as kernel sparse representation with elastic net regularization (KSR-EN) is developed for reconstructing MVs to facilitate analysis with traffic sensor data. The idea is to represent each sample as a linear combination of other samples due to inherent spatiotemporal correlation, as well as periodicity of daily traffic flow. To discover few yet correlated samples and make full use of the valuable information, a combination of l1-norm and l2-norm is employed to penalize the combination coefficients. Moreover, the linear representation among samples is extended to nonlinear representation by mapping input data space into high-dimensional feature space, which further enhances the recovery performance of our proposed approach. An efficient iterative algorithm is developed for solving KSR-EN model. The proposed method is verified on both an artificially simulated dataset and a public road network traffic sensor data. The results demonstrate the effectiveness of the proposed approach in terms of MVs imputation.
Affiliation(s)
- Xiaobo Chen
  - Automotive Engineering Research Institute, Jiangsu University, Zhenjiang 212013, China.
  - School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, China.
- Cheng Chen
  - School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, China.
- Yingfeng Cai
  - Automotive Engineering Research Institute, Jiangsu University, Zhenjiang 212013, China.
- Hai Wang
  - School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, China.
- Qiaolin Ye
  - College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China.
71
Chen X, Cai Y, Ye Q, Chen L, Li Z. Graph regularized local self-representation for missing value imputation with applications to on-road traffic sensor data. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.04.029] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/16/2022]
72
Urkup C, Bozkaya B, Salman FS. Customer mobility signatures and financial indicators as predictors in product recommendation. PLoS One 2018; 13:e0201197. [PMID: 30052681 PMCID: PMC6063431 DOI: 10.1371/journal.pone.0201197] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 11/16/2017] [Accepted: 07/10/2018] [Indexed: 11/19/2022] Open
Abstract
The rapid growth of mobile payment and geo-aware systems as well as the resulting emergence of Big Data present opportunities to explore individual consuming patterns across space and time. Here we analyze a one-year transaction dataset of a leading commercial bank to understand to what extent customer mobility behavior and financial indicators can predict the use of a target product, namely the Individual Consumer Loan product. After data preprocessing, we generate 13 datasets covering different time intervals and feature groups, and test combinations of 3 feature selection methods and 10 classification algorithms to determine, for each dataset, the best feature selection method and the most influential features, and the best classification algorithm. We observe the importance of spatio-temporal mobility features and financial features, in addition to demography, in predicting the use of this exemplary product with high accuracy (AUC = 0.942). Finally, we analyze the classification results and report on most interesting customer characteristics and product usage implications. Our findings can be used to potentially increase the success rates of product recommendation systems.
Affiliation(s)
- Cagan Urkup
  - Department of Industrial Engineering, Koç University, Istanbul, Turkey
- Burcin Bozkaya
  - School of Management, Sabancı University, Istanbul, Turkey
- F. Sibel Salman
  - Department of Industrial Engineering, Koç University, Istanbul, Turkey
73
Gong W, Kwak IY, Pota P, Koyano-Nakagawa N, Garry DJ. DrImpute: imputing dropout events in single cell RNA sequencing data. BMC Bioinformatics 2018; 19:220. [PMID: 29884114 PMCID: PMC5994079 DOI: 10.1186/s12859-018-2226-y] [Citation(s) in RCA: 187] [Impact Index Per Article: 26.7] [Received: 11/14/2017] [Accepted: 05/30/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The single cell RNA sequencing (scRNA-seq) technique has begun a new era by allowing the observation of gene expression at the single cell level. However, there is also a large amount of technical and biological noise. Because of the low number of RNA transcriptomes and the stochastic nature of the gene expression pattern, there is a high chance that nonzero entries are recorded as zero; these are called dropout events. RESULTS We develop DrImpute to impute dropout events in scRNA-seq data. We show that DrImpute has significantly better performance on the separation of the dropout zeros from true zeros than existing imputation algorithms. We also demonstrate that DrImpute can significantly improve the performance of existing tools for clustering, visualization and lineage reconstruction of nine published scRNA-seq datasets. CONCLUSIONS DrImpute can serve as a very useful addition to the currently existing statistical tools for single cell RNA-seq analysis. DrImpute is implemented in R and is available at https://github.com/gongx030/DrImpute .
Affiliation(s)
- Wuming Gong
  - Lillehei Heart Institute, University of Minnesota, 2231 6th St S.E, 4-165 CCRB, Minneapolis, MN 55114 USA
- Il-Youp Kwak
  - Lillehei Heart Institute, University of Minnesota, 2231 6th St S.E, 4-165 CCRB, Minneapolis, MN 55114 USA
- Pruthvi Pota
  - Lillehei Heart Institute, University of Minnesota, 2231 6th St S.E, 4-165 CCRB, Minneapolis, MN 55114 USA
- Naoko Koyano-Nakagawa
  - Lillehei Heart Institute, University of Minnesota, 2231 6th St S.E, 4-165 CCRB, Minneapolis, MN 55114 USA
- Daniel J. Garry
  - Lillehei Heart Institute, University of Minnesota, 2231 6th St S.E, 4-165 CCRB, Minneapolis, MN 55114 USA
74
Severson KA, Monian B, Love JC, Braatz RD. A method for learning a sparse classifier in the presence of missing data for high-dimensional biological datasets. Bioinformatics 2018; 33:2897-2905. [PMID: 28431087 DOI: 10.1093/bioinformatics/btx224] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 10/17/2016] [Accepted: 04/13/2017] [Indexed: 11/13/2022] Open
Abstract
Motivation This work addresses two common issues in building classification models for biological or medical studies: learning a sparse model, where only a subset of a large number of possible predictors is used, and training in the presence of missing data. This work focuses on supervised generative binary classification models, specifically linear discriminant analysis (LDA). The parameters are determined using an expectation maximization algorithm to both address missing data and introduce priors to promote sparsity. The proposed algorithm, expectation-maximization sparse discriminant analysis (EM-SDA), produces a sparse LDA model for datasets with and without missing data. Results EM-SDA is tested via simulations and case studies. In the simulations, EM-SDA is compared with nearest shrunken centroids (NSCs) and sparse discriminant analysis (SDA) with k-nearest neighbors for imputation for varying mechanism and amount of missing data. In three case studies using published biomedical data, the results are compared with NSC and SDA models with four different types of imputation, all of which are common approaches in the field. EM-SDA is more accurate and sparse than competing methods both with and without missing data in most of the experiments. Furthermore, the EM-SDA results are mostly consistent between the missing and full cases. Biological relevance of the resulting models, as quantified via a literature search, is also presented. Availability and implementation A Matlab implementation published under GNU GPL v.3 license is available at http://web.mit.edu/braatzgroup/links.html . Contact braatz@mit.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kristen A Severson
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Brinda Monian
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- J Christopher Love
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Richard D Braatz
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
75
van Gennip Y, Hunter B, Ma A, Moyer D, de Vera R, Bertozzi AL. Unsupervised record matching with noisy and incomplete data. INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS 2018. [DOI: 10.1007/s41060-018-0129-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
76
Wang A, Chen Y, An N, Yang J, Li L, Jiang L. Microarray Missing Value Imputation: A Regularized Local Learning Method. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 16:980-993. [PMID: 29994588 DOI: 10.1109/tcbb.2018.2810205] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 06/08/2023]
Abstract
Microarray experiments on gene expression inevitably generate missing values, which impedes further downstream biological analysis. Therefore, it is key to estimate the missing values accurately. Most of the existing imputation methods tend to suffer from the over-fitting problem. In this study, we propose two regularized local learning methods for microarray missing value imputation. Motivated by the grouping effect of L2 regularization, after selecting the target gene, we train an L2 Regularized Local Least Squares imputation model (RLLSimpute_L2) on the target gene and its neighbors to estimate the missing values of the target gene. Furthermore, RLLSimpute_L2 imputes the missing values in an ascending order based on the associated missing rate with each target gene. This contributes to fully utilizing the previously estimated values. Besides L2, we further explore L1 regularization and propose an L1 Regularized Local Least Squares imputation model (RLLSimpute_L1). To evaluate their effectiveness, we conducted extensive experimental studies on six benchmark datasets covering both time series and non-time series cases. Nine state-of-the-art imputation methods are compared with RLLSimpute_L2 and RLLSimpute_L1 in terms of three performance metrics. The comparative experimental results indicate that RLLSimpute_L2 outperforms its competitors by achieving smaller imputation errors and better structure preservation of differentially expressed genes.
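The idea of fitting an L2-regularized least squares model on a target gene and its neighbors can be sketched as follows. This is a simplified illustration, not the RLLSimpute_L2 algorithm itself: the function name `ridge_lls_impute`, the Euclidean neighbor selection, the requirement that candidate genes be complete, and the parameters `k` and `lam` are all assumptions for the example, and the ascending-missing-rate ordering described in the abstract is omitted.

```python
import numpy as np

def ridge_lls_impute(target, candidates, k=2, lam=0.1):
    """Impute NaNs in one gene's profile from its k nearest complete genes.

    Hypothetical sketch: target is 1-D (samples), candidates is genes x samples
    with no missing values, lam is the L2 (ridge) penalty.
    """
    target = np.asarray(target, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    obs = ~np.isnan(target)
    # Pick the k candidate genes closest to the target on its observed samples.
    dists = np.linalg.norm(candidates[:, obs] - target[obs], axis=1)
    nbrs = candidates[np.argsort(dists)[:k]]
    A = nbrs[:, obs].T   # observed samples x k neighbors
    b = target[obs]
    # Ridge solution: w = (A^T A + lam * I)^{-1} A^T b
    w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
    out = target.copy()
    # Predict the missing samples from the neighbors' values at those samples.
    out[~obs] = nbrs[:, ~obs].T @ w
    return out

target = np.array([1.0, 2.0, np.nan, 4.0])
candidates = np.array([[2.0, 4.0, 6.0, 8.0],
                       [1.0, 2.0, 3.0, 4.0],
                       [9.0, 1.0, 5.0, 2.0]])
print(ridge_lls_impute(target, candidates, k=1, lam=1e-6))
```

The penalty `lam` shrinks the regression weights toward zero, which is one common way to limit over-fitting when the number of observed samples is small relative to the number of neighbors.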
77
Aghdam R, Baghfalaki T, Khosravi P, Saberi Ansari E. The Ability of Different Imputation Methods to Preserve the Significant Genes and Pathways in Cancer. GENOMICS, PROTEOMICS & BIOINFORMATICS 2017; 15:396-404. [PMID: 29247873 PMCID: PMC5828654 DOI: 10.1016/j.gpb.2017.08.003] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 01/08/2017] [Revised: 07/18/2017] [Accepted: 08/08/2017] [Indexed: 11/23/2022]
Abstract
Deciphering important genes and pathways from incomplete gene expression data could facilitate a better understanding of cancer. Different imputation methods can be applied to estimate the missing values. In our study, we evaluated various imputation methods for their performance in preserving significant genes and pathways. In the first step, 5% of genes were selected at random for two types of missingness mechanism (ignorable and non-ignorable) with various missing rates. Next, 10 well-known imputation methods were applied to the complete datasets. The significance analysis of microarrays (SAM) method was applied to detect the significant genes in rectal and lung cancers to showcase the utility of imputation approaches in preserving significant genes. To determine the impact of different imputation methods on the identification of important genes, the chi-squared test was used to compare the proportions of overlaps between significant genes detected from original data and those detected from the imputed datasets. Additionally, the significant genes were tested for their enrichment in important pathways, using the ConsensusPathDB. Our results showed that almost all the significant genes and pathways of the original dataset can be detected in all imputed datasets, indicating that there is no significant difference in the performance of the various imputation methods tested. The source code and selected datasets are available on http://profiles.bs.ipm.ir/softwares/imputation_methods/.
Affiliation(s)
- Rosa Aghdam
  - School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran 19395-5746, Iran.
- Taban Baghfalaki
  - Department of Statistics, Faculty of Mathematical Sciences, Tarbiat Modares University, Tehran 14115-111, Iran
- Pegah Khosravi
  - School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran 19395-5746, Iran; Department of Physiology and Biophysics, Institute for Computational Biomedicine and Institute for Precision Medicine, Weill Cornell Medical College, New York, NY 10021, USA
- Elnaz Saberi Ansari
  - School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran 19395-5746, Iran; Institut Cochin, Inserm U1016, CNRS UMR 8104, Université Paris Descartes UMR-S1016, F-75014 Paris, France.
78
Taylor SL, Ruhaak LR, Kelly K, Weiss RH, Kim K. Effects of imputation on correlation: implications for analysis of mass spectrometry data from multiple biological matrices. Brief Bioinform 2017; 18:312-320. [PMID: 26896791 DOI: 10.1093/bib/bbw010] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 10/08/2015] [Indexed: 11/14/2022] Open
Abstract
With expanded access to, and decreased costs of, mass spectrometry, investigators are collecting and analyzing multiple biological matrices from the same subject such as serum, plasma, tissue and urine to enhance biomarker discoveries, understanding of disease processes and identification of therapeutic targets. Commonly, each biological matrix is analyzed separately, but multivariate methods such as MANOVAs that combine information from multiple biological matrices are potentially more powerful. However, mass spectrometric data typically contain large amounts of missing values, and imputation is often used to create complete data sets for analysis. The effects of imputation on multiple biological matrix analyses have not been studied. We investigated the effects of seven imputation methods (half minimum substitution, mean substitution, k-nearest neighbors, local least squares regression, Bayesian principal components analysis, singular value decomposition and random forest), on the within-subject correlation of compounds between biological matrices and its consequences on MANOVA results. Through analysis of three real omics data sets and simulation studies, we found the amount of missing data and imputation method to substantially change the between-matrix correlation structure. The magnitude of the correlations was generally reduced in imputed data sets, and this effect increased with the amount of missing data. Significant results from MANOVA testing also were substantially affected. In particular, the number of false positives increased with the level of missing data for all imputation methods. No one imputation method was universally the best, but the simple substitution methods (Half Minimum and Mean) consistently performed poorly.
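The two simple substitution methods named in this abstract are easy to state in code. The sketch below is illustrative only: the function names `half_min_impute` and `mean_impute` and the per-column (per-compound) convention are assumptions, not details taken from the study.

```python
import numpy as np

def half_min_impute(X):
    """Replace NaNs in each column with half of that column's observed minimum."""
    X = np.asarray(X, dtype=float).copy()
    fill = np.nanmin(X, axis=0) / 2.0
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill[cols]
    return X

def mean_impute(X):
    """Replace NaNs in each column with that column's observed mean."""
    X = np.asarray(X, dtype=float).copy()
    fill = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill[cols]
    return X

# Rows are samples, columns are compounds (an assumed layout).
X = np.array([[2.0, 10.0],
              [4.0, np.nan],
              [np.nan, 30.0]])
print(half_min_impute(X))
print(mean_impute(X))
```

Both methods insert the same constant for every missing entry in a column, which gives some intuition for why they can distort correlation structure as the amount of missing data grows.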
Affiliation(s)
- Sandra L Taylor
  - Division of Biostatistics, Department of Public Health Sciences, University of California School of Medicine, CA, USA
- L Renee Ruhaak
  - Department of Chemistry, University of California, CA, USA
- Karen Kelly
  - Division of Hematology and Oncology, University of California Davis Comprehensive Cancer Center, Sacramento, California, USA
- Robert H Weiss
  - Division of Nephrology, Department of Internal Medicine, University of California, CA, USA
- Kyoungmi Kim
  - Division of Biostatistics, Department of Public Health Sciences, University of California, California, USA
79
Armina R, Mohd Zain A, Ali NA, Sallehuddin R. A Review On Missing Value Estimation Using Imputation Algorithm. J Phys Conf Ser 2017. [DOI: 10.1088/1742-6596/892/1/012004] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 02/03/2023]
80
Ensemble correlation-based low-rank matrix completion with applications to traffic data imputation. Knowl Based Syst 2017. [DOI: 10.1016/j.knosys.2017.06.010] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Indexed: 11/18/2022]
81
Wang X, Shojaie A, Zhang Y, Shelley D, Lampe PD, Levy L, Peters U, Potter JD, White E, Lampe JW. Exploratory plasma proteomic analysis in a randomized crossover trial of aspirin among healthy men and women. PLoS One 2017; 12:e0178444. [PMID: 28542447 PMCID: PMC5444835 DOI: 10.1371/journal.pone.0178444] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/25/2016] [Accepted: 05/12/2017] [Indexed: 12/21/2022] Open
Abstract
Long-term use of aspirin is associated with lower risk of colorectal cancer and other cancers; however, the mechanism of the chemopreventive effect of aspirin is not fully understood. Animal studies suggest that COX-2, NFκB signaling and Wnt/β-catenin pathways may play a role, but no clinical trials have systematically evaluated the biological response to aspirin in healthy humans. Using a high-density antibody array, we assessed the difference in plasma protein levels after 60 days of regular-dose aspirin (325 mg/day) compared to placebo in a randomized double-blinded crossover trial of 44 healthy non-smoking men and women, aged 21-45 years. The plasma proteome was analyzed on an antibody microarray with ~3,300 full-length antibodies, printed in triplicate. Moderated paired t-tests were performed on individual antibodies, and gene-set analyses were performed based on KEGG and GO pathways. Among the 3,000 antibodies analyzed, statistically significant differences in plasma protein levels were observed for nine antibodies after adjusting for false discoveries (FDR-adjusted p-value < 0.1). The most significant protein was succinate dehydrogenase subunit C (SDHC), a key enzyme complex of the mitochondrial tricarboxylic acid (TCA) cycle. The other statistically significant proteins (NR2F1, MSI1, MYH1, FOXO1, KHDRBS3, NFKBIE, LYZ and IKZF1) are involved in multiple pathways, including DNA base-pair repair, inflammation and oncogenic pathways. None of the 258 KEGG and 1,139 GO pathways was found to be statistically significant after FDR adjustment. This study suggests several chemopreventive mechanisms of aspirin in humans, which have previously been reported to play a role in anti- or pro-carcinogenesis in cell systems; however, larger, confirmatory studies are needed.
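The abstract above reports FDR-adjusted p-values without naming the adjustment procedure. As a hedged illustration only, the sketch below assumes the common Benjamini-Hochberg step-up procedure; the function name `bh_adjust` and the four toy p-values are hypothetical and are not taken from the study.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed procedure, see lead-in)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by m / k.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

# Hypothetical per-antibody p-values from paired tests.
pvals = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(pvals)
significant = adjusted < 0.1  # threshold quoted in the abstract
print(adjusted, significant)
```

The adjusted values are returned in the input order, so they can be compared directly against a threshold such as the 0.1 cutoff quoted above.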
Affiliation(s)
- Xiaoliang Wang
- Department of Epidemiology, University of Washington, Seattle, Washington, United States of America
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - Ali Shojaie
- Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
| | - Yuzheng Zhang
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - David Shelley
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - Paul D. Lampe
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - Lisa Levy
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - Ulrike Peters
- Department of Epidemiology, University of Washington, Seattle, Washington, United States of America
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - John D. Potter
- Department of Epidemiology, University of Washington, Seattle, Washington, United States of America
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - Emily White
- Department of Epidemiology, University of Washington, Seattle, Washington, United States of America
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| | - Johanna W. Lampe
- Department of Epidemiology, University of Washington, Seattle, Washington, United States of America
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
| |
|
82
|
Park JG, Paul S, Briones N, Zeng J, Gillis K, Wallstrom G, LaBaer J, Amundson SA. Developing Human Radiation Biodosimetry Models: Testing Cross-Species Conversion Approaches Using an Ex Vivo Model System. Radiat Res 2017; 187:708-721. [PMID: 28328310 DOI: 10.1667/rr14655.1] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Indexed: 01/01/2023]
Abstract
In the event of a large-scale radiation exposure, accurate and quick assessment of radiation dose received would be critical for triage and medical treatment of large numbers of potentially exposed individuals. Current methods of biodosimetry, such as the dicentric chromosome assay, are time consuming and require sophisticated equipment and highly trained personnel. Therefore, scalable biodosimetry approaches, including gene expression profiles in peripheral blood cells, are being investigated. Due to the limited availability of appropriate human samples, biodosimetry development has relied heavily on mouse models, which are not directly applicable to human response. Therefore, to explore the feasibility of using non-human primate (NHP) models to build and test a biodosimetry algorithm for use in humans, we irradiated ex vivo peripheral blood samples from both humans and rhesus macaques with doses of 0, 2, 5, 6 and 7 Gy, and compared the gene expression profiles 24 h later using Agilent human microarrays. Among the dose-responsive genes in humans and non-human primates, 52 genes showed highly correlated expression patterns between the species, and were enriched in p53/DNA damage response, apoptosis and cell cycle-related genes. When these interspecies-correlated genes were used to build biodosimetry models using NHP data, the mean prediction accuracy on non-human primate samples was about 90% within 1 Gy of delivered dose in leave-one-out cross-validation. However, tests on human samples suggested that human gene expression values may need to be adjusted prior to application of the NHP model. A "multi-gene" approach utilizing all gene values for cross-species conversion and applying the converted values to the NHP biodosimetry models gave a leave-one-out cross-validation prediction accuracy for human samples highly comparable (up to 94%) to that for non-human primates.
Overall, this study demonstrates that a robust NHP biodosimetry model can be built using interspecies-correlated genes, and that, by using multiple regression-based cross-species conversion of expression values, absorbed dose in human samples can be accurately predicted by the NHP model.
Affiliation(s)
- Jin G Park
- a Biodesign Center for Personalized Diagnostic, Biodesign Institute, Arizona State University, Arizona
| | - Sunirmal Paul
- d Center for Radiological Research, Columbia University Medical Center, New York
| | - Natalia Briones
- a Biodesign Center for Personalized Diagnostic, Biodesign Institute, Arizona State University, Arizona
| | - Jia Zeng
- a Biodesign Center for Personalized Diagnostic, Biodesign Institute, Arizona State University, Arizona; b Department of Biomedical Informatics, Arizona State University, Arizona
| | - Kristin Gillis
- a Biodesign Center for Personalized Diagnostic, Biodesign Institute, Arizona State University, Arizona
| | - Garrick Wallstrom
- a Biodesign Center for Personalized Diagnostic, Biodesign Institute, Arizona State University, Arizona; b Department of Biomedical Informatics, Arizona State University, Arizona
| | - Joshua LaBaer
- a Biodesign Center for Personalized Diagnostic, Biodesign Institute, Arizona State University, Arizona; c School of Molecular Sciences, Arizona State University, Arizona
| | - Sally A Amundson
- d Center for Radiological Research, Columbia University Medical Center, New York
| |
|
83
|
|
84
|
Yu Z, Li T, Horng SJ, Pan Y, Wang H, Jing Y. An Iterative Locally Auto-Weighted Least Squares Method for Microarray Missing Value Estimation. IEEE Trans Nanobioscience 2017; 16:21-33. [PMID: 28114029 DOI: 10.1109/tnb.2016.2636243] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Indexed: 11/08/2022]
Abstract
Microarray data often contain missing values, which significantly affect subsequent analysis. Existing LLSimpute-based imputation methods for dealing with missing data have been shown to be generally efficient. However, none of the LLSimpute-based methods considers the differing importance of the target gene's neighbors in the missing value estimation process; they treat all neighbors equally. In this paper, a locally auto-weighted least squares imputation (LAW-LSimpute) method is proposed for missing value estimation, which can automatically weight the neighboring genes based on the importance of the genes. Then, an accelerating strategy is added to the LAW-LSimpute method in order to improve the convergence. Furthermore, an iterative missing value estimation framework of LAW-LSimpute (ILAW-LSimpute) is designed. Experimental results show that the ILAW-LSimpute method is able to reduce the estimation error.
|
85
|
Wu WS, Jhou MJ. MVIAeval: a web tool for comprehensively evaluating the performance of a new missing value imputation algorithm. BMC Bioinformatics 2017; 18:31. [PMID: 28086746 PMCID: PMC5237319 DOI: 10.1186/s12859-016-1429-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 05/15/2016] [Accepted: 12/15/2016] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Missing value imputation is important for microarray data analyses because microarray data with missing values would significantly degrade the performance of the downstream analyses. Although many microarray missing value imputation algorithms have been developed, an objective and comprehensive performance comparison framework is still lacking. To solve this problem, we previously proposed a framework which can perform a comprehensive performance comparison of different existing algorithms. The performance of a new algorithm can also be evaluated by our performance comparison framework. However, constructing our framework is not an easy task for interested researchers. To save researchers' time and effort, here we present an easy-to-use web tool named MVIAeval (Missing Value Imputation Algorithm evaluator) which implements our performance comparison framework. RESULTS MVIAeval provides a user-friendly interface allowing users to upload the R code of their new algorithm and select (i) the test datasets among 20 benchmark microarray (time series and non-time series) datasets, (ii) the compared algorithms among 12 existing algorithms, (iii) the performance indices from three existing ones, (iv) the comprehensive performance scores from two possible choices, and (v) the number of simulation runs. The comprehensive performance comparison results are then generated and shown as both figures and tables. CONCLUSIONS MVIAeval is a useful tool for researchers to easily conduct a comprehensive and objective performance evaluation of their newly developed missing value imputation algorithm for microarray data or any data which can be represented in matrix form (e.g. NGS data or proteomics data). Thus, MVIAeval will greatly expedite the progress in the research of missing value imputation algorithms.
Affiliation(s)
- Wei-Sheng Wu
- Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan.
| | - Meng-Jhun Jhou
- Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan
| |
|
86
|
Pietrocola F, Demont Y, Castoldi F, Enot D, Durand S, Semeraro M, Baracco EE, Pol J, Bravo-San Pedro JM, Bordenave C, Levesque S, Humeau J, Chery A, Métivier D, Madeo F, Maiuri MC, Kroemer G. Metabolic effects of fasting on human and mouse blood in vivo. Autophagy 2017; 13:567-578. [PMID: 28059587 DOI: 10.1080/15548627.2016.1271513] [Citation(s) in RCA: 65] [Impact Index Per Article: 8.1] [Indexed: 01/22/2023] Open
Abstract
Starvation is a strong physiological stimulus of macroautophagy/autophagy. In this study, we addressed the question as to whether it would be possible to measure autophagy in blood cells after nutrient deprivation. Fasting of mice for 48 h (which causes ∼20% weight loss) or starvation of human volunteers for up to 4 d (which causes <2% weight loss) provokes major changes in the plasma metabolome, yet induces only relatively minor alterations in the intracellular metabolome of circulating leukocytes. White blood cells from mice and human volunteers responded to fasting with a marked reduction in protein lysine acetylation, affecting both nuclear and cytoplasmic compartments. In circulating leukocytes from mice that underwent 48-h fasting, an increase in LC3B lipidation (as assessed by immunoblotting and immunofluorescence) only became detectable if the protease inhibitor leupeptin was injected 2 h before drawing blood. Consistently, measurement of an enhanced autophagic flux was only possible if white blood cells from starved human volunteers were cultured in the presence or absence of leupeptin. Whereas all murine leukocyte subpopulations significantly increased the number of LC3B+ puncta per cell in response to nutrient deprivation, only neutrophils from starved volunteers showed signs of activated autophagy (as determined by a combination of multi-color immunofluorescence, cytofluorometry and image analysis). Altogether, these results suggest that white blood cells are suitable for monitoring autophagic flux. In addition, we propose that the evaluation of protein acetylation in circulating leukocytes can be adopted as a biochemical marker of organismal energetic status.
Affiliation(s)
- Federico Pietrocola
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; b Université Paris Descartes, Sorbonne Paris Cité, Paris, France; c Université Pierre et Marie Curie, Paris, France; d Metabolomics and Cell Biology Platforms, Gustave Roussy Comprehensive Cancer Institute, Villejuif, France
| | - Yohann Demont
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; b Université Paris Descartes, Sorbonne Paris Cité, Paris, France; c Université Pierre et Marie Curie, Paris, France
| | - Francesca Castoldi
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; b Université Paris Descartes, Sorbonne Paris Cité, Paris, France; c Université Pierre et Marie Curie, Paris, France; d Metabolomics and Cell Biology Platforms, Gustave Roussy Comprehensive Cancer Institute, Villejuif, France; f Sotio a.c., Prague, Czech Republic
| | - David Enot
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; d Metabolomics and Cell Biology Platforms, Gustave Roussy Comprehensive Cancer Institute, Villejuif, France
| | - Sylvère Durand
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; d Metabolomics and Cell Biology Platforms, Gustave Roussy Comprehensive Cancer Institute, Villejuif, France
| | - Michaela Semeraro
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; e Centre d'Investigation Clinique-Unité de Recherche Clinique Paris Centre Necker-Cochin, Assistance Publique Hôpitaux de Paris, France
| | - Elisa Elena Baracco
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; b Université Paris Descartes, Sorbonne Paris Cité, Paris, France; c Université Pierre et Marie Curie, Paris, France; d Metabolomics and Cell Biology Platforms, Gustave Roussy Comprehensive Cancer Institute, Villejuif, France
| | - Jonathan Pol
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; b Université Paris Descartes, Sorbonne Paris Cité, Paris, France; c Université Pierre et Marie Curie, Paris, France; d Metabolomics and Cell Biology Platforms, Gustave Roussy Comprehensive Cancer Institute, Villejuif, France
| | - Jose Manuel Bravo-San Pedro
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; b Université Paris Descartes, Sorbonne Paris Cité, Paris, France; c Université Pierre et Marie Curie, Paris, France; d Metabolomics and Cell Biology Platforms, Gustave Roussy Comprehensive Cancer Institute, Villejuif, France
| | - Chloé Bordenave
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; d Metabolomics and Cell Biology Platforms, Gustave Roussy Comprehensive Cancer Institute, Villejuif, France
| | - Sarah Levesque
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; b Université Paris Descartes, Sorbonne Paris Cité, Paris, France; c Université Pierre et Marie Curie, Paris, France; d Metabolomics and Cell Biology Platforms, Gustave Roussy Comprehensive Cancer Institute, Villejuif, France
| | - Juliette Humeau
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; b Université Paris Descartes, Sorbonne Paris Cité, Paris, France; c Université Pierre et Marie Curie, Paris, France; d Metabolomics and Cell Biology Platforms, Gustave Roussy Comprehensive Cancer Institute, Villejuif, France
| | - Alexis Chery
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; d Metabolomics and Cell Biology Platforms, Gustave Roussy Comprehensive Cancer Institute, Villejuif, France
| | - Didier Métivier
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; b Université Paris Descartes, Sorbonne Paris Cité, Paris, France; c Université Pierre et Marie Curie, Paris, France
| | - Frank Madeo
- g Institute of Molecular Biosciences, NAWI Graz, University of Graz, Graz, Austria; h BioTechMed-Graz, Graz, Austria
| | - M Chiara Maiuri
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; b Université Paris Descartes, Sorbonne Paris Cité, Paris, France; c Université Pierre et Marie Curie, Paris, France; d Metabolomics and Cell Biology Platforms, Gustave Roussy Comprehensive Cancer Institute, Villejuif, France
| | - Guido Kroemer
- a Equipe 11 labellisée Ligue contre le Cancer, Centre de Recherche des Cordeliers, INSERM U 1138, Paris, France; b Université Paris Descartes, Sorbonne Paris Cité, Paris, France; c Université Pierre et Marie Curie, Paris, France; d Metabolomics and Cell Biology Platforms, Gustave Roussy Comprehensive Cancer Institute, Villejuif, France; i Pôle de Biologie, Hôpital Européen Georges Pompidou, AP-HP, Paris, France; j Karolinska Institute, Department of Women's and Children's Health, Karolinska University Hospital, Stockholm, Sweden
| |
|
87
|
Gibert K, Sànchez–Marrè M, Izquierdo J. A survey on pre-processing techniques: Relevant issues in the context of environmental data mining. AI COMMUN 2016. [DOI: 10.3233/aic-160710] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Indexed: 11/15/2022]
Affiliation(s)
- Karina Gibert
- Knowledge Engineering and Machine Learning Group, Department of Statistics and Operation Research, Universitat Politècnica de Catalunya-BarcelonaTech, Barcelona, Catalonia, Spain
| | - Miquel Sànchez–Marrè
- Knowledge Engineering and Machine Learning Group, Computer Science Department, Universitat Politècnica de Catalunya-BarcelonaTech, Barcelona, Catalonia, Spain
| | | |
|
88
|
|
89
|
Cai T, Cai TT, Zhang A. Structured Matrix Completion with Applications to Genomic Data Integration. J Am Stat Assoc 2016; 111:621-633. [PMID: 28042188 PMCID: PMC5198844 DOI: 10.1080/01621459.2015.1021005] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 04/01/2014] [Revised: 01/01/2015] [Indexed: 10/23/2022]
Abstract
Matrix completion has attracted significant recent attention in many fields including statistics, applied mathematics and electrical engineering. Current literature on matrix completion focuses primarily on independent sampling models under which the individual observed entries are sampled independently. Motivated by applications in genomic data integration, we propose a new framework of structured matrix completion (SMC) to treat structured missingness by design. Specifically, our proposed method aims at efficient matrix recovery when a subset of the rows and columns of an approximately low-rank matrix are observed. We provide theoretical justification for the proposed SMC method and derive a lower bound for the estimation errors, which together establish the optimal rate of recovery over certain classes of approximately low-rank matrices. Simulation studies show that the method performs well in finite samples under a variety of configurations. The method is applied to integrate several ovarian cancer genomic studies with different extents of genomic measurements, which enables us to construct more accurate prediction rules for ovarian cancer survival.
Affiliation(s)
- Tianxi Cai
- Professor of Biostatistics, Department of Biostatistics, Harvard University, Boston, MA
| | - T Tony Cai
- Dorothy Silberberg Professor of Statistics, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA
| | - Anru Zhang
- Student, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA
| |
|
90
|
Chen Y, Wang A, Ding H, Que X, Li Y, An N, Jiang L. A global learning with local preservation method for microarray data imputation. Comput Biol Med 2016; 77:76-89. [PMID: 27522236 DOI: 10.1016/j.compbiomed.2016.08.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 04/13/2016] [Revised: 08/04/2016] [Accepted: 08/04/2016] [Indexed: 12/28/2022]
Abstract
Microarray data suffer from missing values for various reasons, including insufficient resolution, image noise, and experimental errors. Because missing values can hinder downstream analysis steps that require complete data as input, it is crucial to be able to estimate the missing values. In this study, we propose a Global Learning with Local Preservation method (GL2P) for imputation of missing values in microarray data. GL2P consists of two components: a local similarity measurement module and a global weighted imputation module. The former uses a local structure preservation scheme to exploit as much information as possible from the observable data, and the latter is responsible for estimating the missing values of a target gene by considering all of its neighbors rather than a subset of them. Furthermore, GL2P imputes the missing values in ascending order according to the rate of missing data for each target gene to fully utilize previously estimated values. To validate the proposed method, we conducted extensive experiments on six benchmarked microarray datasets. We compared GL2P with eight state-of-the-art imputation methods in terms of four performance metrics. The experimental results indicate that GL2P outperforms its competitors in terms of imputation accuracy and better preserves the structure of differentially expressed genes. In addition, GL2P is less sensitive to the number of neighbors than other local learning-based imputation methods.
Affiliation(s)
- Ye Chen
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China.
| | - Aiguo Wang
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China; School of Software, Hefei University of Technology, Hefei 230009, China.
| | - Huitong Ding
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China.
| | - Xia Que
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China.
| | - Yabo Li
- College of Life Sciences, Lanzhou University, Lanzhou 730000, China.
| | - Ning An
- School of Computer and Information, Hefei University of Technology, Hefei 230009, China.
| | - Lili Jiang
- Department of Computing Science, Umeå University, Umeå 90187, Sweden.
| |
|
91
|
Lin D, Zhang J, Li J, Xu C, Deng HW, Wang YP. An integrative imputation method based on multi-omics datasets. BMC Bioinformatics 2016; 17:247. [PMID: 27329642 PMCID: PMC4915152 DOI: 10.1186/s12859-016-1122-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 09/30/2015] [Accepted: 06/05/2016] [Indexed: 12/26/2022] Open
Abstract
Background Integrative analysis of multi-omics data is becoming increasingly important to unravel functional mechanisms of complex diseases. However, the currently available multi-omics datasets inevitably suffer from missing values due to technical limitations and various constraints in experiments. These missing values severely hinder integrative analysis of multi-omics data. Current imputation methods mainly focus on using single omics data while ignoring biological interconnections and information embedded in multi-omics data sets. Results In this study, a novel multi-omics imputation method was proposed to integrate multiple correlated omics datasets for improving the imputation accuracy. Our method was designed to: 1) combine the estimates of missing values from the individual omics data itself as well as from other omics, and 2) simultaneously impute multiple missing omics datasets by an iterative algorithm. We compared our method with five imputation methods using single omics data at different noise levels, sample sizes and data missing rates. The results demonstrated the advantage and efficiency of our method, consistently in terms of the imputation error and the recovery of mRNA-miRNA network structure. Conclusions We concluded that our proposed imputation method can utilize more biological information to minimize the imputation error and thus can improve the performance of downstream analysis such as genetic regulatory network construction.
Affiliation(s)
- Dongdong Lin
- Department of Biomedical Engineering, Tulane University, New Orleans, LA, 70118, USA; Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, 70112, USA
| | - Jigang Zhang
- Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, 70112, USA; Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, 70112, USA
| | - Jingyao Li
- Department of Biomedical Engineering, Tulane University, New Orleans, LA, 70118, USA; Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, 70112, USA
| | - Chao Xu
- Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, 70112, USA; Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, 70112, USA
| | - Hong-Wen Deng
- Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, 70112, USA; Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, 70112, USA
| | - Yu-Ping Wang
- Department of Biomedical Engineering, Tulane University, New Orleans, LA, 70118, USA; Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA, 70112, USA; Department of Biostatistics and Bioinformatics, Tulane University, New Orleans, LA, 70112, USA
| |
|
92
|
Kapur A, Marwah K, Alterovitz G. Gene expression prediction using low-rank matrix completion. BMC Bioinformatics 2016; 17:243. [PMID: 27317252 PMCID: PMC4912738 DOI: 10.1186/s12859-016-1106-6] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 11/20/2015] [Accepted: 05/28/2016] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND An exponential growth of high-throughput biological information and data has occurred in the past decade, supported by technologies such as microarrays and RNA-Seq. Most data generated using such methods are used to encode large amounts of rich information, and determine diagnostic and prognostic biomarkers. Although data storage costs have fallen, the process of capturing data using the aforementioned technologies is still expensive. Moreover, the time required for the assay, from sample preparation to raw value measurement, is excessive (on the order of days). There is an opportunity to reduce both the cost and time for generating such expression datasets. RESULTS We propose a framework in which complete gene expression values can be reliably predicted in-silico from partial measurements. This is achieved by modelling expression data as a low-rank matrix and then applying recently discovered techniques of matrix completion by using nonlinear convex optimisation. We evaluated prediction of gene expression data based on 133 studies, sourced from a combined total of 10,921 samples. It is shown that such datasets can be constructed with a low relative error even at high missing value rates (>50%), and that such predicted datasets can be reliably used as surrogates for further analysis. CONCLUSION This method has potentially far-reaching applications including how bio-medical data is sourced and generated, and transcriptomic prediction by optimisation. We show that gene expression data can be computationally constructed, thereby potentially reducing the costs of gene expression profiling. In conclusion, this method shows great promise of opening new avenues in research on low-rank matrix completion in biological sciences.
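To make the low-rank idea in the abstract above concrete, here is a minimal sketch that fills missing entries by iterative singular-value soft-thresholding. This is a generic stand-in chosen for illustration, not the solver used in the cited paper; the function name `complete_low_rank`, the shrinkage parameter `lam`, and the toy matrix are all assumptions.

```python
import numpy as np

def complete_low_rank(x, lam=0.1, n_iter=200, tol=1e-6):
    """Fill NaN entries of x using iterative singular-value soft-thresholding."""
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    z = np.zeros_like(x)  # current low-rank estimate
    for _ in range(n_iter):
        # Keep the observed values; use the current estimate where data are missing.
        work = np.where(observed, x, z)
        u, s, vt = np.linalg.svd(work, full_matrices=False)
        s = np.maximum(s - lam, 0.0)  # shrink singular values toward zero
        z_new = (u * s) @ vt
        converged = np.linalg.norm(z_new - z) <= tol * max(np.linalg.norm(z), 1.0)
        z = z_new
        if converged:
            break
    # Observed entries are returned unchanged; only the gaps are filled.
    return np.where(observed, x, z)

# Hypothetical rank-1 "expression" matrix (genes x samples) with one gap.
truth = np.outer([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])
masked = truth.copy()
masked[0, 2] = np.nan
completed = complete_low_rank(masked)
print(completed)
```

Soft-thresholding biases the filled values toward zero, so on a matrix this small the estimate will not match the hidden entry exactly; `lam` trades that bias against the rank of the fit.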
Affiliation(s)
- Arnav Kapur
- Biomedical Cybernetics Laboratory, Harvard Medical School, Boston, MA 02115, USA
| | - Kshitij Marwah
- Biomedical Cybernetics Laboratory, Harvard Medical School, Boston, MA 02115, USA
| | - Gil Alterovitz
- Biomedical Cybernetics Laboratory, Harvard Medical School, Boston, MA 02115, USA
- Department of Health Science and Technology, Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| |
|
93
|
McCoin CS, Piccolo BD, Knotts TA, Matern D, Vockley J, Gillingham MB, Adams SH. Unique plasma metabolomic signatures of individuals with inherited disorders of long-chain fatty acid oxidation. J Inherit Metab Dis 2016; 39:399-408. [PMID: 26907176 PMCID: PMC4851894 DOI: 10.1007/s10545-016-9915-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 07/31/2015] [Revised: 01/09/2016] [Accepted: 01/22/2016] [Indexed: 01/29/2023]
Abstract
Blood and urine acylcarnitine profiles are commonly used to diagnose long-chain fatty acid oxidation disorders (FAOD: i.e., long-chain hydroxy-acyl-CoA dehydrogenase [LCHAD] and carnitine palmitoyltransferase 2 [CPT2] deficiency), but the global metabolic impact of long-chain FAOD has not been reported. We utilized untargeted metabolomics to characterize plasma metabolites in 12 overnight-fasted individuals with FAOD (10 LCHAD, two CPT2) and 11 healthy age-, sex-, and body mass index (BMI)-matched controls, with the caveat that individuals with FAOD consume a low-fat diet supplemented with medium-chain triglycerides (MCT) while matched controls consume a typical American diet. In plasma 832 metabolites were identified, and partial least squares-discriminant analysis (PLS-DA) identified 114 non-acylcarnitine variables that discriminated FAOD subjects and controls. FAOD individuals had significantly higher triglycerides and lower specific phosphatidylethanolamines, ceramides, and sphingomyelins. Differences in phosphatidylcholines were also found but the directionality differed by metabolite species. Further, there were few differences in non-lipid metabolites, indicating the metabolic impact of FAOD specifically on lipid pathways. This analysis provides evidence that LCHAD/CPT2 deficiency significantly alters complex lipid pathway flux. This metabolic signature may provide new clinical tools capable of confirming or diagnosing FAOD, even in subjects with a mild phenotype, and may provide clues regarding the biochemical and metabolic impact of FAOD that is relevant to the etiology of FAOD symptoms.
Affiliation(s)
- Colin S McCoin
  - Molecular, Cellular and Integrative Physiology Graduate Group, University of California, Davis, CA, USA
- Brian D Piccolo
  - Arkansas Children's Nutrition Center and Department of Pediatrics, University of Arkansas for Medical Sciences, 15 Children's Way, Little Rock, AR, 72202, USA
- Trina A Knotts
  - Department of Molecular Biosciences, School of Veterinary Medicine, University of California, Davis, CA, USA
- Dietrich Matern
  - Biochemical Genetics Laboratory, Mayo Clinic, Rochester, MN, USA
- Jerry Vockley
  - Department of Pediatrics, School of Medicine, Children's Hospital of Pittsburgh, University of Pittsburgh, Pittsburgh, PA, USA
  - Department of Human Genetics, Graduate School of Public Health, Pittsburgh, PA, USA
- Melanie B Gillingham
  - Department of Molecular & Medical Genetics and Graduate Programs in Human Nutrition, Oregon Health & Science University, Portland, OR, USA
- Sean H Adams
  - Molecular, Cellular and Integrative Physiology Graduate Group, University of California, Davis, CA, USA
  - Arkansas Children's Nutrition Center and Department of Pediatrics, University of Arkansas for Medical Sciences, 15 Children's Way, Little Rock, AR, 72202, USA
|
94
|
Zhang G, Huang KC, Xu Z, Tzeng JY, Conneely KN, Guan W, Kang J, Li Y. Across-Platform Imputation of DNA Methylation Levels Incorporating Nonlocal Information Using Penalized Functional Regression. Genet Epidemiol 2016; 40:333-40. [PMID: 27061717 PMCID: PMC4862742 DOI: 10.1002/gepi.21969] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 09/17/2015] [Revised: 02/03/2016] [Accepted: 02/18/2016] [Indexed: 12/28/2022]
Abstract
DNA methylation is a key epigenetic mark involved in both normal development and disease progression. Recent advances in high-throughput technologies have enabled genome-wide profiling of DNA methylation. However, DNA methylation profiling often employs different designs and platforms with varying resolution, which hinders joint analysis of methylation data from multiple platforms. In this study, we propose a penalized functional regression model to impute missing methylation data. By incorporating functional predictors, our model utilizes information from nonlocal probes to improve imputation quality. Here, we compared the performance of our functional model to linear regression and the best single probe surrogate in real data and via simulations. Specifically, we applied different imputation approaches to an acute myeloid leukemia dataset consisting of 194 samples and our method showed higher imputation accuracy, manifested, for example, by a 94% relative increase in information content and up to 86% more CpG sites passing post-imputation filtering. Our simulated association study further demonstrated that our method substantially improves the statistical power to identify trait-associated methylation loci. These findings indicate that the penalized functional regression model is a convenient and valuable imputation tool for methylation data, and it can boost statistical power in downstream epigenome-wide association study (EWAS).
Affiliation(s)
- Guosheng Zhang
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, United States of America
  - Curriculum in Bioinformatics and Computational Biology, University of North Carolina, Chapel Hill, North Carolina, United States of America
  - Department of Statistics, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Kuan-Chieh Huang
  - Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Zheng Xu
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, United States of America
  - Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina, United States of America
  - Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Jung-Ying Tzeng
  - Department of Statistics, Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, United States of America
- Karen N Conneely
  - Department of Human Genetics, School of Medicine, Emory University, Atlanta, Georgia, United States of America
- Weihua Guan
  - Division of Biostatistics, School of Public Health, University of Minnesota, Minnesota, United States of America
- Jian Kang
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of America
- Yun Li
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina, United States of America
  - Curriculum in Bioinformatics and Computational Biology, University of North Carolina, Chapel Hill, North Carolina, United States of America
  - Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina, United States of America
  - Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina, United States of America
|
95
|
Yang Y, Xu Z, Song D. Missing value imputation for microRNA expression data by using a GO-based similarity measure. BMC Bioinformatics 2016; 17 Suppl 1:10. [PMID: 26818962 PMCID: PMC4895707 DOI: 10.1186/s12859-015-0853-0] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Missing values are commonly present in microarray data profiles. Rather than discarding genes or samples with incomplete expression levels, missing values need to be properly imputed for accurate data analysis. Imputation methods can be roughly categorized as expression level-based and domain knowledge-based. Methods of the first type rely only on expression data without the help of external data sources, while methods of the second type incorporate available domain knowledge into expression data to improve imputation accuracy. In recent years, microRNA (miRNA) microarrays have been widely developed and used for identifying miRNA biomarkers in complex human disease studies. Similar to mRNA profiles, miRNA expression profiles with missing values can be treated with the existing imputation methods. However, domain knowledge-based methods are hard to apply because of the lack of direct functional annotation for miRNAs. With the rapid accumulation of miRNA microarray data, there is an increasing need to develop domain knowledge-based imputation algorithms specific to miRNA expression profiles to improve the quality of miRNA data analysis. RESULTS We connect miRNAs with domain knowledge of Gene Ontology (GO) via their target genes, and define miRNA functional similarity based on the semantic similarity of GO terms in GO graphs. A new measure combining miRNA functional similarity and expression similarity is used in the imputation of missing values. The new measure is tested on two miRNA microarray datasets from breast cancer research and achieves improved performance compared with the expression-based method on both datasets. CONCLUSIONS The experimental results demonstrate that biological domain knowledge can benefit the estimation of missing values in miRNA profiles as well as mRNA profiles. In particular, functional similarity defined by GO terms annotated for the target genes of miRNAs can be useful complementary information for the expression-based method to improve the imputation accuracy of miRNA array data. Our method and data are available to the public upon request.
Affiliation(s)
- Yang Yang
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dongchuan Rd., Shanghai, 200240, China
  - Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, 200240, China
- Zhuangdi Xu
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dongchuan Rd., Shanghai, 200240, China
- Dandan Song
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
|
96
|
|
97
|
Gao L, Pei G, Chen L, Zhang W. A global network-based protocol for functional inference of hypothetical proteins in Synechocystis sp. PCC 6803. J Microbiol Methods 2015; 116:44-52. [DOI: 10.1016/j.mimet.2015.06.013] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/19/2015] [Revised: 06/24/2015] [Accepted: 06/25/2015] [Indexed: 01/15/2023]
|
98
|
Automatic instance selection via locality constrained sparse representation for missing value estimation. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.05.007] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 11/17/2022]
|
99
|
Haakensen VD, Steinfeld I, Saldova R, Shehni AA, Kifer I, Naume B, Rudd PM, Børresen-Dale AL, Yakhini Z. Serum N-glycan analysis in breast cancer patients--Relation to tumour biology and clinical outcome. Mol Oncol 2015; 10:59-72. [PMID: 26321095 DOI: 10.1016/j.molonc.2015.08.002] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Received: 12/19/2014] [Revised: 08/02/2015] [Accepted: 08/03/2015] [Indexed: 12/13/2022] Open
Abstract
Glycosylation and related processes play important roles in cancer development and progression, including metastasis. Several studies have shown that N-glycans have potential diagnostic value as cancer serum biomarkers. We have explored the significance of the abundance of particular serum N-glycan structures as important features of breast tumour biology by studying the serum glycome and tumour transcriptome (mRNA and miRNA) of 104 breast cancer patients. Integration of these types of molecular data allows us to study the relationship between serum glycans and transcripts representing functional pathways, such as metabolic pathways or DNA damage response. We identified triantennary trigalactosylated trisialylated glycans in serum as being associated with lower levels of tumour transcripts involved in focal adhesion and integrin-mediated cell adhesion. These glycan structures were also linked to poor prognosis in patients with ER-negative tumours. High abundance of simple monoantennary glycan structures was associated with increased survival, particularly in the basal-like subgroup. The presence of circulating tumour cells was found to be significantly associated with several serum glycome structures, such as bi- and triantennary, di- and trigalactosylated, and di- and trisialylated glycans. The link between tumour miRNA expression levels and N-glycan production is also examined.
Affiliation(s)
- Vilde D Haakensen
  - Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Oslo, Norway; The K.G. Jebsen Center for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
- Israel Steinfeld
  - Department of Computer Science, Technion, Haifa, Israel; Agilent Laboratories, Agilent Technologies, Tel-Aviv, Israel
- Radka Saldova
  - NIBRT GlycoScience Group, National Institute for Bioprocessing Research and Training, Fosters Avenue, Mount Merrion, Blackrock, Dublin 4, Ireland
- Akram Asadi Shehni
  - NIBRT GlycoScience Group, National Institute for Bioprocessing Research and Training, Fosters Avenue, Mount Merrion, Blackrock, Dublin 4, Ireland
- Ilona Kifer
  - Agilent Laboratories, Agilent Technologies, Tel-Aviv, Israel
- Bjørn Naume
  - Department of Oncology, Oslo University Hospital, The Norwegian Radium Hospital, Oslo, Norway
- Pauline M Rudd
  - NIBRT GlycoScience Group, National Institute for Bioprocessing Research and Training, Fosters Avenue, Mount Merrion, Blackrock, Dublin 4, Ireland
- Anne-Lise Børresen-Dale
  - Department of Genetics, Institute for Cancer Research, Oslo University Hospital, The Norwegian Radium Hospital, Oslo, Norway; The K.G. Jebsen Center for Breast Cancer Research, Institute for Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
- Zohar Yakhini
  - Department of Computer Science, Technion, Haifa, Israel; Agilent Laboratories, Agilent Technologies, Tel-Aviv, Israel
|
100
|
Li H, Zhao C, Shao F, Li GZ, Wang X. A hybrid imputation approach for microarray missing value estimation. BMC Genomics 2015; 16 Suppl 9:S1. [PMID: 26330180 PMCID: PMC4547405 DOI: 10.1186/1471-2164-16-s9-s1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Missing data is an inevitable phenomenon in gene expression microarray experiments due to instrument failure or human error. It has a negative impact on the performance of downstream analysis. Technically, most existing approaches suffer from this prevalent problem. Imputation is one of the frequently used methods for processing missing data. Many advances have been made in research on estimating missing values. The remaining challenge is to improve imputation accuracy for data with a large missing rate. METHODS In this paper, inspired by the idea of collaborative training, we propose a novel hybrid imputation method, called Recursive Mutual Imputation (RMI). Specifically, RMI exploits global correlation information and local structure in the data, captured by two popular methods, Bayesian Principal Component Analysis (BPCA) and Local Least Squares (LLS), respectively. The mutual strategy is implemented by sharing the estimated data sequences at each recursive step. Meanwhile, we consider the imputation order based on the number of missing entries in the target gene. Furthermore, a weight-based integration method is utilized in the final assembling step. RESULTS We compare RMI with three state-of-the-art algorithms (BPCA, LLS, and Iterated Local Least Squares imputation (ItrLLS)) on four publicly available microarray datasets. Experimental results clearly demonstrate that RMI significantly outperforms the comparative methods in terms of Normalized Root Mean Square Error (NRMSE), especially for datasets with large missing rates and fewer complete genes. CONCLUSIONS Our proposed hybrid imputation approach incorporates both global and local information of microarray genes and achieves lower NRMSE values than any single approach alone. In addition, this study highlights the need to consider the imputation order of missing entries in imputation methods.
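The abstract above reports results as NRMSE but does not give the formula. A minimal sketch of one common definition from the microarray imputation literature follows: root mean squared error over the masked entries, divided by the standard deviation of the true values. The function name `nrmse` and the choice of population variance are assumptions for illustration, not taken from the paper.

```python
import math
from statistics import pvariance


def nrmse(true_values, imputed_values):
    """Normalized root mean square error over entries that were masked and imputed.

    Assumed definition: sqrt(mean squared error / population variance of the
    true values). The cited paper may normalize differently.
    """
    if len(true_values) != len(imputed_values):
        raise ValueError("inputs must have the same length")
    n = len(true_values)
    # Mean squared difference between the held-out truth and the imputed guess.
    mse = sum((t - g) ** 2 for t, g in zip(true_values, imputed_values)) / n
    # Population variance of the truth; raises StatisticsError on empty input.
    return math.sqrt(mse / pvariance(true_values))
```

Under this definition a perfect imputation scores 0, and filling every masked entry with the mean of the true values scores 1, so lower is better and values near 1 indicate little gain over mean imputation.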
|