1
Martínez-García M, Hernández-Lemus E. Data Integration Challenges for Machine Learning in Precision Medicine. Front Med (Lausanne) 2022; 8:784455. [PMID: 35145977 PMCID: PMC8821900 DOI: 10.3389/fmed.2021.784455] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 09/27/2021] [Accepted: 12/28/2021] [Indexed: 12/19/2022] Open
Abstract
A main goal of Precision Medicine is to incorporate and integrate the vast corpora held in different databases about the molecular and environmental origins of disease into analytic frameworks, allowing the development of individualized, context-dependent diagnostics and therapeutic approaches. In this regard, artificial intelligence and machine learning approaches can be used to build analytical models of complex disease aimed at prediction of personalized health conditions and outcomes. Such models must handle the wide heterogeneity of individuals in both their genetic predisposition and their social and environmental determinants. Computational approaches to medicine need to be able to efficiently manage, visualize, and integrate large datasets combining structured and unstructured formats. This needs to be done under different levels of confidentiality, ideally within a unified analytical architecture. Efficient data integration and management is key to the successful application of computational intelligence approaches to medicine. A number of challenges arise in the design of successful approaches to medical data analytics under the currently demanding performance conditions of personalized medicine, while also subject to time, computational power, and bioethical constraints. Here, we review some of these constraints and discuss possible avenues to overcome current challenges.
Affiliation(s)
- Mireya Martínez-García
  - Clinical Research Division, National Institute of Cardiology ‘Ignacio Chávez’, Mexico City, Mexico
- Enrique Hernández-Lemus
  - Computational Genomics Division, National Institute of Genomic Medicine (INMEGEN), Mexico City, Mexico
  - Center for Complexity Sciences, Universidad Nacional Autónoma de México, Mexico City, Mexico
2
Guo L, Li S, Yan X, Shen L, Xia D, Xiong Y, Dou Y, Mi L, Ren Y, Xiang Y, Ren D, Wang J, Liang T. A comprehensive multi-omics analysis reveals molecular features associated with cancer via RNA cross-talks in the Notch signaling pathway. Comput Struct Biotechnol J 2022; 20:3972-3985. [PMID: 35950189 PMCID: PMC9340535 DOI: 10.1016/j.csbj.2022.07.036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2022] [Revised: 07/22/2022] [Accepted: 07/22/2022] [Indexed: 11/05/2022] Open
Abstract
Many Notch genes are identified as cancer-associated genes with an important role in tumorigenesis. Dynamic expression patterns are associated with Notch activity and are largely regulated by multiple ncRNAs. Cross-talks among diverse RNAs are crucial in cancers via the ceRNA network. The Notch pathway shows a robust prognostic ability when multi-omics features and their targets are integrated. The Notch pathway is also correlated with immune infiltration and may provide drug targets for cancer treatment.
Notch signaling has an important role in multiple cellular processes and is related to the carcinogenic process. To understand the potential molecular features of the crucial Notch pathway, a comprehensive multi-omics analysis is performed to explore its contributions in cancer, mainly including analysis of the somatic mutation landscape, pan-cancer expression, ncRNA regulation, and potential prognostic power. The 22 screened Notch core genes are relatively stable in DNA variation. Dynamic expression patterns are associated with Notch activity and are mainly regulated by multiple ncRNAs via ncRNA:mRNA interactions and ceRNA networks. The Notch pathway shows a potential prognostic ability when multi-omics features and their targets are integrated, and it is correlated with immune infiltration and may provide drug targets, implying a potential role in individualized treatment. Collectively, these findings contribute to exploring the crucial role of this key pathway in cancer pathophysiology and to gaining mechanistic insights into cross-talks among RNAs and biological pathways, which indicates the possible application of the well-conserved Notch signaling pathway in precision medicine.
3
Waitman LR, Song X, Walpitage DL, Connolly DC, Patel LP, Liu M, Schroeder MC, VanWormer JJ, Mosa AS, Anye ET, Davis AM. Enhancing PCORnet Clinical Research Network data completeness by integrating multistate insurance claims with electronic health records in a cloud environment aligned with CMS security and privacy requirements. J Am Med Inform Assoc 2021; 29:660-670. [PMID: 34897506 PMCID: PMC8922172 DOI: 10.1093/jamia/ocab269] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/08/2021] [Revised: 10/10/2021] [Accepted: 11/19/2021] [Indexed: 12/15/2022] Open
Abstract
OBJECTIVE The Greater Plains Collaborative (GPC) and other PCORnet Clinical Data Research Networks capture healthcare utilization within their health systems. Here, we describe a reusable environment (GPC Reusable Observable Unified Study Environment [GROUSE]) that integrates hospital and electronic health records (EHRs) data with state-wide Medicare and Medicaid claims and assess how claims and clinical data complement each other to identify obesity and related comorbidities in a patient sample. MATERIALS AND METHODS EHR, billing, and tumor registry data from 7 healthcare systems were integrated with Center for Medicare (2011-2016) and Medicaid (2011-2012) services insurance claims to create deidentified databases in Informatics for Integrating Biology & the Bedside and PCORnet Common Data Model formats. We describe technical details of how this federally compliant, cloud-based data environment was built. As a use case, trends in obesity rates for different age groups are reported, along with the relative contribution of claims and EHR data to data completeness and to detecting common comorbidities. RESULTS GROUSE contained 73 billion observations from 24 million unique patients (12.9 million Medicare; 13.9 million Medicaid; 6.6 million GPC patients) with 1 674 134 patients crosswalked and 983 450 patients with body mass index (BMI) linked to claims. Diagnosis codes from EHR and claims sources underreport obesity by 2.56 times compared with body mass index measures. However, common comorbidities such as diabetes and sleep apnea diagnoses were more often available from claims diagnosis codes (1.6 and 1.4 times, respectively). CONCLUSION GROUSE provides a unified EHR-claims environment to address health system and federal privacy concerns, which enables investigators to generalize analyses across health systems integrated with multistate insurance claims.
Affiliation(s)
- Lemuel R Waitman
  - Department of Health Informatics, University of Missouri School of Medicine, Columbia, Missouri, USA
- Xing Song
  - Department of Health Informatics, University of Missouri School of Medicine, Columbia, Missouri, USA
  - Corresponding Author: Lemuel R. Waitman, PhD, Department of Health Informatics, University of Missouri School of Medicine, 1st Hospital Drive, Columbia, MO 65212, USA
- Dammika Lakmal Walpitage
  - Department of Internal Medicine, Enterprise Analytics, University of Kansas Medical Center, Kansas City, Kansas, USA
- Daniel C Connolly
  - Division of Medical Informatics, Department of Internal Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA
- Lav P Patel
  - Division of Medical Informatics, Department of Internal Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA
- Mei Liu
  - Division of Medical Informatics, Department of Internal Medicine, University of Kansas Medical Center, Kansas City, Kansas, USA
- Mary C Schroeder
  - Division of Health Services Research, Department of Pharmacy Practice and Science, University of Iowa, Iowa City, Iowa, USA
- Jeffrey J VanWormer
  - Center for Clinical Epidemiology & Population Health, Marshfield Clinic Research Institute, Marshfield, Wisconsin, USA
- Abu Saleh Mosa
  - Department of Health Informatics, University of Missouri School of Medicine, Columbia, Missouri, USA
- Ernest T Anye
  - Office of Information Security, University of Missouri Health, Columbia, Missouri, USA
- Ann M Davis
  - Department of Pediatrics, University of Kansas Medical Center, Kansas City, Kansas, USA
  - Center for Children’s Healthy Lifestyles & Nutrition, Kansas City, Missouri, USA
4
Grzesik P, Augustyn DR, Wyciślik Ł, Mrozek D. Serverless computing in omics data analysis and integration. Brief Bioinform 2021; 23:6367629. [PMID: 34505137 PMCID: PMC8499876 DOI: 10.1093/bib/bbab349] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 07/27/2021] [Revised: 06/28/2021] [Accepted: 08/06/2021] [Indexed: 11/30/2022] Open
Abstract
A comprehensive analysis of omics data can require vast computational resources and access to varied data sources that must be integrated into complex, multi-step analysis pipelines. Execution of many such analyses can be accelerated by applying the cloud computing paradigm, which provides scalable resources for storing data of different types and parallelizing data analysis computations. Moreover, these resources can be reused for different multi-omics analysis scenarios. Traditionally, developers are required to manage a cloud platform’s underlying infrastructure, configuration, maintenance and capacity planning. The serverless computing paradigm simplifies these operations by automatically allocating and maintaining both servers and virtual machines, as required for analysis tasks. This paradigm offers highly parallel execution and high scalability without manual management of the underlying infrastructure, freeing developers to focus on operational logic. This paper reviews serverless solutions in bioinformatics and evaluates their usage in omics data analysis and integration. We start by reviewing the application of the cloud computing model to a multi-omics data analysis and exposing some shortcomings of the early approaches. We then introduce the serverless computing paradigm and show its applicability for performing an integrative analysis of multiple omics data sources in the context of the COVID-19 pandemic.
Affiliation(s)
- Piotr Grzesik
  - Silesian University of Technology, Department of Applied Informatics, Gliwice 44-100, Poland
- Dariusz R Augustyn
  - Silesian University of Technology, Department of Applied Informatics, Gliwice 44-100, Poland
- Łukasz Wyciślik
  - Silesian University of Technology, Department of Applied Informatics, Gliwice 44-100, Poland
- Dariusz Mrozek
  - Corresponding author: Dariusz Mrozek, Department of Applied Informatics, Silesian University of Technology, Gliwice 44-100, Poland. E-mail:
5
Koppad S, B A, Gkoutos GV, Acharjee A. Cloud Computing Enabled Big Multi-Omics Data Analytics. Bioinform Biol Insights 2021; 15:11779322211035921. [PMID: 34376975 PMCID: PMC8323418 DOI: 10.1177/11779322211035921] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 05/01/2021] [Accepted: 07/12/2021] [Indexed: 12/27/2022] Open
Abstract
High-throughput experiments enable researchers to explore complex multifactorial diseases through large-scale analysis of omics data. Challenges for such high-dimensional data sets include storage, analyses, and sharing. Recent innovations in computational technologies and approaches, especially in cloud computing, offer a promising, low-cost, and highly flexible solution in the bioinformatics domain. Cloud computing is rapidly proving increasingly useful in molecular modeling, omics data analytics (eg, RNA sequencing, metabolomics, or proteomics data sets), and for the integration, analysis, and interpretation of phenotypic data. We review the adoption of advanced cloud-based and big data technologies for processing and analyzing omics data and provide insights into state-of-the-art cloud bioinformatics applications.
Affiliation(s)
- Saraswati Koppad
  - Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India
- Annappa B
  - Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India
- Georgios V Gkoutos
  - Institute of Cancer and Genomic Sciences and Centre for Computational Biology, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
  - Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
  - NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospitals Birmingham, Birmingham, UK
  - MRC Health Data Research UK (HDR UK), London, UK
  - NIHR Experimental Cancer Medicine Centre, Birmingham, UK
  - NIHR Biomedical Research Centre, University Hospitals Birmingham, Birmingham, UK
- Animesh Acharjee
  - Institute of Cancer and Genomic Sciences and Centre for Computational Biology, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
  - Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
  - NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospitals Birmingham, Birmingham, UK
6
Arshad S, Arshad J, Khan MM, Parkinson S. Analysis of security and privacy challenges for DNA-genomics applications and databases. J Biomed Inform 2021; 119:103815. [PMID: 34022422 DOI: 10.1016/j.jbi.2021.103815] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/28/2021] [Revised: 05/07/2021] [Accepted: 05/08/2021] [Indexed: 02/06/2023]
Abstract
DNA technology is rapidly moving towards digitization. Scientists use software tools and applications for sequencing, synthesizing, analyzing and sharing of DNA and genomic data, operate lab equipment and store genetic information in shared datastores. Using cutting-edge computing methods and techniques, researchers have decoded the human genome, created organisms with new capabilities, automated drug development and transformed food safety. Such software applications are typically developed to progress scientific understanding and as such cyber security is never a concern for these applications. However, with the increasing commercialisation of DNA technologies, coupled with the sensitivity of DNA data, there is a need to adopt a security-by-design approach. In this paper we investigate bio-cyber security threats to genomic-DNA data and to the software applications that use such data to advance scientific research. Specifically, we adopt an empirical approach to analyse and identify vulnerabilities within genomic-DNA databases and bioinformatics software applications that can lead to cyber-attacks affecting the confidentiality, integrity and availability of such sensitive data. We present a detailed analysis of these threats and highlight potential protection mechanisms to help researchers pursue these research directions.
Affiliation(s)
- Saadia Arshad
  - Department of Computer Science & IT, NED University of Engineering and Technology, Karachi, Pakistan
- Junaid Arshad
  - School of Computing and Digital Technology, Birmingham City University, Birmingham, UK
- Muhammad Mubashir Khan
  - Department of Computer Science & IT, NED University of Engineering and Technology, Karachi, Pakistan
- Simon Parkinson
  - Department of Computer Science, University of Huddersfield, Huddersfield, UK
8
Liu J, Liu Q, Zhang L, Su S, Liu Y. Enabling Massive XML-Based Biological Data Management in HBase. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1994-2004. [PMID: 31094692 DOI: 10.1109/tcbb.2019.2915811] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 06/09/2023]
Abstract
Publishing biological data in XML formats is attractive for organizations that would like to provide their bioinformatics resources in an extensible and machine-readable format. In the era of big data, massive XML-based biological data management has emerged as a challenging issue. With the continuous growth of XML-based biological data sets, it is usually frustrating to use traditional declarative query languages to provide efficient query capabilities in terms of processing speed and scale. In this study, we report a novel platform to store and query massive XML-based biological data collections. A prototype tool for constructing HBase tables from XML-based biological data collections is first developed, and then a formal approach to transform the XML query model into the MapReduce query model is proposed. Finally, an evaluation of the query performance of the proposed approach on existing XML-based biological databases is presented, showing the performance advantages of the proposed solution. The source code of the massive XML-based biological data management platform is freely available at https://github.com/lyotvincent/X2H.
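As an illustration only, the general idea in the abstract above (flattening XML records into table rows, then expressing a query as map and reduce steps) can be sketched in plain Python. This is not the X2H platform's code or schema and it does not use HBase or Hadoop; the XML element names, the row layout, and the functions `xml_to_rows`, `map_phase`, and `reduce_phase` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical XML-based biological records; element names are illustrative only.
XML_DATA = """
<entries>
  <entry id="P1"><organism>human</organism><gene>NOTCH1</gene></entry>
  <entry id="P2"><organism>mouse</organism><gene>Notch1</gene></entry>
  <entry id="P3"><organism>human</organism><gene>NRXN1</gene></entry>
</entries>
"""


def xml_to_rows(xml_text):
    """Flatten each <entry> into one row: row key -> {column name: value}."""
    rows = {}
    for entry in ET.fromstring(xml_text).findall("entry"):
        rows[entry.get("id")] = {child.tag: (child.text or "") for child in entry}
    return rows


def map_phase(rows, column):
    """Emit a (value, 1) pair for every row that has the requested column."""
    for cells in rows.values():
        if column in cells:
            yield cells[column], 1


def reduce_phase(pairs):
    """Sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


rows = xml_to_rows(XML_DATA)
# Rough equivalent of an XML query that counts entries per organism.
organism_counts = reduce_phase(map_phase(rows, "organism"))
print(organism_counts)
```

In a real deployment the rows would live in a distributed table and the two phases would run in parallel across nodes; the in-memory dictionary here only stands in for that storage layer.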
9
Krissaane I, De Niz C, Gutiérrez-Sacristán A, Korodi G, Ede N, Kumar R, Lyons J, Manrai A, Patel C, Kohane I, Avillach P. Scalability and cost-effectiveness analysis of whole genome-wide association studies on Google Cloud Platform and Amazon Web Services. J Am Med Inform Assoc 2020; 27:1425-1430. [PMID: 32719837 PMCID: PMC7534581 DOI: 10.1093/jamia/ocaa068] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/23/2020] [Revised: 03/20/2020] [Accepted: 04/17/2020] [Indexed: 01/14/2023] Open
Abstract
Objective Advancements in human genomics have generated a surge of available data, fueling the growth and accessibility of databases for more comprehensive, in-depth genetic studies. Methods We provide a straightforward and innovative methodology to optimize cloud configuration in order to conduct genome-wide association studies. We utilized Spark clusters on both Google Cloud Platform and Amazon Web Services, as well as Hail (http://doi.org/10.5281/zenodo.2646680), for analysis and exploration of genomic variant datasets. Results Comparative evaluation of numerous cloud-based cluster configurations demonstrates a successful and unprecedented compromise between speed and cost for performing genome-wide association studies on 4 distinct whole-genome sequencing datasets. Results are consistent across the 2 cloud providers and could be highly useful for accelerating research in genetics. Conclusions We present a timely piece for one of the most frequently asked questions when moving to the cloud: what is the trade-off between speed and cost?
Affiliation(s)
- Paul Avillach
  - Corresponding Author: Paul Avillach, Department of Biomedical Informatics, Harvard Medical School, Harvard University, Boston 02115, MA, USA
10
Pividori M, Im HK. ukbREST: efficient and streamlined data access for reproducible research in large biobanks. Bioinformatics 2020; 35:1971-1973. [PMID: 30395166 PMCID: PMC6546122 DOI: 10.1093/bioinformatics/bty925] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/03/2018] [Revised: 10/26/2018] [Accepted: 11/03/2018] [Indexed: 12/03/2022] Open
Abstract
Summary Large biobanks, such as UK Biobank with half a million participants, are changing the scale and availability of genotypic and phenotypic data for researchers to ask fundamental questions about the biology of health and disease. The breadth of the UK Biobank data is enabling discoveries at an unprecedented pace. However, this size and complexity pose new challenges to investigators who need to keep the accruing data up to date, comply with potential consent changes, and efficiently and reproducibly extract subsets of the data to answer specific scientific questions. Here we propose a tool called ukbREST designed for the UK Biobank study (easily extensible to other biobanks), which allows authorized users to efficiently retrieve phenotypic and genetic data. It exposes a REST API that makes data highly accessible inside a private and secure network, and it allows the data specification to be written in a human-readable text format that is easily shareable with other researchers. These characteristics make ukbREST an important tool to make biobanks' valuable data more readily accessible to the research community and to facilitate reproducibility of the analysis, a key aspect of science. Availability and implementation It is implemented in Python using the Flask-RESTful framework for the API, and it is under the MIT license. It works with PostgreSQL, and a Docker image is available for easy deployment. The source code and documentation are available on GitHub: https://github.com/hakyimlab/ukbrest.
Affiliation(s)
- Milton Pividori
  - Department of Medicine, Section of Genetic Medicine, The University of Chicago, Chicago, IL, USA
  - Center for Translational Data Science, The University of Chicago, Chicago, IL, USA
- Hae Kyung Im
  - Department of Medicine, Section of Genetic Medicine, The University of Chicago, Chicago, IL, USA
  - Center for Translational Data Science, The University of Chicago, Chicago, IL, USA
11
Alliey-Rodriguez N, Grey TA, Shafee R, Asif H, Lutz O, Bolo NR, Padmanabhan J, Tandon N, Klinger M, Reis K, Spring J, Coppes L, Zeng V, Hegde RR, Hoang DT, Bannai D, Nawaz U, Henson P, Liu S, Gage D, McCarroll S, Bishop JR, Hill S, Reilly JL, Lencer R, Clementz BA, Buckley P, Glahn DC, Meda SA, Narayanan B, Pearlson G, Keshavan MS, Ivleva EI, Tamminga C, Sweeney JA, Curtis D, Badner JA, Keedy S, Rapoport J, Liu C, Gershon ES. NRXN1 is associated with enlargement of the temporal horns of the lateral ventricles in psychosis. Transl Psychiatry 2019; 9:230. [PMID: 31530798 PMCID: PMC6748921 DOI: 10.1038/s41398-019-0564-9] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 03/21/2019] [Revised: 07/11/2019] [Accepted: 07/30/2019] [Indexed: 12/19/2022] Open
Abstract
Schizophrenia, Schizoaffective, and Bipolar disorders share behavioral and phenomenological traits, intermediate phenotypes, and some associated genetic loci with pleiotropic effects. Volumetric abnormalities in brain structures are among the intermediate phenotypes consistently reported associated with these disorders. In order to examine the genetic underpinnings of these structural brain modifications, we performed genome-wide association analyses (GWAS) on 60 quantitative structural brain MRI phenotypes in a sample of 777 subjects (483 cases and 294 controls pooled together). Genotyping was performed with the Illumina PsychChip microarray, followed by imputation to the 1000 genomes multiethnic reference panel. Enlargement of the Temporal Horns of Lateral Ventricles (THLV) is associated with an intronic SNP of the gene NRXN1 (rs12467877, P = 6.76E-10), which accounts for 4.5% of the variance in size. Enlarged THLV is associated with psychosis in this sample, and with reduction of the hippocampus and enlargement of the choroid plexus and caudate. Eight other suggestively significant associations (P < 5.5E-8) were identified with THLV and 5 other brain structures. Although rare deletions of NRXN1 have been previously associated with psychosis, this is the first report of a common SNP variant of NRXN1 associated with enlargement of the THLV in psychosis.
Affiliation(s)
- Ney Alliey-Rodriguez
  - University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA
- Tamar A. Grey
  - Massachusetts Institute of Technology, Cambridge, USA
- Rebecca Shafee
  - Harvard Medical School, Department of Genetics, Boston, USA
  - Stanley Center, Broad Institute of MIT and Harvard, Cambridge, USA
- Huma Asif
  - University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA
- Olivia Lutz
  - Harvard Medical School, Department of Psychiatry, Boston, USA
- Nicolas R. Bolo
  - Harvard Medical School, Department of Psychiatry, Boston, USA
- Jaya Padmanabhan
  - Harvard Medical School, Department of Psychiatry, Boston, USA
- Neeraj Tandon
  - Harvard Medical School, Department of Psychiatry, Boston, USA
- Madeline Klinger
  - University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA
- Katherine Reis
  - University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA
- Jonathan Spring
  - University of Chicago Laboratory for Advanced Computing, Chicago, USA
- Lucas Coppes
  - University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA
- Victor Zeng
  - Harvard University, Cambridge, USA
- Rachal R. Hegde
  - Boston University, Boston, USA
- Dung T. Hoang
  - Harvard University, Cambridge, USA
- Deepthi Bannai
  - Boston University, Boston, USA
- Uzma Nawaz
  - Boston University, Boston, USA
- Philip Henson
  - Harvard University, Cambridge, USA
- Siyuan Liu
  - Child Psychiatry Branch, National Institutes of Mental Health, National Institutes of Health, Bethesda, MD USA
- Diane Gage
  - Broad Institute of MIT and Harvard, Cambridge, USA
- Jeffrey R. Bishop
  - University of Minnesota, Department of Experimental and Clinical Pharmacology and Department of Psychiatry, Minneapolis, USA
- Scot Hill
  - Rosalind Franklin University, North Chicago, USA
- James L. Reilly
  - Northwestern University, Evanston, USA
- Rebekka Lencer
  - University of Muenster, Munster, Germany
- Brett A. Clementz
  - Department of Psychology, University of Georgia, Athens, Georgia
- Peter Buckley
  - Virginia Commonwealth University, Richmond, USA
- David C. Glahn
  - Yale University Departments of Psychiatry & Neuroscience, New Haven, USA
- Shashwath A. Meda
  - Yale University Departments of Psychiatry & Neuroscience, New Haven, USA
- Balaji Narayanan
  - Yale University Departments of Psychiatry & Neuroscience, New Haven, USA
- Godfrey Pearlson
  - Yale University Departments of Psychiatry & Neuroscience, New Haven, USA
- Matcheri S. Keshavan
  - Harvard Medical School, Department of Psychiatry, Boston, USA
- Elena I. Ivleva
  - University of Texas Southwestern Medical Center, Department of Psychiatry, Dallas, USA
- Carol Tamminga
  - University of Texas Southwestern Medical Center, Department of Psychiatry, Dallas, USA
- John A. Sweeney
  - University of Texas Southwestern Medical Center, Department of Psychiatry, Dallas, USA
- David Curtis
  - University College London and Centre for Psychiatry, Barts and the London School of Medicine and Dentistry, London, UK
- Judith A. Badner
  - Rush University Medical Center, Chicago, USA
- Sarah Keedy
  - University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA
- Judith Rapoport
  - Child Psychiatry Branch, National Institutes of Mental Health, National Institutes of Health, Bethesda, MD USA
- Chunyu Liu
  - SUNY Upstate Medical University, Binghamton, USA
- Elliot S. Gershon
  - University of Chicago, Department of Psychiatry and Behavioral Neurosciences, Chicago, USA
  - University of Chicago, Department of Human Genetics, Chicago, USA
12
Senf A. End-to-End Security for Local and Remote Human Genetic Data Applications at the EGA. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2019; 16:1324-1327. [PMID: 31095492 DOI: 10.1109/tcbb.2019.2916810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Sensitive genomic data should remain secure, whether on disk for storage or analysis, or in transport. However, secure storage, delivery, and usage of genomic data is complicated by the size of files and the diversity of workflows. This paper presents solutions developed by GA4GH and EGA that use customized encryption, encrypted file formats, toolchain integration, and intelligent APIs to help solve this problem.
13
Shared and distinct genetic risk factors for childhood-onset and adult-onset asthma: genome-wide and transcriptome-wide studies. The Lancet Respiratory Medicine 2019; 7:509-522. [PMID: 31036433 DOI: 10.1016/s2213-2600(19)30055-4] [Citation(s) in RCA: 175] [Impact Index Per Article: 35.0] [Received: 10/04/2018] [Revised: 12/20/2018] [Accepted: 01/07/2019] [Indexed: 02/07/2023]
Abstract
BACKGROUND Childhood-onset and adult-onset asthma differ with respect to severity and comorbidities. Whether they also differ with respect to genetic risk factors has not been previously investigated in large samples. The goals of this study were to identify shared and distinct genetic risk loci for childhood-onset and adult-onset asthma, and to identify the genes that might mediate the effects of associated variation. METHODS We did genome-wide and transcriptome-wide studies, using data from the UK Biobank, in individuals with asthma, including adults with childhood-onset asthma (onset before 12 years of age), adults with adult-onset asthma (onset between 26 and 65 years of age), and adults without asthma (controls; aged older than 38 years). We did genome-wide association studies (GWAS) for childhood-onset asthma and adult-onset asthma each compared with shared controls, and for age of asthma onset in all asthma cases, with a genome-wide significance threshold of p<5 × 10⁻⁸. Enrichment studies determined the tissues in which genes at GWAS loci were most highly expressed, and PrediXcan, a transcriptome-wide gene-based test, was used to identify candidate risk genes. FINDINGS Of 376 358 British white individuals from the UK Biobank, we included 37 846 with self-reports of doctor-diagnosed asthma: 9433 adults with childhood-onset asthma; 21 564 adults with adult-onset asthma; and an additional 6849 young adults with asthma with onset between 12 and 25 years of age. For the first and second GWAS analyses, 318 237 individuals older than 38 years without asthma were used as controls. We detected 61 independent asthma loci: 23 were childhood-onset specific, one was adult-onset specific, and 37 were shared. 19 loci were associated with age of asthma onset. The most significant asthma-associated locus was at 17q12 (odds ratio 1·406, 95% CI 1·365-1·448; p=1·45 × 10⁻¹¹¹) in the childhood-onset GWAS.
Genes at the childhood onset-specific loci were most highly expressed in skin, blood, and small intestine; genes at the adult onset-specific loci were most highly expressed in lung, blood, small intestine, and spleen. PrediXcan identified 113 unique candidate genes at 22 of the 61 GWAS loci. Single-nucleotide polymorphism-based heritability estimates were more than three times larger for childhood-onset asthma (0·327) than for adult-onset disease (0·098). The onset of disease in childhood was associated with additional genes with relatively large effect sizes, with the largest odds ratio observed at the FLG locus at 1q21.3 (1·970, 95% CI 1·823-2·129). INTERPRETATION Genetic risk factors for adult-onset asthma are largely a subset of the genetic risk for childhood-onset asthma but with overall smaller effects, suggesting a greater role for non-genetic risk factors in adult-onset asthma. Combined with gene expression and tissue enrichment patterns, we suggest that the establishment of disease in children is driven more by dysregulated allergy and epithelial barrier function genes, whereas the cause of adult-onset asthma is more lung-centred and environmentally determined, but with immune-mediated mechanisms driving disease progression in both children and adults. FUNDING US National Institutes of Health.
|
14
|
Abstract
One of the recommendations of the Cancer Moonshot Blue Ribbon Panel report from 2016 was the creation of a national cancer data ecosystem. We review some of the approaches for building cancer data ecosystems and some of the progress that has been made. A data commons is the colocation of data with cloud computing infrastructure and commonly used software services, tools, and applications for managing, integrating, analyzing, and sharing data to create an interoperable resource for the research community. We discuss data commons and their potential role in cancer data ecosystems and, in particular, how multiple data commons can interoperate to form part of the foundation for a cancer data ecosystem.
|
15
|
Grossman RL. Data Lakes, Clouds, and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data. Trends Genet 2019; 35:223-234. [PMID: 30691868 PMCID: PMC6474403 DOI: 10.1016/j.tig.2018.12.006] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 09/21/2018] [Revised: 12/20/2018] [Accepted: 12/26/2018] [Indexed: 12/30/2022]
Abstract
Data commons collate data with cloud computing infrastructure and commonly used software services, tools, and applications to create biomedical resources for the large-scale management, analysis, harmonization, and sharing of biomedical data. Over the past few years, data commons have been used to analyze, harmonize, and share large-scale genomics datasets. Data ecosystems can be built by interoperating multiple data commons. It can be quite labor intensive to curate, import, and analyze the data in a data commons. Data lakes provide an alternative to data commons and simply provide access to data, with the data curation and analysis deferred until later and delegated to those that access the data. We review software platforms for managing, analyzing, and sharing genomic data, with an emphasis on data commons, but also cover data ecosystems and data lakes.
Affiliation(s)
- Robert L Grossman
  - Center for Translational Data Science, University of Chicago, 900 East 57th Street, KCBD 10142, Chicago, IL 60637, USA
|
16
|
Geeleher P, Nath A, Wang F, Zhang Z, Barbeira AN, Fessler J, Grossman RL, Seoighe C, Stephanie Huang R. Cancer expression quantitative trait loci (eQTLs) can be determined from heterogeneous tumor gene expression data by modeling variation in tumor purity. Genome Biol 2018; 19:130. [PMID: 30205839 PMCID: PMC6131897 DOI: 10.1186/s13059-018-1507-0] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 03/28/2018] [Accepted: 08/14/2018] [Indexed: 02/06/2023] Open
Abstract
Expression quantitative trait loci (eQTLs) identified using tumor gene expression data could affect gene expression in cancer cells, tumor-associated normal cells, or both. Here, we have demonstrated a method to identify eQTLs affecting expression in cancer cells by modeling the statistical interaction between genotype and tumor purity. Only one third of breast cancer risk variants, identified as eQTLs from a conventional analysis, could be confidently attributed to cancer cells. The remaining variants could affect cells of the tumor microenvironment, such as immune cells and fibroblasts. Deconvolution of tumor eQTLs will help determine how inherited polymorphisms influence cancer risk, development, and treatment response.
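One way to picture the genotype-by-purity interaction described in this abstract is an ordinary least-squares fit with an interaction term. The sketch below is illustrative only: the function names, the synthetic noise-free data, and the parameterization by tumor purity are assumptions made for this example, not the authors' published implementation.

```python
from itertools import product


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_interaction_model(genotype, purity, expression):
    """Least-squares fit of expression ~ 1 + g + p + g*p via the normal equations."""
    rows = [[1.0, g, p, g * p] for g, p in zip(genotype, purity)]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, expression)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free data: genotype dosage 0/1/2 crossed with purity levels.
true_beta = [5.0, 0.2, -1.0, 0.8]  # intercept, genotype, purity, genotype x purity
samples = list(product([0, 1, 2], [0.2, 0.4, 0.6, 0.8, 1.0]))
genotype = [float(g) for g, _ in samples]
purity = [p for _, p in samples]
expression = [
    true_beta[0] + true_beta[1] * g + true_beta[2] * p + true_beta[3] * g * p
    for g, p in zip(genotype, purity)
]

beta = fit_interaction_model(genotype, purity, expression)
effect_in_normal_cells = beta[1]            # genotype effect extrapolated to purity 0
effect_in_cancer_cells = beta[1] + beta[3]  # genotype effect extrapolated to purity 1
```

Under this parameterization, a non-zero interaction coefficient means the genotype effect changes with purity, which is the signal used to attribute an eQTL to cancer cells rather than to the surrounding normal cells. With real, noisy data a standard error and significance test for the interaction term would also be needed.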
Affiliation(s)
- Paul Geeleher
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
  - Section of Hematology/Oncology, Department of Medicine, University of Chicago, Chicago, IL, USA
- Aritro Nath
  - Section of Hematology/Oncology, Department of Medicine, University of Chicago, Chicago, IL, USA
  - Department of Experimental and Clinical Pharmacology, University of Minnesota, Minneapolis, MN, USA
- Fan Wang
  - Section of Hematology/Oncology, Department of Medicine, University of Chicago, Chicago, IL, USA
  - Ben May Department for Cancer Research, University of Chicago, Chicago, IL, USA
- Zhenyu Zhang
  - Center for Data Intensive Science, University of Chicago, Chicago, IL, USA
- Alvaro N Barbeira
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
- Jessica Fessler
  - Department of Pathology, University of Chicago, Chicago, IL, USA
- Robert L Grossman
  - Center for Data Intensive Science, University of Chicago, Chicago, IL, USA
- Cathal Seoighe
  - School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Ireland
- R Stephanie Huang
  - Section of Hematology/Oncology, Department of Medicine, University of Chicago, Chicago, IL, USA
  - Department of Experimental and Clinical Pharmacology, University of Minnesota, Minneapolis, MN, USA
  - Department of Experimental and Clinical Pharmacology, College of Pharmacy, Room 5-130 WDH, 1332A, 308 Harvard St SE, Minneapolis, MN, 55455, USA
|
17
|
Abstract
Biomedical research has become a digital data–intensive endeavor, relying on secure and scalable computing, storage, and network infrastructure, which has traditionally been purchased, supported, and maintained locally. For certain types of biomedical applications, cloud computing has emerged as an alternative to locally maintained traditional computing approaches. Cloud computing offers users pay-as-you-go access to services such as hardware infrastructure, platforms, and software for solving common biomedical computational problems. Cloud computing services offer secure on-demand storage and analysis and are differentiated from traditional high-performance computing by their rapid availability and scalability of services. As such, cloud services are engineered to address big data problems and enhance the likelihood of data and analytics sharing, reproducibility, and reuse. Here, we provide an introductory perspective on cloud computing to help the reader determine its value to their own research.
|
18
|
Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, Wheeler HE, Torres JM, Torstenson ES, Shah KP, Garcia T, Edwards TL, Stahl EA, Huckins LM, Nicolae DL, Cox NJ, Im HK. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun 2018; 9:1825. [PMID: 29739930 PMCID: PMC5940825 DOI: 10.1038/s41467-018-03621-1] [Citation(s) in RCA: 561] [Impact Index Per Article: 93.5] [Received: 06/22/2017] [Accepted: 12/27/2017] [Indexed: 12/25/2022] Open
Abstract
Scalable, integrative methods to understand mechanisms that link genetic variants with phenotypes are needed. Here we derive a mathematical expression to compute PrediXcan (a gene mapping approach) results using summary data (S-PrediXcan) and show its accuracy and general robustness to misspecified reference sets. We apply this framework to 44 GTEx tissues and 100+ phenotypes from GWAS and meta-analysis studies, creating a growing public catalog of associations that seeks to capture the effects of gene expression variation on human phenotypes. Replication in an independent cohort is shown. Most of the associations are tissue specific, suggesting context specificity of the trait etiology. Colocalized significant associations in unexpected tissues underscore the need for an agnostic scanning of multiple contexts to improve our ability to detect causal regulatory mechanisms. Monogenic disease genes are enriched among significant associations for related traits, suggesting that smaller alterations of these genes may cause a spectrum of milder phenotypes.
Affiliation(s)
- Alvaro N Barbeira
  - Section of Genetic Medicine, The University of Chicago, Chicago, IL, 60637, USA
- Scott P Dickinson
  - Section of Genetic Medicine, The University of Chicago, Chicago, IL, 60637, USA
- Rodrigo Bonazzola
  - Section of Genetic Medicine, The University of Chicago, Chicago, IL, 60637, USA
- Jiamao Zheng
  - Section of Genetic Medicine, The University of Chicago, Chicago, IL, 60637, USA
- Heather E Wheeler
  - Department of Biology, Loyola University Chicago, Chicago, IL, 60660, USA
  - Department of Computer Science, Loyola University Chicago, Chicago, IL, 60660, USA
- Jason M Torres
  - Committee on Molecular Metabolism and Nutrition, The University of Chicago, Chicago, IL, 60637, USA
- Eric S Torstenson
  - Vanderbilt Genetic Institute, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
- Kaanan P Shah
  - Section of Genetic Medicine, The University of Chicago, Chicago, IL, 60637, USA
- Tzintzuni Garcia
  - Center for Research Informatics, The University of Chicago, Chicago, IL, 60615, USA
- Todd L Edwards
  - Division of Epidemiology, Department of Medicine, Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
- Eli A Stahl
  - Division of Psychiatric Genomics, Icahn School of Medicine at Mount Sinai, NYC, NY, 10029, USA
  - Department of Genetics and Genomics, Icahn School of Medicine at Mount Sinai, NYC, NY, 10029, USA
- Laura M Huckins
  - Division of Psychiatric Genomics, Icahn School of Medicine at Mount Sinai, NYC, NY, 10029, USA
  - Department of Genetics and Genomics, Icahn School of Medicine at Mount Sinai, NYC, NY, 10029, USA
- Dan L Nicolae
  - Section of Genetic Medicine, The University of Chicago, Chicago, IL, 60637, USA
- Nancy J Cox
  - Vanderbilt Genetic Institute, Vanderbilt University Medical Center, Nashville, TN, 37232, USA
- Hae Kyung Im
  - Section of Genetic Medicine, The University of Chicago, Chicago, IL, 60637, USA
|
19
|
Abstract
Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.
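The growth rate quoted in this abstract (archives doubling in size every 18 months) can be turned into a quick back-of-the-envelope projection. The helper below is a hypothetical illustration; the function name and the assumption of smooth exponential growth are not taken from the article.

```python
def projected_size(initial_size, months_elapsed, doubling_months=18.0):
    """Project archive size assuming steady exponential growth.

    initial_size    -- current size in any unit (e.g. petabytes)
    months_elapsed  -- how far ahead to project, in months
    doubling_months -- assumed doubling period (18 months, per the abstract)
    """
    return initial_size * 2.0 ** (months_elapsed / doubling_months)


# Growth factor after three years (two doubling periods) and after ten years.
three_year_factor = projected_size(1.0, 36)
ten_year_factor = projected_size(1.0, 120)
```

Two doubling periods give a fourfold increase, and a decade at this pace gives roughly a hundredfold increase, which is the scale of growth the abstract argues local infrastructure struggles to absorb.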
Affiliation(s)
- Ben Langmead
  - Department of Computer Science, Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Abhinav Nellore
  - Department of Biomedical Engineering, Department of Surgery, Computational Biology Program, Oregon Health and Science University, Portland, OR, USA
|
20
|
Vasile MA, Pop F, Niţă MC, Cristea V. MLBox: Machine learning box for asymptotic scheduling. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2017.01.005] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 10/20/2022]
|
21
|
Geeleher P, Zhang Z, Wang F, Gruener RF, Nath A, Morrison G, Bhutra S, Grossman RL, Huang RS. Discovering novel pharmacogenomic biomarkers by imputing drug response in cancer patients from large genomics studies. Genome Res 2017; 27:1743-1751. [PMID: 28847918 PMCID: PMC5630037 DOI: 10.1101/gr.221077.117] [Citation(s) in RCA: 72] [Impact Index Per Article: 10.3] [Received: 01/25/2017] [Accepted: 08/03/2017] [Indexed: 12/20/2022]
Abstract
Obtaining accurate drug response data in large cohorts of cancer patients is very challenging; thus, most cancer pharmacogenomics discovery is conducted in preclinical studies, typically using cell lines and mouse models. However, these platforms suffer from serious limitations, including small sample sizes. Here, we have developed a novel computational method that allows us to impute drug response in very large clinical cancer genomics data sets, such as The Cancer Genome Atlas (TCGA). The approach works by creating statistical models relating gene expression to drug response in large panels of cancer cell lines and applying these models to tumor gene expression data in the clinical data sets (e.g., TCGA). This yields an imputed drug response for every drug in each patient. These imputed drug response data are then associated with somatic genetic variants measured in the clinical cohort, such as copy number changes or mutations in protein coding genes. These analyses recapitulated drug associations for known clinically actionable somatic genetic alterations and identified new predictive biomarkers for existing drugs.
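The two-step idea in this abstract (fit a model on cell lines, then apply it to patient tumors) can be sketched with a deliberately tiny example. Everything below is hypothetical: the single-gene linear model, the function name, and the numbers are stand-ins for illustration and are far simpler than the genome-wide models the authors describe.

```python
def fit_simple_regression(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Step 1: learn the expression-to-response relationship in cell lines.
# Made-up values: expression of one gene and a measured drug sensitivity score.
cell_line_expression = [1.0, 2.0, 3.0, 4.0, 5.0]
cell_line_response = [9.1, 7.9, 7.2, 5.8, 5.0]
intercept, slope = fit_simple_regression(cell_line_expression, cell_line_response)

# Step 2: apply the fitted model to tumor expression to impute a response
# for patients whose drug response was never measured.
tumor_expression = {"patient_A": 1.5, "patient_B": 4.5}
imputed_response = {
    patient: intercept + slope * value
    for patient, value in tumor_expression.items()
}
```

The imputed values can then be tested for association with somatic variants in the clinical cohort, as the abstract outlines. A realistic version would use many genes, a regularized model, and expression data harmonized between cell lines and tumors.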
Affiliation(s)
- Paul Geeleher
  - Section of Hematology/Oncology, The University of Chicago, Chicago, Illinois 60637, USA
- Zhenyu Zhang
  - Center for Data Intensive Science, The University of Chicago, Chicago, Illinois 60637, USA
- Fan Wang
  - Section of Hematology/Oncology, The University of Chicago, Chicago, Illinois 60637, USA
- Robert F Gruener
  - Section of Hematology/Oncology, The University of Chicago, Chicago, Illinois 60637, USA
- Aritro Nath
  - Section of Hematology/Oncology, The University of Chicago, Chicago, Illinois 60637, USA
- Gladys Morrison
  - Section of Hematology/Oncology, The University of Chicago, Chicago, Illinois 60637, USA
- Steven Bhutra
  - Section of Hematology/Oncology, The University of Chicago, Chicago, Illinois 60637, USA
- Robert L Grossman
  - Center for Data Intensive Science, The University of Chicago, Chicago, Illinois 60637, USA
- R Stephanie Huang
  - Section of Hematology/Oncology, The University of Chicago, Chicago, Illinois 60637, USA
|
22
|
Mashl RJ, Scott AD, Huang KL, Wyczalkowski MA, Yoon CJ, Niu B, DeNardo E, Yellapantula VD, Handsaker RE, Chen K, Koboldt DC, Ye K, Fenyö D, Raphael BJ, Wendl MC, Ding L. GenomeVIP: a cloud platform for genomic variant discovery and interpretation. Genome Res 2017; 27:1450-1459. [PMID: 28522612 PMCID: PMC5538560 DOI: 10.1101/gr.211656.116] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 06/21/2016] [Accepted: 05/03/2017] [Indexed: 12/12/2022]
Abstract
Identifying genomic variants is a fundamental first step toward the understanding of the role of inherited and acquired variation in disease. The accelerating growth in the corpus of sequencing data that underpins such analysis is making the data-download bottleneck more evident, placing substantial burdens on the research community to keep pace. As a result, the search for alternative approaches to the traditional “download and analyze” paradigm on local computing resources has led to a rapidly growing demand for cloud-computing solutions for genomics analysis. Here, we introduce the Genome Variant Investigation Platform (GenomeVIP), an open-source framework for performing genomics variant discovery and annotation using cloud- or local high-performance computing infrastructure. GenomeVIP orchestrates the analysis of whole-genome and exome sequence data using a set of robust and popular task-specific tools, including VarScan, GATK, Pindel, BreakDancer, Strelka, and Genome STRiP, through a web interface. GenomeVIP has been used for genomic analysis in large-data projects such as the TCGA PanCanAtlas and in other projects, such as the ICGC Pilots, CPTAC, ICGC-TCGA DREAM Challenges, and the 1000 Genomes SV Project. Here, we demonstrate GenomeVIP's ability to provide high-confidence annotated somatic, germline, and de novo variants of potential biological significance using publicly available data sets.
Affiliation(s)
- R Jay Mashl
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
- Adam D Scott
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
- Kuan-Lin Huang
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
- Christopher J Yoon
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
- Beifang Niu
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
- Erin DeNardo
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
- Venkata D Yellapantula
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
- Robert E Handsaker
  - Stanley Center for Psychiatric Research, Broad Institute, Cambridge, Massachusetts 02142, USA
  - Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA
- Ken Chen
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030, USA
- Daniel C Koboldt
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
- Kai Ye
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
- David Fenyö
  - Langone Medical Center, New York University, New York, New York 10016, USA
- Benjamin J Raphael
  - Department of Computer Science and Center for Computational Molecular Biology, Brown University, Providence, Rhode Island 02912, USA
- Michael C Wendl
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Department of Genetics, Washington University, St. Louis, Missouri 63108, USA
  - Department of Mathematics, Washington University, St. Louis, Missouri 63108, USA
- Li Ding
  - McDonnell Genome Institute, Washington University, St. Louis, Missouri 63108, USA
  - Division of Oncology, Department of Medicine, Washington University, St. Louis, Missouri 63108, USA
  - Department of Genetics, Washington University, St. Louis, Missouri 63108, USA
  - Siteman Cancer Center, Washington University, St. Louis, Missouri 63108, USA
|
23
|
Weng C, Kahn MG. Clinical Research Informatics for Big Data and Precision Medicine. Yearb Med Inform 2016:211-218. [PMID: 27830253 DOI: 10.15265/iy-2016-019] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 12/17/2022] Open
Abstract
OBJECTIVES To reflect on the notable events and significant developments in Clinical Research Informatics (CRI) in the year of 2015 and discuss near-term trends impacting CRI. METHODS We selected key publications that highlight not only important recent advances in CRI but also notable events likely to have significant impact on CRI activities over the next few years or longer, and consulted the discussions in relevant scientific communities and an online living textbook for modern clinical trials. We also related the new concepts with old problems to improve the continuity of CRI research. RESULTS The highlights in CRI in 2015 include the growing adoption of electronic health records (EHR), the rapid development of regional, national, and global clinical data research networks for using EHR data to integrate scalable clinical research with clinical care and generate robust medical evidence. Data quality, integration, and fusion, data access by researchers, study transparency, results reproducibility, and infrastructure sustainability are persistent challenges. CONCLUSION The advances in Big Data Analytics and Internet technologies together with the engagement of citizens in sciences are shaping the global clinical research enterprise, which is getting more open and increasingly stakeholder-centered, where stakeholders include patients, clinicians, researchers, and sponsors.
Affiliation(s)
- C Weng
  - Chunhua Weng, PhD, FACMI, Department of Biomedical Informatics, Columbia University, 622 W 168 Street, PH-20, New York, NY 10032, USA
|
24
|
Grossman RL, Heath A, Murphy M, Patterson M, Wells W. A Case for Data Commons: Toward Data Science as a Service. Comput Sci Eng 2016; 18:10-20. [PMID: 29033693 DOI: 10.1109/mcse.2016.92] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Indexed: 11/07/2022]
Abstract
Data commons collocate data, storage, and computing infrastructure with core services and commonly used tools and applications for managing, analyzing, and sharing data to create an interoperable resource for the research community. An architecture for data commons is described, as well as some lessons learned from operating several large-scale data commons.
Affiliation(s)
- Walt Wells
  - Center for Computational Science Research
|
25
|
Roy S, LaFramboise WA, Nikiforov YE, Nikiforova MN, Routbort MJ, Pfeifer J, Nagarajan R, Carter AB, Pantanowitz L. Next-Generation Sequencing Informatics: Challenges and Strategies for Implementation in a Clinical Environment. Arch Pathol Lab Med 2016; 140:958-75. [PMID: 26901284 DOI: 10.5858/arpa.2015-0507-ra] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Indexed: 11/06/2022]
Abstract
CONTEXT: Next-generation sequencing (NGS) is revolutionizing the discipline of laboratory medicine, with a deep and direct impact on patient care. Although it empowers clinical laboratories with unprecedented genomic sequencing capability, NGS has brought along obvious and obtrusive informatics challenges. Bioinformatics and clinical informatics are separate disciplines with typically a small degree of overlap, but they have been brought together by the enthusiastic adoption of NGS in clinical laboratories. The result has been a collaborative environment for the development of novel informatics solutions. Sustaining NGS-based testing in a regulated clinical environment requires institutional support to build and maintain a practical, robust, scalable, secure, and cost-effective informatics infrastructure. OBJECTIVE: To discuss the novel NGS informatics challenges facing pathology laboratories today and offer solutions and future developments to address these obstacles. DATA SOURCES: The published literature pertaining to NGS informatics was reviewed. The coauthors, experts in the fields of molecular pathology, precision medicine, and pathology informatics, also contributed their experiences. CONCLUSIONS: The boundary between bioinformatics and clinical informatics has significantly blurred with the introduction of NGS into clinical molecular laboratories. Next-generation sequencing technology and the data derived from these tests, if managed well in the clinical laboratory, will redefine the practice of medicine. In order to sustain this progress, adoption of smart computing technology will be essential. Computational pathologists will be expected to play a major role in rendering diagnostic and theranostic services by leveraging "Big Data" and modern computing tools.
Affiliation(s)
- Liron Pantanowitz
  - From the Department of Pathology, University of Pittsburgh Medical Center, Pittsburgh, Pennsylvania (Drs Roy, LaFramboise, Nikiforov, Nikiforova, and Pantanowitz); the Department of Pathology, MD Anderson Cancer Center, Houston, Texas (Dr Routbort); the Department of Pathology and Immunology, Washington University School of Medicine, St Louis, Missouri (Drs Pfeifer and Nagarajan); PierianDx, St Louis, Missouri (Dr Nagarajan); and the Department of Pathology and Laboratory Medicine, Children's Healthcare of Atlanta, Atlanta, Georgia (Dr Carter)
|
26
|
Abstract
Molecular informatics (MI) is an evolving discipline that will support the dynamic landscape of molecular pathology and personalized medicine. MI provides a fertile ground for development of clinical solutions to bridge the gap between clinical informatics and bioinformatics. Rapid adoption of next generation sequencing (NGS) in the clinical arena has triggered major endeavors in MI that are expected to bring a paradigm shift in the practice of pathology. This brief review presents a broad overview of various aspects of MI, particularly in the context of NGS based testing.
Affiliation(s)
- Somak Roy
  - Department of Pathology, Molecular and Genomic Pathology, University of Pittsburgh Medical Center, 3477 Euler way, Pittsburgh, PA 15213, USA
|
27
|
Abstract
High-throughput platforms such as microarray, mass spectrometry, and next-generation sequencing are producing an increasing volume of omics data that needs large data storage and computing power. Cloud computing offers massively scalable computing and storage, data sharing, and on-demand access to resources and applications anytime and anywhere, and thus it may represent the key technology for facing those issues. In fact, in recent years it has been adopted for the deployment of different bioinformatics solutions and services both in academia and in industry. Despite this, cloud computing raises several issues regarding the security and privacy of data, which are particularly important when analyzing patient data, such as in personalized medicine. This chapter reviews the main academic and industrial cloud-based bioinformatics solutions, with a special focus on microarray data analysis, and underlines the main issues and problems related to the use of such platforms for the storage and analysis of patient data.
Affiliation(s)
- Barbara Calabrese
  - Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy
- Mario Cannataro
  - Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy
|
28
|
Ocaña K, de Oliveira D. Parallel computing in genomic research: advances and applications. Adv Appl Bioinform Chem 2015; 8:23-35. [PMID: 26604801 PMCID: PMC4655901 DOI: 10.2147/aabc.s64482] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 12/31/2022] Open
Abstract
Today's genomic experiments have to process so-called "biological big data" that is now reaching the size of terabytes and petabytes. To process this huge amount of data, scientists may require weeks or months if they use their own workstations. Parallelism techniques and high-performance computing (HPC) environments can be applied to reduce the total processing time and to ease the management, treatment, and analysis of this data. However, running bioinformatics experiments in HPC environments such as clouds, grids, clusters, and graphics processing units requires expertise from scientists to integrate computational, biological, and mathematical techniques and technologies. Several solutions have already been proposed to allow scientists to process their genomic experiments using HPC capabilities and parallelism techniques. This article presents a systematic review of literature that surveys the most recently published research involving genomics and parallel computing. Our objective is to gather the main characteristics, benefits, and challenges that can be considered by scientists when running their genomic experiments to benefit from parallelism techniques and HPC capabilities.
Affiliation(s)
- Kary Ocaña
  - National Laboratory of Scientific Computing, Petrópolis, Rio de Janeiro, Brazil
|
29
|
Alyass A, Turcotte M, Meyre D. From big data analysis to personalized medicine for all: challenges and opportunities. BMC Med Genomics 2015; 8:33. [PMID: 26112054 PMCID: PMC4482045 DOI: 10.1186/s12920-015-0108-y] [Citation(s) in RCA: 220] [Impact Index Per Article: 24.4] [Received: 01/21/2015] [Accepted: 06/15/2015] [Indexed: 02/07/2023] Open
Abstract
Recent advances in high-throughput technologies have led to the emergence of systems biology as a holistic science to achieve more precise modeling of complex diseases. Many predict the emergence of personalized medicine in the near future. We are, however, moving from two-tiered health systems to a two-tiered personalized medicine. Omics facilities are restricted to affluent regions, and personalized medicine is likely to widen the growing gap in health systems between high and low-income countries. This is mirrored by an increasing lag between our ability to generate big data and our ability to analyze it. Several bottlenecks slow down the transition from conventional to personalized medicine: generation of cost-effective high-throughput data; hybrid education and multidisciplinary teams; data storage and processing; data integration and interpretation; and individual and global economic relevance. This review provides an update of important developments in the analysis of big data and forward strategies to accelerate the global transition to personalized medicine.
Affiliation(s)
- Akram Alyass
  - Department of Clinical Epidemiology and Biostatistics, McMaster University, 1280 Main Street West, Hamilton, ON, Canada
- Michelle Turcotte
  - Department of Clinical Epidemiology and Biostatistics, McMaster University, 1280 Main Street West, Hamilton, ON, Canada
- David Meyre
  - Department of Clinical Epidemiology and Biostatistics, McMaster University, 1280 Main Street West, Hamilton, ON, Canada
  - Department of Pathology and Molecular Medicine, McMaster University, 1280 Main Street West, Hamilton, ON, Canada
|
30
|
Geskin A, Legowski E, Chakka A, Chandran UR, Barmada MM, LaFramboise WA, Berg J, Jacobson RS. Needs Assessment for Research Use of High-Throughput Sequencing at a Large Academic Medical Center. PLoS One 2015; 10:e0131166. [PMID: 26115441 PMCID: PMC4483235 DOI: 10.1371/journal.pone.0131166] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 12/24/2014] [Accepted: 05/29/2015] [Indexed: 12/19/2022] Open
Abstract
Next Generation Sequencing (NGS) methods are driving profound changes in biomedical research, with a growing impact on patient care. Many academic medical centers are evaluating potential models to prepare for the rapid increase in NGS information needs. This study sought to investigate (1) how and where sequencing data is generated and analyzed, (2) research objectives and goals for NGS, (3) workforce capacity and unmet needs, (4) storage capacity and unmet needs, (5) available and anticipated funding resources, and (6) future challenges. As a precursor to informed decision making at our institution, we undertook a systematic needs assessment of investigators using survey methods. We recruited 331 investigators from over 60 departments and divisions at the University of Pittsburgh Schools of Health Sciences and had 140 respondents, or a 42% response rate. Results suggest that both sequencing and analysis bottlenecks currently exist. Significant educational needs were identified, including both investigator-focused needs, such as selection of NGS methods suitable for specific research objectives, and program-focused needs, such as support for training an analytic workforce. The absence of centralized infrastructure was identified as an important institutional gap. Key principles for organizations managing this change were formulated based on the survey responses. This needs assessment provides an in-depth case study which may be useful to other academic medical centers as they identify and plan for future needs.
Affiliation(s)
- Albert Geskin
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- Elizabeth Legowski
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- Anish Chakka
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania, United States of America
- Uma R Chandran
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania, United States of America
- M. Michael Barmada
- Institute for Personalized Medicine, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- Department of Human Genetics, University of Pittsburgh Graduate School of Public Health, Pittsburgh, Pennsylvania, United States of America
- William A. LaFramboise
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania, United States of America
- Jeremy Berg
- Institute for Personalized Medicine, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- Rebecca S. Jacobson
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- University of Pittsburgh Cancer Institute, Pittsburgh, Pennsylvania, United States of America
- Institute for Personalized Medicine, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- * E-mail:
|
31
|
Abstract
Molecular informatics (MI) is an evolving discipline that will support the dynamic landscape of molecular pathology and personalized medicine. MI provides a fertile ground for development of clinical solutions to bridge the gap between clinical informatics and bioinformatics. Rapid adoption of next generation sequencing (NGS) in the clinical arena has triggered major endeavors in MI that are expected to bring a paradigm shift in the practice of pathology. This brief review presents a broad overview of various aspects of MI, particularly in the context of NGS based testing.
Affiliation(s)
- Somak Roy
- Department of Pathology, Molecular and Genomic Pathology, University of Pittsburgh Medical Center, 3477 Euler way, Pittsburgh, PA 15213, USA.
|
32
|
|
33
|
Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of integrating data to uncover genotype-phenotype interactions. Nat Rev Genet 2015; 16:85-97. [PMID: 25582081 DOI: 10.1038/nrg3868] [Citation(s) in RCA: 531] [Impact Index Per Article: 59.0] [Indexed: 12/11/2022]
Abstract
Recent technological advances have expanded the breadth of available omic data, from whole-genome sequencing data, to extensive transcriptomic, methylomic and metabolomic data. A key goal of analyses of these data is the identification of effective models that predict phenotypic traits and outcomes, elucidating important biomarkers and generating important insights into the genetic underpinnings of the heritability of complex traits. There is still a need for powerful and advanced analysis strategies to fully harness the utility of these comprehensive high-throughput data, identifying true associations and reducing the number of false associations. In this Review, we explore the emerging approaches for data integration, including meta-dimensional and multi-staged analyses, which aim to deepen our understanding of the role of genetics and genomics in complex outcomes. With the use and further development of these approaches, an improved understanding of the relationship between genomic variation and human phenotypes may be revealed.
Affiliation(s)
- Marylyn D Ritchie
- Department of Biochemistry and Molecular Biology, Center for Systems Genomics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Emily R Holzinger
- National Human Genome Research Institute, Inherited Disease Research Branch, Baltimore, Maryland 21224, USA
- Ruowang Li
- Department of Biochemistry and Molecular Biology, Center for Systems Genomics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Sarah A Pendergrass
- Department of Biochemistry and Molecular Biology, Center for Systems Genomics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Dokyoon Kim
- Department of Biochemistry and Molecular Biology, Center for Systems Genomics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
|
34
|
Christoph J, Griebel L, Leb I, Engel I, Köpcke F, Toddenroth D, Prokosch HU, Laufer J, Marquardt K, Sedlmayr M. Secure Secondary Use of Clinical Data with Cloud-based NLP Services. Towards a Highly Scalable Research Infrastructure. Methods Inf Med 2014; 54:276-82. [PMID: 25377309 DOI: 10.3414/me13-01-0133] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 12/01/2013] [Accepted: 10/08/2014] [Indexed: 01/26/2023]
Abstract
OBJECTIVES: The secondary use of clinical data provides large opportunities for clinical and translational research as well as quality assurance projects. For such purposes, it is necessary to provide a flexible and scalable infrastructure that is compliant with privacy requirements. The major goals of the cloud4health project are to define such an architecture, to implement a technical prototype that fulfills these requirements, and to evaluate it with three use cases. METHODS: The architecture provides components for multiple data provider sites such as hospitals to extract free text as well as structured data from local sources and de-identify such data for further anonymous or pseudonymous processing. Free text documentation is analyzed and transformed into structured information by text-mining services, which are provided within a cloud-computing environment. Thus, newly gained annotations can be integrated along with the already available structured data items and the resulting data sets can be uploaded to a central study portal for further analysis. RESULTS: Based on the architecture design, a prototype has been implemented and is under evaluation in three clinical use cases. Data from several hundred patients provided by a University Hospital and a private hospital chain have already been processed. CONCLUSIONS: Cloud4health has shown how existing components for secondary use of structured data can be complemented with text-mining in a privacy-compliant manner. The cloud-computing paradigm allows a flexible and dynamically adaptable service provision that facilitates the adoption of services by data providers without their own investments in the respective hardware resources and software tools.
Affiliation(s)
- M Sedlmayr
- Dr. Martin Sedlmayr, Lehrstuhl für Medizinische Informatik, Friedrich-Alexander-Universität Erlangen-Nürnberg, Wetterkreuz 13, 91058 Erlangen, Germany, E-mail:
|
35
|
Genomic cloud computing: legal and ethical points to consider. Eur J Hum Genet 2014; 23:1271-8. [PMID: 25248396 PMCID: PMC4592072 DOI: 10.1038/ejhg.2014.196] [Citation(s) in RCA: 60] [Impact Index Per Article: 6.0] [Received: 05/13/2014] [Revised: 08/04/2014] [Accepted: 08/19/2014] [Indexed: 11/08/2022] Open
Abstract
The biggest challenge in twenty-first century data-intensive genomic science is developing vast computer infrastructure and advanced software tools to perform comprehensive analyses of genomic data sets for biomedical research and clinical practice. Researchers are increasingly turning to cloud computing both as a solution to integrate data from genomics, systems biology and biomedical data mining and as an approach to analyze data to solve biomedical problems. Although cloud computing provides several benefits such as lower costs and greater efficiency, it also raises legal and ethical issues. In this article, we discuss three key 'points to consider' (data control; data security, confidentiality and transfer; and accountability) based on a preliminary review of several publicly available cloud service providers' Terms of Service. These 'points to consider' should be borne in mind by genomic research organizations when negotiating legal arrangements to store genomic data on a large commercial cloud service provider's servers. Diligent genomic cloud computing means leveraging security standards and evaluation processes as a means to protect data and entails many of the same good practices that researchers should always consider in securing their local infrastructure.
|