1. Karim MR, Michel A, Zappa A, Baranov P, Sahay R, Rebholz-Schuhmann D. Improving data workflow systems with cloud services and use of open data for bioinformatics research. Brief Bioinform 2019; 19:1035-1050. [PMID: 28419324 PMCID: PMC6169675 DOI: 10.1093/bib/bbx039] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 10/21/2016] [Indexed: 11/22/2022]
Abstract
Data workflow systems (DWFSs) enable bioinformatics researchers to combine components for data access and data analytics, and to share the final data analytics approach with their collaborators. Increasingly, such systems have to cope with large-scale data, such as full genomes (about 200 GB each), public fact repositories (about 100 TB of data) and 3D imaging data at even larger scales. As moving the data becomes cumbersome, the DWFS needs to embed its processes into a cloud infrastructure, where the data are already hosted. As the standardized public data play an increasingly important role, the DWFS needs to comply with Semantic Web technologies. This advancement to DWFS would reduce overhead costs and accelerate the progress in bioinformatics research based on large-scale data and public resources, as researchers would require less specialized IT knowledge for the implementation. Furthermore, the high data growth rates in bioinformatics research drive the demand for parallel and distributed computing, which then imposes a need for scalability and high-throughput capabilities onto the DWFS. As a result, requirements for data sharing and access to public knowledge bases suggest that compliance of the DWFS with Semantic Web standards is necessary. In this article, we will analyze the existing DWFS with regard to their capabilities toward public open data use as well as large-scale computational and human interface requirements. We untangle the parameters for selecting a preferable solution for bioinformatics research with particular consideration to using cloud services and Semantic Web technologies. Our analysis leads to research guidelines and recommendations toward the development of future DWFS for the bioinformatics research community.
Affiliation(s)
- Md Rezaul Karim
- Semantics in eHealth and Life Sciences (SeLS), Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland
- Audrey Michel
- School of Biochemistry and Cell Biology, University College Cork, Ireland
- Achille Zappa
- Insight Centre for Data Analytics, National University of Ireland Galway, Dangan, Galway, Ireland
- Pavel Baranov
- School of Biochemistry and Cell Biology, University College Cork, Ireland
- Ratnesh Sahay
- Semantics in eHealth and Life Sciences (SeLS), Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland
2. Torri F, Dinov ID, Zamanyan A, Hobel S, Genco A, Petrosyan P, Clark AP, Liu Z, Eggert P, Pierce J, Knowles JA, Ames J, Kesselman C, Toga AW, Potkin SG, Vawter MP, Macciardi F. Next generation sequence analysis and computational genomics using graphical pipeline workflows. Genes (Basel) 2014; 3:545-75. [PMID: 23139896 PMCID: PMC3490498 DOI: 10.3390/genes3030545] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.9] [Indexed: 12/30/2022]
Abstract
Whole-genome and exome sequencing have already proven to be essential and powerful methods to identify genes responsible for simple Mendelian inherited disorders. These methods can be applied to complex disorders as well, and have been adopted as one of the current mainstream approaches in population genetics. These achievements have been made possible by next generation sequencing (NGS) technologies, which require substantial bioinformatics resources to analyze the dense and complex sequence data. The huge analytical burden of data from genome sequencing might be seen as a bottleneck slowing the publication of NGS papers at this time, especially in psychiatric genetics. We review the existing methods for processing NGS data, to place into context the rationale for the design of a computational resource. We describe our method, the Graphical Pipeline for Computational Genomics (GPCG), to perform the computational steps required to analyze NGS data. The GPCG implements flexible workflows for basic sequence alignment, sequence data quality control, single nucleotide polymorphism analysis, copy number variant identification, annotation, and visualization of results. These workflows cover all the analytical steps required for NGS data, from processing the raw reads to variant calling and annotation. The current version of the pipeline is freely available at http://pipeline.loni.ucla.edu. These applications of NGS analysis may gain clinical utility in the near future (e.g., identifying miRNA signatures in diseases) when the bioinformatics approach is made feasible. Taken together, the annotation tools and strategies that have been developed to retrieve information and test hypotheses about the functional role of variants present in the human genome will help to pinpoint the genetic risk factors for psychiatric disorders.
Affiliation(s)
- Federica Torri
- Department of Psychiatry and Human Behavior, University of California, Irvine, CA 92617, USA
- Biomedical Informatics Research Network (BIRN), Information Sciences Institute, University of Southern California, Los Angeles, CA 90292, USA
- Ivo D. Dinov
- Biomedical Informatics Research Network (BIRN), Information Sciences Institute, University of Southern California, Los Angeles, CA 90292, USA
- Laboratory of Neuro Imaging (LONI), University of California, Los Angeles, CA 90095, USA
- Alen Zamanyan
- Laboratory of Neuro Imaging (LONI), University of California, Los Angeles, CA 90095, USA
- Sam Hobel
- Laboratory of Neuro Imaging (LONI), University of California, Los Angeles, CA 90095, USA
- Alex Genco
- Laboratory of Neuro Imaging (LONI), University of California, Los Angeles, CA 90095, USA
- Petros Petrosyan
- Laboratory of Neuro Imaging (LONI), University of California, Los Angeles, CA 90095, USA
- Andrew P. Clark
- Zilkha Neurogenetic Institute, USC Keck School of Medicine, Los Angeles, CA 90033, USA
- Zhizhong Liu
- Laboratory of Neuro Imaging (LONI), University of California, Los Angeles, CA 90095, USA
- Paul Eggert
- Laboratory of Neuro Imaging (LONI), University of California, Los Angeles, CA 90095, USA
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA
- Jonathan Pierce
- Laboratory of Neuro Imaging (LONI), University of California, Los Angeles, CA 90095, USA
- James A. Knowles
- Zilkha Neurogenetic Institute, USC Keck School of Medicine, Los Angeles, CA 90033, USA
- Joseph Ames
- Biomedical Informatics Research Network (BIRN), Information Sciences Institute, University of Southern California, Los Angeles, CA 90292, USA
- Carl Kesselman
- Biomedical Informatics Research Network (BIRN), Information Sciences Institute, University of Southern California, Los Angeles, CA 90292, USA
- Arthur W. Toga
- Biomedical Informatics Research Network (BIRN), Information Sciences Institute, University of Southern California, Los Angeles, CA 90292, USA
- Laboratory of Neuro Imaging (LONI), University of California, Los Angeles, CA 90095, USA
- Steven G. Potkin
- Department of Psychiatry and Human Behavior, University of California, Irvine, CA 92617, USA
- Biomedical Informatics Research Network (BIRN), Information Sciences Institute, University of Southern California, Los Angeles, CA 90292, USA
- Marquis P. Vawter
- Functional Genomics Laboratory, Department of Psychiatry And Human Behavior, School of Medicine, University of California, Irvine, CA 92697, USA
- Fabio Macciardi
- Department of Psychiatry and Human Behavior, University of California, Irvine, CA 92617, USA
- Biomedical Informatics Research Network (BIRN), Information Sciences Institute, University of Southern California, Los Angeles, CA 90292, USA
- Author to whom correspondence should be addressed; Tel.: +1-949-824-4559; Fax: +1-949-824-2072
3. Jung KS, Moon S, Kim YJ, Kim BJ, Park K. Genovar: a detection and visualization tool for genomic variants. BMC Bioinformatics 2012; 13 Suppl 7:S12. [PMID: 22594998 PMCID: PMC3348018 DOI: 10.1186/1471-2105-13-s7-s12] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Along with single nucleotide polymorphisms (SNPs), copy number variation (CNV) is considered an important source of genetic variation associated with disease susceptibility. Despite the importance of CNV, the tools currently available for its analysis often produce false positive results due to limitations such as low resolution of array platforms, platform specificity, and the type of CNV. To resolve this problem, spurious signals must be separated from true signals by visual inspection. None of the previously reported CNV analysis tools support this function and the simultaneous visualization of comparative genomic hybridization arrays (aCGH) and sequence alignment. The purpose of the present study was to develop a useful program for the efficient detection and visualization of CNV regions that enables the manual exclusion of erroneous signals.

RESULTS: A JAVA-based stand-alone program called Genovar was developed. To ascertain whether a detected CNV region is a novel variant, Genovar compares the detected CNV regions with previously reported CNV regions using the Database of Genomic Variants (DGV, http://projects.tcag.ca/variation) and the Single Nucleotide Polymorphism Database (dbSNP). The current version of Genovar is capable of visualizing genomic data from sources such as the aCGH data file and sequence alignment format files.

CONCLUSIONS: Genovar is freely accessible and provides a user-friendly graphic user interface (GUI) to facilitate the detection of CNV regions. The program also provides comprehensive information to help in the elimination of spurious signals by visual inspection, making Genovar a valuable tool for reducing false positive CNV results.

AVAILABILITY: http://genovar.sourceforge.net/.
Affiliation(s)
- Kwang Su Jung
- Division of Bio-Medical Informatics, Center for Genome Science, Korea National Institute of Health, Osong, 363-951, Korea