1
Saparov A, Zech M. Big data and transformative bioinformatics in genomic diagnostics and beyond. Parkinsonism Relat Disord 2025; 134:107311. [PMID: 39924354 DOI: 10.1016/j.parkreldis.2025.107311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2024] [Revised: 01/23/2025] [Accepted: 01/25/2025] [Indexed: 02/11/2025]
Abstract
The current era of high-throughput analysis-driven research offers invaluable insights into disease etiologies, accurate diagnostics, pathogenesis, and personalized therapy. In the field of movement disorders, investigators are facing an increasing growth in the volume of produced patient-derived datasets, providing substantial opportunities for precision medicine approaches based on extensive information accessibility and advanced annotation practices. Integrating data from multiple sources, including phenomics, genomics, and multi-omics, is crucial for comprehensively understanding different types of movement disorders. Here, we explore formats and analytics of big data generated for patients with movement disorders, including strategies to meaningfully share the data for optimized patient benefit. We review computational methods that are essential to accelerate the process of evaluating the increasing amounts of specialized data collected. Based on concrete examples, we highlight how bioinformatic approaches facilitate the translation of multidimensional biological information into clinically relevant knowledge. Moreover, we outline the feasibility of computer-aided therapeutic target evaluation, and we discuss the importance of expanding the focus of big data research to understudied phenotypes such as dystonia.
Affiliation(s)
- Alice Saparov
- Institute of Human Genetics, Technical University of Munich, School of Medicine and Health, Munich, Germany; Institute of Neurogenomics, Helmholtz Munich, Neuherberg, Germany; Institute for Advanced Study, Technical University of Munich, Garching, Germany
- Michael Zech
- Institute of Human Genetics, Technical University of Munich, School of Medicine and Health, Munich, Germany; Institute of Neurogenomics, Helmholtz Munich, Neuherberg, Germany; Institute for Advanced Study, Technical University of Munich, Garching, Germany.
2
Kamaraj V, Sinha H. SCI-VCF: a cross-platform GUI solution to summarize, compare, inspect and visualize the variant call format. NAR Genom Bioinform 2024; 6:lqae083. [PMID: 38984067 PMCID: PMC11231579 DOI: 10.1093/nargab/lqae083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Revised: 06/03/2024] [Accepted: 07/01/2024] [Indexed: 07/11/2024] Open
Abstract
As genomics advances swiftly and its applications extend to diverse fields, bioinformatics tools must enable researchers and clinicians to work with genomic data irrespective of their programming expertise. We developed SCI-VCF, a Shiny-based comprehensive analysis utility to summarize, compare, inspect, analyse and design interactive visualizations of the genetic variants from the variant call format. With an intuitive graphical user interface, SCI-VCF aims to bridge the approachability gap in genomics that arises from the existing predominantly command-line utilities. SCI-VCF is written in R and is freely available at https://doi.org/10.5281/zenodo.11453080. For installation-free access, users can avail themselves of an online version at https://ibse.shinyapps.io/sci-vcf-online.
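The kind of summary a VCF utility produces can be illustrated with a minimal Python sketch. This is not SCI-VCF's code (SCI-VCF is written in R); the function names `classify_variant` and `summarize_vcf` and the toy records are hypothetical, and the length-based classification is a simplifying assumption that ignores symbolic alleles such as `<DEL>` or `*`.

```python
from collections import Counter


def classify_variant(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair by allele length (simplified convention)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    return "INDEL"


def summarize_vcf(lines):
    """Count variant types over the data lines of a VCF file."""
    counts = Counter()
    for line in lines:
        # Skip blank lines, meta-information (##) and the header (#CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]  # REF and ALT columns
        # A multi-allelic site lists several ALT alleles separated by commas.
        for alt in alts.split(","):
            counts[classify_variant(ref, alt)] += 1
    return counts


# Hypothetical toy records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
    "chr2\t300\t.\tC\tT,CA\t50\tPASS\t.",
]
summary = summarize_vcf(example)
print(dict(summary))
```

On a real file, the same function could be fed an open text handle, since it only iterates over lines.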
Affiliation(s)
- Venkatesh Kamaraj
- Centre for Integrative Biology and Systems Medicine (IBSE), IIT Madras, Chennai 600036, Tamil Nadu, India
- Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai 600036, Tamil Nadu, India
- Himanshu Sinha
- Centre for Integrative Biology and Systems Medicine (IBSE), IIT Madras, Chennai 600036, Tamil Nadu, India
- Robert Bosch Centre for Data Science and Artificial Intelligence (RBCDSAI), IIT Madras, Chennai 600036, Tamil Nadu, India
- Wadhwani School of Data Science and Artificial Intelligence, IIT Madras, Chennai 600036, Tamil Nadu, India
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, IIT Madras, Chennai 600036, Tamil Nadu, India
3
Hassan J, Saeed SM, Deka L, Uddin MJ, Das DB. Applications of Machine Learning (ML) and Mathematical Modeling (MM) in Healthcare with Special Focus on Cancer Prognosis and Anticancer Therapy: Current Status and Challenges. Pharmaceutics 2024; 16:260. [PMID: 38399314 PMCID: PMC10892549 DOI: 10.3390/pharmaceutics16020260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 01/29/2024] [Accepted: 02/07/2024] [Indexed: 02/25/2024] Open
Abstract
The use of data-driven high-throughput analytical techniques, which has given rise to computational oncology, is undisputed. The widespread use of machine learning (ML) and mathematical modeling (MM)-based techniques is widely acknowledged. These two approaches have fueled the advancement in cancer research and eventually led to the uptake of telemedicine in cancer care. For diagnostic, prognostic, and treatment purposes concerning different types of cancer research, vast databases of varied information with manifold dimensions are required, and indeed, all this information can only be managed by an automated system developed utilizing ML and MM. In addition, MM is being used to probe the relationship between the pharmacokinetics and pharmacodynamics (PK/PD interactions) of anti-cancer substances to improve cancer treatment, and also to refine the quality of existing treatment models by being incorporated at all steps of research and development related to cancer and in routine patient care. This review will serve as a consolidation of the advancement and benefits of ML and MM techniques with a special focus on the area of cancer prognosis and anticancer therapy, leading to the identification of challenges (data quantity, ethical consideration, and data privacy) which are yet to be fully addressed in current studies.
Affiliation(s)
- Jasmin Hassan
- Drug Delivery & Therapeutics Lab, Dhaka 1212, Bangladesh; (J.H.); (S.M.S.)
- Lipika Deka
- Faculty of Computing, Engineering and Media, De Montfort University, Leicester LE1 9BH, UK
- Md Jasim Uddin
- Department of Pharmaceutical Technology, Faculty of Pharmacy, Universiti Malaya, Kuala Lumpur 50603, Malaysia
- Diganta B. Das
- Department of Chemical Engineering, Loughborough University, Loughborough LE11 3TU, UK
4
Redekar SS, Varma SL, Bhattacharjee A. Gene co-expression network construction and analysis for identification of genetic biomarkers associated with glioblastoma multiforme using topological findings. J Egypt Natl Canc Inst 2023; 35:22. [PMID: 37482563 DOI: 10.1186/s43046-023-00181-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/24/2022] [Accepted: 07/05/2023] [Indexed: 07/25/2023] Open
Abstract
BACKGROUND Glioblastoma multiforme (GBM) is one of the most malignant types of central nervous system tumors. GBM patients usually have a poor prognosis. Identification of genes associated with the progression of the disease is essential to explain the mechanisms or improve the prognosis of GBM through targeted therapy. It is crucial to develop a methodology for constructing a biological network and analyzing it to identify potential biomarkers associated with disease progression. METHODS Gene expression datasets are obtained from the TCGA data repository to carry out this study. A survival analysis is performed to identify survival-associated genes in GBM patients. A gene co-expression network is constructed based on the Pearson correlation between gene expression profiles. Various topological measures, along with set operations from graph theory, are applied to identify the most influential genes linked with the progression of GBM. RESULTS Ten key genes are identified as potential biomarkers associated with GBM based on centrality measures applied to the disease network. These genes are SEMA3B, APS, SLC44A2, MARK2, PITPNM2, SFRP1, PRLH, DIP2C, CTSZ, and KRTAP4.2. Higher expression values of two genes, SLC44A2 and KRTAP4.2, are found to be associated with progression, and lower expression values of eight genes, SEMA3B, APS, MARK2, PITPNM2, SFRP1, PRLH, DIP2C, and CTSZ, are linked with the progression of GBM. CONCLUSIONS The proposed methodology employs a network topological approach to identify genetic biomarkers associated with cancer.
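The network construction described in the METHODS can be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the correlation cutoff of 0.8, the use of the absolute correlation, the choice of degree centrality, and the toy expression values are all assumptions, and the gene names are placeholders.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Raises ZeroDivisionError for a constant profile (zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_edges(expression, threshold=0.8):
    """Connect two genes when |r| meets the (assumed) threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


def degree_centrality(genes, edges):
    """Fraction of the other genes that each gene is connected to."""
    degree = {g: 0 for g in genes}
    for g1, g2, _ in edges:
        degree[g1] += 1
        degree[g2] += 1
    n = len(degree)
    return {g: d / (n - 1) for g, d in degree.items()}


# Hypothetical expression values across four samples.
toy = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.0],
    "GENE_C": [4.0, 3.0, 2.0, 1.0],
    "GENE_D": [1.0, 3.0, 2.0, 3.0],
}
edges = coexpression_edges(toy)
centrality = degree_centrality(toy, edges)
for gene, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(gene, round(score, 2))
```

Genes with the highest centrality would be the candidates carried forward; other measures such as betweenness or closeness could be substituted without changing the overall structure.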
Affiliation(s)
- Seema Sandeep Redekar
- Pillai College of Engineering, New Panvel, Mumbai, India.
- SIES Graduate School of Technology, Navi Mumbai, Mumbai, India.
5
Oza VH, Whitlock JH, Wilk EJ, Uno-Antonison A, Wilk B, Gajapathy M, Howton TC, Trull A, Ianov L, Worthey EA, Lasseigne BN. Ten simple rules for using public biological data for your research. PLoS Comput Biol 2023; 19:e1010749. [PMID: 36602970 DOI: 10.1371/journal.pcbi.1010749] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/06/2023] Open
Abstract
With an increasing amount of biological data available publicly, there is a need for a guide on how to successfully download and use this data. The 10 simple rules for using public biological data are: (1) use public data purposefully in your research; (2) evaluate data for your use case; (3) check data reuse requirements and embargoes; (4) be aware of ethics for data reuse; (5) plan for data storage and compute requirements; (6) know what you are downloading; (7) download programmatically and verify integrity; (8) properly cite data; (9) make reprocessed data and models Findable, Accessible, Interoperable, and Reusable (FAIR) and share; and (10) make pipelines and code FAIR and share. These rules are intended as a guide for researchers wanting to make use of available data and to increase data reuse and reproducibility.
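Rule (7), verifying integrity after a programmatic download, can be sketched in Python as below. The sketch assumes the repository publishes a SHA-256 checksum (many publish MD5 instead, in which case `hashlib.md5` would take its place); the helper names and the demo file are hypothetical, and no network access is performed.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True when the local file matches the published checksum."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


# Demo with a temporary file standing in for a downloaded dataset.
payload = b"@read1\nACGT\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fastq"
    demo.write_bytes(payload)
    published = hashlib.sha256(payload).hexdigest()  # stands in for the repository's value
    intact = verify_download(demo, published)
    corrupted = verify_download(demo, "0" * 64)
print(intact, corrupted)
```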
Affiliation(s)
- Vishal H Oza
- Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Jordan H Whitlock
- Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Elizabeth J Wilk
- Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Angelina Uno-Antonison
- Center for Computational Genomics and Data Sciences, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Department of Pediatrics, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Department of Pathology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Brandon Wilk
- Center for Computational Genomics and Data Sciences, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Department of Pediatrics, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Department of Pathology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Manavalan Gajapathy
- Center for Computational Genomics and Data Sciences, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Department of Pediatrics, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Department of Pathology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Timothy C Howton
- Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Austyn Trull
- Center for Computational Genomics and Data Sciences, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Department of Pediatrics, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Department of Pathology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Lara Ianov
- Civitan International Research Center, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Elizabeth A Worthey
- Center for Computational Genomics and Data Sciences, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Department of Pediatrics, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Department of Pathology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Brittany N Lasseigne
- Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, Alabama, United States of America
6
Vuong P, Wise MJ, Whiteley AS, Kaur P. Ten simple rules for investigating (meta)genomic data from environmental ecosystems. PLoS Comput Biol 2022; 18:e1010675. [PMID: 36480496 PMCID: PMC9731419 DOI: 10.1371/journal.pcbi.1010675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/13/2022] Open
Affiliation(s)
- Paton Vuong
- UWA School of Agriculture & Environment, University of Western Australia, Perth, Australia
- Michael J. Wise
- School of Physics, Mathematics and Computing, University of Western Australia, Perth, Australia
- The Marshall Centre of Infectious Diseases, School of Biological Sciences, The University of Western Australia, Perth, Australia
- Andrew S. Whiteley
- Centre for Environment & Life Sciences, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Floreat, Australia
- Parwinder Kaur
- UWA School of Agriculture & Environment, University of Western Australia, Perth, Australia
7
Labani M, Beheshti A, Lovell NH, Alinejad-Rokny H, Afrasiabi A. KARAJ: An Efficient Adaptive Multi-Processor Tool to Streamline Genomic and Transcriptomic Sequence Data Acquisition. Int J Mol Sci 2022; 23:14418. [PMID: 36430895 PMCID: PMC9694301 DOI: 10.3390/ijms232214418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2022] [Revised: 11/15/2022] [Accepted: 11/17/2022] [Indexed: 11/22/2022] Open
Abstract
Here we developed KARAJ, a fast and flexible Linux command-line tool to automate the end-to-end process of querying and downloading a wide range of genomic and transcriptomic sequence data types. The input to KARAJ is a list of PMCIDs, publication URLs, or various types of accession numbers, from which it automates four tasks: first, it provides a summary list of accessible datasets generated by or used in these scientific articles, enabling users to select appropriate datasets; second, it calculates the size of the files that users want to download and confirms the availability of adequate space on the local disk; third, it generates a metadata table containing sample information and the experimental design of the corresponding study; and last, it enables users to download supplementary data tables attached to publications. Further, KARAJ provides a parallel downloading framework powered by Aspera Connect, which reduces the downloading time significantly.
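The second task, checking that the local disk can hold the planned download, can be sketched in Python as follows. This is not KARAJ's implementation; the function names, the 10% safety margin, and the file sizes are hypothetical. Only `shutil.disk_usage` from the standard library is relied on.

```python
import shutil


def total_download_size(file_sizes):
    """Sum the sizes (in bytes) of all files selected for download."""
    return sum(file_sizes.values())


def has_enough_space(required_bytes, target_dir=".", safety_margin=0.1):
    """Check free space in target_dir, keeping an (assumed) 10% headroom."""
    free = shutil.disk_usage(target_dir).free
    return free >= required_bytes * (1 + safety_margin)


# Hypothetical file sizes as they might be reported by a repository.
planned = {
    "sample1.fastq.gz": 2_500_000_000,
    "sample2.fastq.gz": 3_100_000_000,
}
needed = total_download_size(planned)
if has_enough_space(needed):
    print(f"OK to download {needed / 1e9:.1f} GB")
else:
    print(f"Not enough free space for {needed / 1e9:.1f} GB")
```

Running the check before any transfer starts avoids a partially completed download that fails midway.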
Affiliation(s)
- Mahdieh Labani
- Biomedical Machine Learning Lab, The Graduate School of Biomedical Engineering, University of New South Wales (UNSW), Sydney, NSW 2052, Australia
- Data Analytics Lab, Department of Computing, Macquarie University, Sydney, NSW 2109, Australia
- Amin Beheshti
- Data Analytics Lab, Department of Computing, Macquarie University, Sydney, NSW 2109, Australia
- Nigel H. Lovell
- The Graduate School of Biomedical Engineering (GSBmE), University of New South Wales (UNSW), Sydney, NSW 2052, Australia
- Tyree Institute of Health Engineering (IHealthE), University of New South Wales (UNSW), Sydney, NSW 2052, Australia
- Hamid Alinejad-Rokny
- Biomedical Machine Learning Lab, The Graduate School of Biomedical Engineering, University of New South Wales (UNSW), Sydney, NSW 2052, Australia
- UNSW Data Science Hub, University of New South Wales (UNSW), Sydney, NSW 2052, Australia
- Health Data Analytics Program, Centre for Applied Artificial Intelligence, Macquarie University, Sydney, NSW 2109, Australia
- Ali Afrasiabi
- Biomedical Machine Learning Lab, The Graduate School of Biomedical Engineering, University of New South Wales (UNSW), Sydney, NSW 2052, Australia
- Centre for Immunology and Allergy Research, Westmead Institute for Medical Research, University of Sydney, Sydney, NSW 2006, Australia
8
Lang M, Zawati MH. Returning individual research results in international direct-to-participant genomic research: results from a 31-country study. Eur J Hum Genet 2022; 30:1132-1137. [PMID: 35478220 PMCID: PMC9553878 DOI: 10.1038/s41431-022-01103-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/27/2021] [Revised: 03/25/2022] [Accepted: 04/12/2022] [Indexed: 12/15/2022] Open
Abstract
This paper summarizes the results of a 31-country qualitative study of expert perspectives on the regulation of international "direct-to-participant" (DTP) genomic research. We outline how the practice of directly recruiting participants for genomic studies online complicates ethics and regulatory considerations for the return of individual research results. As part of a larger project supported by the National Human Genome Research Institute, National Institutes of Health, we prepared and distributed to 31 global legal experts a questionnaire intended to ascertain opinions and perspectives on the way international DTP genomic research is likely to be regulated. We found significant disagreement across jurisdictions on the most favorable approach to managing such results, with some countries favoring return by default and others preferring to return only with the express consent of research participants. We conclude by outlining policy considerations that should guide researcher practices in this context. As international DTP genomic research evolves, jurists and ethicists should be attentive to the ways novel approaches to subject recruitment align with existing ethical and regulatory norms in research with human participants. This paper is a preliminary step toward documenting such alignment in the context of the return of individual research results.
Affiliation(s)
- Michael Lang
- Faculty of Medicine and Health Sciences, Centre of Genomics and Policy, McGill University, Montreal, QC, Canada
- Ma'n H Zawati
- Faculty of Medicine and Health Sciences, Centre of Genomics and Policy, McGill University, Montreal, QC, Canada.
9
Dzinovic I, Boesch S, Škorvánek M, Necpál J, Švantnerová J, Pavelekova P, Havránková P, Tsoma E, Indelicato E, Runkel E, Held V, Weise D, Janzarik W, Eckenweiler M, Berweck S, Mall V, Haslinger B, Jech R, Winkelmann J, Zech M. Genetic overlap between dystonia and other neurologic disorders: A study of 1,100 exomes. Parkinsonism Relat Disord 2022; 102:1-6. [PMID: 35872528 DOI: 10.1016/j.parkreldis.2022.07.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2022] [Revised: 06/29/2022] [Accepted: 07/08/2022] [Indexed: 10/17/2022]
Abstract
INTRODUCTION Although shared genetic factors have been previously reported between dystonia and other neurologic conditions, no sequencing study exploring such links is available. In a large dystonic cohort, we aimed at analyzing the proportions of causative variants in genes associated with disease categories other than dystonia. METHODS Gene findings related to whole-exome sequencing-derived diagnoses in 1100 dystonia index cases were compared with expert-curated molecular testing panels for ataxia, parkinsonism, spastic paraplegia, neuropathy, epilepsy, and intellectual disability. RESULTS Among 220 diagnosed patients, 21% had variants in ataxia-linked genes; 15% in parkinsonism-linked genes; 15% in spastic-paraplegia-linked genes; 12% in neuropathy-linked genes; 32% in epilepsy-linked genes; and 65% in intellectual-disability-linked genes. Most diagnosed presentations (80%) were related to genes listed in ≥1 studied panel; 71% of the involved loci were found in the non-dystonia panels but not in an expert-curated gene list for dystonia. CONCLUSIONS Our study indicates a convergence in the genetics of dystonia and other neurologic phenotypes, informing diagnostic evaluation strategies and pathophysiological considerations.
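The panel comparison in the METHODS amounts to set intersections between diagnosed genes and curated gene lists. The Python sketch below illustrates that idea at the gene level only; the study reports proportions of diagnosed patients, which would additionally require per-patient records. The function name, panel contents, and gene identifiers are all hypothetical placeholders.

```python
def panel_overlap(diagnosed_genes, panels):
    """Return per-panel and any-panel fractions of diagnosed genes."""
    diagnosed = set(diagnosed_genes)
    in_any_panel = set()
    per_panel = {}
    for name, genes in panels.items():
        shared = diagnosed & set(genes)  # genes found in this panel
        per_panel[name] = len(shared) / len(diagnosed)
        in_any_panel |= shared
    return per_panel, len(in_any_panel) / len(diagnosed)


# Placeholder identifiers, not real panel contents.
diagnosed = ["GENE1", "GENE2", "GENE3", "GENE4"]
panels = {
    "ataxia": ["GENE1", "GENE2", "GENE9"],
    "epilepsy": ["GENE2", "GENE3"],
}
per_panel, any_panel = panel_overlap(diagnosed, panels)
print(per_panel, any_panel)
```

Because a gene can sit in several panels, the per-panel fractions need not sum to the any-panel fraction, which matches the way the percentages in the abstract exceed 100% in total.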
Affiliation(s)
- Ivana Dzinovic
- Institute of Neurogenomics, Helmholtz Zentrum München, Munich, Germany; Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
- Sylvia Boesch
- Department of Neurology, Medical University of Innsbruck, Innsbruck, Austria
- Matej Škorvánek
- Department of Neurology, P.J. Safarik University, Kosice, Slovak Republic; Department of Neurology, University Hospital of L. Pasteur, Kosice, Slovak Republic
- Ján Necpál
- Department of Neurology, Zvolen Hospital, Slovakia
- Jana Švantnerová
- Second Department of Neurology, Faculty of Medicine, Comenius University, University Hospital Bratislava, Bratislava, Slovakia
- Petra Pavelekova
- Department of Neurology, P.J. Safarik University, Kosice, Slovak Republic; Department of Neurology, University Hospital of L. Pasteur, Kosice, Slovak Republic
- Petra Havránková
- Department of Neurology, Charles University, 1st Faculty of Medicine and General University Hospital in Prague, Prague, Czech Republic
- Eugenia Tsoma
- Regional Clinical Center of Neurosurgery and Neurology, Department of Family Medicine and Outpatient Care, Uzhhorod National University, Uzhhorod, Ukraine
- Eva Runkel
- Klinikum Aschaffenburg-Alzenau, Aschaffenburg, Germany
- Valentin Held
- Department of Neurology, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- David Weise
- Klinik für Neurologie, Asklepios Fachklinikum Stadtroda, Stadtroda, Germany; Department of Neurology, University of Leipzig, Leipzig, Germany
- Wibke Janzarik
- Department of Neuropediatrics and Muscle Disorders, University Medical Center, Faculty of Medicine, University of Freiburg, Germany
- Matthias Eckenweiler
- Department of Neuropediatrics and Muscle Disorders, University Medical Center, Faculty of Medicine, University of Freiburg, Germany
- Steffen Berweck
- Ludwig Maximilian University of Munich, Munich, Germany; Hospital for Neuropediatrics and Neurological Rehabilitation, Centre of Epilepsy for Children and Adolescents, Schoen Klinik Vogtareuth, Vogtareuth, Germany
- Volker Mall
- Lehrstuhl für Sozialpädiatrie, Technische Universität München, Munich, Germany; kbo-Kinderzentrum München, Munich, Germany
- Bernhard Haslinger
- Department of Neurology, Klinikum rechts der Isar, Technical University of Munich, School of Medicine, Munich, Germany
- Robert Jech
- Department of Neurology, Charles University, 1st Faculty of Medicine and General University Hospital in Prague, Prague, Czech Republic
- Juliane Winkelmann
- Institute of Neurogenomics, Helmholtz Zentrum München, Munich, Germany; Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany; Lehrstuhl für Neurogenetik, Technische Universität München, Munich, Germany; Munich Cluster for Systems Neurology, SyNergy, Munich, Germany
- Michael Zech
- Institute of Neurogenomics, Helmholtz Zentrum München, Munich, Germany; Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany.
10
Masoumi S, Libbrecht MW, Wiese KC. SigTools: exploratory visualization for genomic signals. Bioinformatics 2022; 38:1126-1128. [PMID: 34718413 DOI: 10.1093/bioinformatics/btab742] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/21/2021] [Revised: 09/29/2021] [Accepted: 10/25/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION With the advancement of sequencing technologies, genomic data sets are constantly being expanded by high volumes of different data types. One recently introduced data type in genomic science is genomic signals, which are usually short-read coverage measurements over the genome. To understand and evaluate the results of such studies, one needs to understand and analyze the characteristics of the input data. RESULTS SigTools is an R-based genomic signals visualization package developed with two objectives: (i) to facilitate genomic signals exploration in order to uncover insights for later model training, refinement and development by including distribution and autocorrelation plots; (ii) to enable genomic signals interpretation by including correlation and aggregation plots. In addition, our corresponding web application, SigTools-Shiny, extends the accessibility scope of these modules to people who are more comfortable working with graphical user interfaces instead of command-line tools. AVAILABILITY AND IMPLEMENTATION SigTools source code, installation guide and manual are freely available at http://github.com/shohre73.
Affiliation(s)
- Shohre Masoumi
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
- Maxwell W Libbrecht
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
- Kay C Wiese
- School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada
11
Stephan T, Burgess SM, Cheng H, Danko CG, Gill CA, Jarvis ED, Koepfli KP, Koltes JE, Lyons E, Ronald P, Ryder OA, Schriml LM, Soltis P, VandeWoude S, Zhou H, Ostrander EA, Karlsson EK. Darwinian genomics and diversity in the tree of life. Proc Natl Acad Sci U S A 2022; 119:e2115644119. [PMID: 35042807 PMCID: PMC8795533 DOI: 10.1073/pnas.2115644119] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Indexed: 12/21/2022] Open
Abstract
Genomics encompasses the entire tree of life, both extinct and extant, and the evolutionary processes that shape this diversity. To date, genomic research has focused on humans, a small number of agricultural species, and established laboratory models. Fewer than 18,000 of ∼2,000,000 eukaryotic species (<1%) have a representative genome sequence in GenBank, and only a fraction of these have ancillary information on genome structure, genetic variation, gene expression, epigenetic modifications, and population diversity. This imbalance reflects a perception that human studies are paramount in disease research. Yet understanding how genomes work, and how genetic variation shapes phenotypes, requires a broad view that embraces the vast diversity of life. We have the technology to collect massive and exquisitely detailed datasets about the world, but expertise is siloed into distinct fields. A new approach, integrating comparative genomics with cell and evolutionary biology, ecology, archaeology, anthropology, and conservation biology, is essential for understanding and protecting ourselves and our world. Here, we describe potential for scientific discovery when comparative genomics works in close collaboration with a broad range of fields as well as the technical, scientific, and social constraints that must be addressed.
Affiliation(s)
- Taylorlyn Stephan
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20817
- Shawn M Burgess
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20817
- Hans Cheng
- Avian Disease and Oncology Laboratory, Agricultural Research Service, US Department of Agriculture, East Lansing, MI 48823
- Charles G Danko
- Department of Biomedical Sciences, Baker Institute for Animal Health, Cornell University, Ithaca, NY 14850
- Clare A Gill
- Department of Animal Science, Texas A&M University, College Station, TX 77843
- Erich D Jarvis
- Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY 10065
- HHMI, Chevy Chase, MD 20815
- Klaus-Peter Koepfli
- Smithsonian-Mason School of Conservation, George Mason University, Front Royal, VA 22630
- Smithsonian Conservation Biology Institute, National Zoological Park, Washington, DC 20008
- James E Koltes
- Department of Animal Science, Iowa State University, Ames, IA 50011
- Eric Lyons
- School of Plant Sciences, BIO5 Institute, University of Arizona, Tucson, AZ 85721
- Pamela Ronald
- Department of Plant Pathology, University of California, Davis, CA 95616
- The Genome Center, University of California, Davis, CA 95616
- The Innovative Genomics Institute, University of California, Berkeley, CA 94720
- Grass Genetics, Joint Bioenergy Institute, Emeryville, CA 94608
- Oliver A Ryder
- San Diego Zoo Wildlife Alliance, Escondido, CA 92027
- Department of Evolution, Behavior, and Ecology, University of California San Diego, La Jolla, CA 92093
- Lynn M Schriml
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201
- Pamela Soltis
- Florida Museum of Natural History, University of Florida, Gainesville, FL 32611
- Sue VandeWoude
- Department of Micro-, Immuno-, and Pathology, Colorado State University, Fort Collins, CO 80532
- Huaijun Zhou
- Department of Animal Science, University of California, Davis, CA 95616
- Elaine A Ostrander
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20817
- Elinor K Karlsson
- Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01655
- Program in Molecular Medicine, University of Massachusetts Medical School, Worcester, MA 01655
- Broad Institute of MIT and Harvard, Cambridge, MA 02142
| |
|
12
|
Zadissa A, Apweiler R. Data Mining, Quality and Management in the Life Sciences. Methods Mol Biol 2022; 2449:3-25. [PMID: 35507257 DOI: 10.1007/978-1-0716-2095-3_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
With ever more emphasis placed on open science and its invaluable benefits to the scientific community, a research project no longer simply ends with a scientific publication. The benefits of data sharing and reproducibility of results have taken center stage in life science research, supported by the FAIR principles, which firmly underline the importance of open data. Current data-intensive multidisciplinary research has also highlighted the significance of how data are mined and managed. Here we describe some of the features adopted by EMBL-EBI data resources to support data mining, data quality, and data management. We also highlight how EMBL-EBI has responded to the current pandemic through its data resources.
Affiliation(s)
- Amonida Zadissa
- EMBL-EBI, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
| | - Rolf Apweiler
- EMBL-EBI, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
| |
|
13
|
Lee CT, Maragkakis M. SamQL: a structured query language and filtering tool for the SAM/BAM file format. BMC Bioinformatics 2021; 22:474. [PMID: 34600480 PMCID: PMC8487582 DOI: 10.1186/s12859-021-04390-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/26/2021] [Accepted: 09/22/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The Sequence Alignment/Map Format Specification (SAM) is one of the most widely adopted file formats in bioinformatics and many researchers use it daily. Several tools, including most high-throughput sequencing read aligners, use it as their primary output and many more tools have been developed to process it. However, despite its flexibility, SAM-encoded files can often be difficult to query and understand even for experienced bioinformaticians. As genomic data are rapidly growing, structured and efficient queries on data that are encoded in SAM/BAM files are becoming increasingly important. Existing tools are very limited in their query capabilities or are not efficient. Critically, new tools that address these shortcomings should not only be able to support existing large datasets but should also do so without requiring massive data transformations and file infrastructure reorganizations. RESULTS Here we introduce SamQL, an SQL-like query language for the SAM format with intuitive syntax that supports complex and efficient queries on top of SAM/BAM files and that can replace commonly used Bash one-liners employed by many bioinformaticians. SamQL has high expressive power with no upper limit on query size and, when parallelized, outperforms other substantially less expressive software. CONCLUSIONS SamQL is a complete query language that we envision as a step to a structured database engine for genomics. SamQL is written in Go, and is freely available as a standalone program and as an open-source library under an MIT license, https://github.com/maragkakislab/samql/.
Affiliation(s)
- Christopher T Lee
- Laboratory of Genetics and Genomics, National Institute on Aging, Intramural Research Program, National Institutes of Health, Baltimore, MD, 21224, USA
| | - Manolis Maragkakis
- Laboratory of Genetics and Genomics, National Institute on Aging, Intramural Research Program, National Institutes of Health, Baltimore, MD, 21224, USA
| |
|
14
|
Adanur Dedeturk B, Soran A, Bakir-Gungor B. Blockchain for genomics and healthcare: a literature review, current status, classification and open issues. PeerJ 2021; 9:e12130. [PMID: 34703661 PMCID: PMC8487622 DOI: 10.7717/peerj.12130] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/17/2021] [Accepted: 08/17/2021] [Indexed: 11/20/2022] Open
Abstract
The tremendous boost in next generation sequencing technologies and in "omics" technologies has resulted in the generation of hundreds of gigabytes of data per day. Nowadays, by integrating -omics data with other data types, such as imaging and electronic health record (EHR) data, panomics studies attempt to identify novel and potentially actionable biomarkers for personalized medicine applications. In this respect, for the accurate analysis of -omics data and EHR, there is a need to establish secure and robust pipelines that take the ethical aspects into consideration and regulate privacy, ownership, and data sharing. Blockchain technology has recently attracted significant attention in diverse fields, including genomics, since it offers a new solution for these problems from a different perspective. Blockchain is an immutable transaction ledger, which offers a secure and distributed system without a central authority. Within the system, each transaction can be expressed with cryptographically signed blocks, and the verification of transactions is performed by the users of the network. In this review, firstly, we aim to highlight the challenges of EHR and genomic data sharing. Secondly, we attempt to answer in detail "Why" or "Why not" blockchain technology is suitable for genomics and healthcare applications. Thirdly, we elucidate the general blockchain structure based on Ethereum, which is a more suitable technology for genomic data sharing platforms. Fourthly, we review current blockchain-based EHR and genomic data sharing platforms, evaluate the advantages and disadvantages of these applications, and classify these applications using different metrics. Finally, we conclude by discussing the open issues and introducing our suggestion on the topic.
In summary, to facilitate the diagnosis, monitoring and therapy of diseases with the effective analysis of -omics data with other available data types, through this review, we put forward the possible implications of the blockchain technology to life sciences and healthcare.
Affiliation(s)
| | - Ahmet Soran
- Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
| | - Burcu Bakir-Gungor
- Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
| |
|
15
|
Laine E, Eismann S, Elofsson A, Grudinin S. Protein sequence-to-structure learning: Is this the end(-to-end revolution)? Proteins 2021; 89:1770-1786. [PMID: 34519095 DOI: 10.1002/prot.26235] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 05/05/2021] [Revised: 08/16/2021] [Accepted: 09/03/2021] [Indexed: 01/08/2023]
Abstract
The potential of deep learning has been recognized in the protein structure prediction community for some time, and became indisputable after CASP13. In CASP14, deep learning has boosted the field to unanticipated levels reaching near-experimental accuracy. This success comes from advances transferred from other machine learning areas, as well as methods specifically designed to deal with protein sequences and structures, and their abstractions. Novel emerging approaches include (i) geometric learning, that is, learning on representations such as graphs, three-dimensional (3D) Voronoi tessellations, and point clouds; (ii) pretrained protein language models leveraging attention; (iii) equivariant architectures preserving the symmetry of 3D space; (iv) use of large meta-genome databases; (v) combinations of protein representations; and (vi) finally truly end-to-end architectures, that is, differentiable models starting from a sequence and returning a 3D structure. Here, we provide an overview and our opinion of the novel deep learning approaches developed in the last 2 years and widely used in CASP14.
Affiliation(s)
- Elodie Laine
- Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, France
| | - Stephan Eismann
- Department of Computer Science and Applied Physics, Stanford University, Stanford, California, USA
| | - Arne Elofsson
- Department of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University, Solna, Sweden
| | - Sergei Grudinin
- Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, Grenoble, France
| |
|
16
|
Lee S, Lam SH, Hernandes Rocha TA, Fleischman RJ, Staton CA, Taylor R, Limkakeng AT. Machine Learning and Precision Medicine in Emergency Medicine: The Basics. Cureus 2021; 13:e17636. [PMID: 34646684 PMCID: PMC8485701 DOI: 10.7759/cureus.17636] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Accepted: 09/01/2021] [Indexed: 12/28/2022] Open
Abstract
As machine learning (ML) and precision medicine become more readily available and used in practice, emergency physicians must understand the potential advantages and limitations of the technology. This narrative review focuses on the key components of machine learning, artificial intelligence, and precision medicine in emergency medicine (EM). Based on our content expertise, we identified articles from the EM literature. The authors provided a narrative summary of each piece of literature. Next, the authors provided an introduction to the concepts of ML, artificial intelligence as an extension of ML, and precision medicine. This was followed by concrete examples of their applications in practice and research. Subsequently, we shared our thoughts on how to consume the existing research in these subjects and conduct high-quality research for academic emergency medicine. We foresee that the EM community will continue to adapt machine learning, artificial intelligence, and precision medicine in research and practice. We described several key components using our expertise.
Affiliation(s)
- Sangil Lee
- Emergency Medicine, University of Iowa Carver College of Medicine, Iowa City, USA
| | - Samuel H Lam
- Emergency Medicine, Sutter Medical Center, Sacramento, USA
| | | | | | - Catherine A Staton
- Division of Emergency Medicine, Department of Surgery, Duke University School of Medicine, Durham, USA
| | - Richard Taylor
- Department of Emergency Medicine, Yale University, New Haven, USA
| | - Alexander T Limkakeng
- Division of Emergency Medicine, Department of Surgery, Duke University School of Medicine, Durham, USA
| |
|
17
|
McEntire KD, Gage M, Gawne R, Hadfield MG, Hulshof C, Johnson MA, Levesque DL, Segura J, Pinter-Wollman N. Understanding Drivers of Variation and Predicting Variability Across Levels of Biological Organization. Integr Comp Biol 2021; 61:2119-2131. [PMID: 34259842 DOI: 10.1093/icb/icab160] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/27/2021] [Revised: 07/06/2021] [Accepted: 07/12/2021] [Indexed: 12/27/2022] Open
Abstract
Differences within a biological system are ubiquitous, creating variation in nature. Variation underlies all evolutionary processes and allows persistence and resilience in changing environments; thus, uncovering the drivers of variation is critical. The growing recognition that variation is central to biology presents a timely opportunity for determining unifying principles that drive variation across biological levels of organization. Currently, most studies that consider variation are focused at a single biological level and not integrated into a broader perspective. Here we explain what variation is and how it can be measured. We then discuss the importance of variation in natural systems, and briefly describe the biological research that has focused on variation. We outline some of the barriers and solutions to studying variation and its drivers in biological systems. Finally, we detail the challenges and opportunities that may arise when studying the drivers of variation due to the multi-level nature of biological systems. Examining the drivers of variation will lead to a reintegration of biology. It will further forge interdisciplinary collaborations and open opportunities for training diverse quantitative biologists. We anticipate that these insights will inspire new questions and new analytic tools to study the fundamental questions of what drives variation in biological systems and how variation has shaped life.
Affiliation(s)
| | | | | | | | | | | | - Danielle L Levesque
- University of Maine College of Natural Sciences Forestry and Agriculture, School of Biology and Ecology
| | | | | |
|
18
|
Kolla L, Gruber FK, Khalid O, Hill C, Parikh RB. The case for AI-driven cancer clinical trials - The efficacy arm in silico. Biochim Biophys Acta Rev Cancer 2021; 1876:188572. [PMID: 34082064 DOI: 10.1016/j.bbcan.2021.188572] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 03/08/2021] [Revised: 05/19/2021] [Accepted: 05/22/2021] [Indexed: 10/21/2022]
Abstract
Pharmaceutical agents in oncology currently have high attrition rates from early to late phase clinical trials. Recent advances in computational methods, notably causal artificial intelligence, and availability of rich clinico-genomic databases have made it possible to simulate the efficacy of cancer drug protocols in diverse patient populations, which could inform and improve clinical trial design. Here, we review the current and potential use of in silico trials and causal AI to increase the efficacy and safety of traditional clinical trials. We conclude that in silico trials using causal AI approaches can simulate control and efficacy arms, inform patient recruitment and regimen titrations, and better enable subgroup analyses critical for precision medicine.
Affiliation(s)
- Likhitha Kolla
- Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | | | | | | | - Ravi B Parikh
- Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| |
|
19
|
Miyachi K, Mackey TK. hOCBS: A privacy-preserving blockchain framework for healthcare data leveraging an on-chain and off-chain system design. Inf Process Manag 2021. [DOI: 10.1016/j.ipm.2021.102535] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Indexed: 12/22/2022]
|
20
|
Abstract
Genomics is both a data- and compute-intensive discipline. The success of genomics depends on an adequate informatics infrastructure that can address growing data demands and enable a diverse range of resource-intensive computational activities. Designing a suitable infrastructure is a challenging task, and its success largely depends on its adoption by users. In this article, we take a user-centric view of genomics, where users are bioinformaticians, computational biologists, and data scientists. We try to take their point of view on how traditional computational activities for genomics are expanding due to data growth, as well as the introduction of big data and cloud technologies. The changing landscape of computational activities and new user requirements will influence the design of future genomics infrastructures.
Affiliation(s)
- Ritesh Krishna
- IBM Research Europe, The Hartree Centre STFC Laboratory, Warrington WA4 4AD, UK
| | - Vadim Elisseev
- IBM Research Europe, The Hartree Centre STFC Laboratory, Warrington WA4 4AD, UK
| |
|
21
|
Kanungo S, Barr J, Crutchfield P, Fealko C, Soares N. Ethical Considerations on Pediatric Genetic Testing Results in Electronic Health Records. Appl Clin Inform 2020; 11:755-763. [PMID: 33176390 DOI: 10.1055/s-0040-1718753] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 10/23/2022] Open
Abstract
BACKGROUND Advances in technology and access to expanded genetic testing have resulted in more children and adolescents receiving genetic testing for diagnostic and prognostic purposes. With increased adoption of the electronic health record (EHR), genetic test results are increasingly reported in the EHR. However, this leads to challenges in both storage and disclosure of genetic results, particularly when parental results are combined with child genetic results. PRIVACY AND ETHICAL CONSIDERATIONS Accidental disclosure and erroneous documentation of genetic results can occur due to the nature of their presentation in the EHR and documentation processes by clinicians. Genetic information is both sensitive and identifying, and requires a considered approach to both timing and extent of disclosure to families and access to clinicians. METHODS This article uses an interdisciplinary approach to explore ethical issues surrounding privacy, confidentiality of genetic data, and access to genetic results by health care providers and family members, and provides suggestions in a stakeholder format for best practices on this topic for clinicians and informaticians. Suggestions are made for clinicians on documenting and accessing genetic information in the EHR, and on collaborating with genetics specialists and disclosure of genetic results to families. Additional considerations for families including ethics around results of adolescents and special scenarios for blended families and foster minors are also provided. Finally, administrators and informaticians are provided best practices on both institutional processes and EHR architecture, including security and access control, with emphasis on the minimum necessary paradigm and parent/patient engagement and control of the use and disclosure of data.
CONCLUSION The authors hope that these best practices energize specialty societies to craft practice guidelines on genetic information management in the EHR with interdisciplinary input that addresses all stakeholder needs.
Affiliation(s)
- Shibani Kanungo
- Pediatric and Adolescent Medicine, Western Michigan University Homer Stryker M.D. School of Medicine, Kalamazoo, Michigan, United States
| | - Jayne Barr
- Internal Medicine-Pediatrics, MetroHealth, Cleveland, Ohio, United States
| | - Parker Crutchfield
- Medical Ethics, Humanities, and Law, Western Michigan University Homer Stryker M.D. School of Medicine, Kalamazoo, Michigan, United States
| | - Casey Fealko
- Pediatric and Adolescent Medicine, Western Michigan University Homer Stryker M.D. School of Medicine, Kalamazoo, Michigan, United States
| | - Neelkamal Soares
- Pediatric and Adolescent Medicine, Western Michigan University Homer Stryker M.D. School of Medicine, Kalamazoo, Michigan, United States
| |
|
22
|
Chanda P, Costa E, Hu J, Sukumar S, Van Hemert J, Walia R. Information Theory in Computational Biology: Where We Stand Today. Entropy (Basel) 2020; 22:E627. [PMID: 33286399 PMCID: PMC7517167 DOI: 10.3390/e22060627] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 04/30/2020] [Revised: 05/31/2020] [Accepted: 06/03/2020] [Indexed: 12/30/2022]
Abstract
"A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and have played a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information theory based concepts and describe their key applications in multiple major areas of research in computational biology: gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.
Affiliation(s)
- Pritam Chanda
- Corteva Agriscience™, Indianapolis, IN 46268, USA
- Computer and Information Science, Indiana University-Purdue University, Indianapolis, IN 46202, USA
| | - Eduardo Costa
- Corteva Agriscience™, Mogi Mirim, Sao Paulo 13801-540, Brazil
| | - Jie Hu
- Corteva Agriscience™, Indianapolis, IN 46268, USA
| | | | | | - Rasna Walia
- Corteva Agriscience™, Johnston, IA 50131, USA
| |
|
23
|
Hubbard A, Bomhoff M, Schmidt CJ. fRNAkenseq: a fully powered-by-CyVerse cloud integrated RNA-sequencing analysis tool. PeerJ 2020; 8:e8592. [PMID: 32461821 PMCID: PMC7231498 DOI: 10.7717/peerj.8592] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/11/2019] [Accepted: 01/18/2020] [Indexed: 11/20/2022] Open
Abstract
Background Decreasing costs make RNA sequencing technologies increasingly affordable for biologists. However, many researchers who can now afford sequencing lack access to resources necessary for downstream analysis. This means that even as algorithms to process RNA-Seq data improve, many biologists still struggle to manage the sheer volume of data produced by next generation sequencing (NGS) technologies. Scalable bioinformatics tools that exploit multiple platforms are needed to democratize bioinformatics resources in the sequencing era. This is essential for equipping many research groups in the life sciences with the tools to process the increasingly unwieldy datasets they produce. Methods One strategy to address this challenge is to develop a modern generation of sequence analysis tools capable of seamless data sharing and communication. Such tools will provide interoperability through offerings of interlinked resources. Systems of interlinked, scalable resources, which often incorporate cloud data storage, are broadly referred to as cyberinfrastructure. Cyberinfrastructure integrated tools will help researchers to robustly analyze large scale datasets by efficiently sharing data burdens across a distributed architecture. Additionally, interoperability will allow emerging tools to cross-adapt features of existing tools. It is important that these tools are designed to be easy to use for biologists. Results We introduce fRNAkenseq, a powered-by-CyVerse RNA sequencing analysis tool that exhibits interoperability with other resources and meets the needs of biologists for comprehensive, easy to use RNA sequencing analysis. fRNAkenseq leverages a complex set of Application Programming Interfaces (APIs) associated with the NSF-funded cyberinfrastructure project, CyVerse, to execute FASTQ-to-differential expression RNA-Seq analyses. 
Integrating across bioinformatics platforms, fRNAkenseq also exploits cloud integration and cross-talk with another CyVerse associated tool, CoGe. fRNAkenseq offers novel features for the biologist such as more robust and comprehensive pipelines for enrichment than those currently available by default in a single tool, whether they are cloud-based or local installation. Importantly, cross-talk with CoGe allows fRNAkenseq users to execute RNA-Seq pipelines on an inventory of 47,000 archived genomes stored in CoGe or upload their own draft genome.
Affiliation(s)
- Allen Hubbard
- Donald Danforth Plant Science Center, Saint Louis, MO, USA
| | - Matthew Bomhoff
- Department of Plant and Soil Sciences, University of Arizona, Tucson, AZ, USA
| | - Carl J Schmidt
- Department of Animal and Food Sciences, University of Delaware, Newark, DE, USA
| |
|
24
|
|