1
Al-Aamri A, Kamarul Azman S, Daw Elbait G, Alsafar H, Henschel A. Critical assessment of on-premise approaches to scalable genome analysis. BMC Bioinformatics 2023; 24:354. [PMID: 37735350 PMCID: PMC10512525 DOI: 10.1186/s12859-023-05470-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Accepted: 09/08/2023] [Indexed: 09/23/2023] Open
Abstract
BACKGROUND Plummeting DNA sequencing cost in recent years has enabled genome sequencing projects to scale up by several orders of magnitude, which is transforming genomics into a highly data-intensive field of research. This development provides the much needed statistical power required for genotype-phenotype predictions in complex diseases. METHODS In order to efficiently leverage the wealth of information, we here assessed several genomic data science tools. The rationale to focus on on-premise installations is to cope with situations where data confidentiality and compliance regulations etc. rule out cloud based solutions. We established a comprehensive qualitative and quantitative comparison between BCFtools, SnpSift, Hail, GEMINI, and OpenCGA. The tools were compared in terms of data storage technology, query speed, scalability, annotation, data manipulation, visualization, data output representation, and availability. RESULTS Tools that leverage sophisticated data structures are noted as the most suitable for large-scale projects in varying degrees of scalability in comparison to flat-file manipulation (e.g., BCFtools, and SnpSift). Remarkably, for small to mid-size projects, even lightweight relational database. CONCLUSION The assessment criteria provide insights into the typical questions posed in scalable genomics and serve as guidance for the development of scalable computational infrastructure in genomics.
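As a rough illustration of the contrast the abstract draws between flat-file filtering and a lightweight relational database, the sketch below loads a few toy variant records into an in-memory SQLite table and runs an indexed region query. The table schema, the records, and the function name `rare_variants_in_region` are hypothetical and are not taken from any of the assessed tools.

```python
import sqlite3

# Hypothetical schema and toy records, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, af REAL)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
    [
        ("chr1", 10177, "A", "AC", 0.42),
        ("chr1", 10352, "T", "TA", 0.44),
        ("chr2", 20500, "G", "A", 0.01),
    ],
)
# An index on (chrom, pos) lets region queries avoid reading every record,
# unlike a line-by-line scan of a flat file.
conn.execute("CREATE INDEX idx_region ON variants (chrom, pos)")


def rare_variants_in_region(conn, chrom, start, end, max_af):
    """Return variants in [start, end] on chrom with allele frequency <= max_af."""
    rows = conn.execute(
        "SELECT chrom, pos, ref, alt, af FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? AND af <= ? ORDER BY pos",
        (chrom, start, end, max_af),
    )
    return rows.fetchall()
```

Calling `rare_variants_in_region(conn, "chr2", 1, 100000, 0.05)` returns the single toy record on chr2; real projects would load records parsed from VCF files instead.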
Affiliation(s)
- Amira Al-Aamri
  - Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Syafiq Kamarul Azman
  - Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Gihan Daw Elbait
  - Department of Biology, College of Arts and Sciences, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
  - Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Habiba Alsafar
  - Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
  - Department of Biomedical Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
- Andreas Henschel
  - Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
  - Center for Biotechnology (BTC), Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates
2
Murata MM, Giuliano AE, Tanaka H. Genome-Wide Analysis of Palindrome Formation with Next-Generation Sequencing (GAPF-Seq) and a Bioinformatics Pipeline for Assessing De Novo Palindromes in Cancer Genomes. Methods Mol Biol 2023; 2660:13-22. [PMID: 37191787 DOI: 10.1007/978-1-0716-3163-8_2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/17/2023]
Abstract
DNA palindromes are a type of chromosomal aberration that appears frequently during tumorigenesis. They are characterized by sequences of nucleotides that are identical to their reverse complements and often arise due to illegitimate repair of DNA double-strand breaks, fusion of telomeres, or stalled replication forks, all of which are common adverse early events in cancer. Here, we describe the protocol for enriching palindromes from genomic DNA sources with low-input DNA amounts and detail a bioinformatics tool for assessing the enrichment and location of de novo palindrome formation from low-coverage whole-genome sequencing data.
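The definition used in this abstract, a sequence identical to its reverse complement, can be made concrete with a minimal Python sketch. It illustrates the definition only and is not part of the GAPF-Seq pipeline; the function names are hypothetical.

```python
# Illustrative only: these helper names are hypothetical, not from GAPF-Seq.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_dna_palindrome(seq: str) -> bool:
    """True if a non-empty sequence is identical to its reverse complement."""
    seq = seq.upper()
    return len(seq) > 0 and seq == reverse_complement(seq)
```

For example, the EcoRI recognition site `GAATTC` satisfies the check, while `GATTACA` does not.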
Affiliation(s)
- Michael M Murata
  - Department of Surgery, Cedars-Sinai Medical Center, West Hollywood, CA, USA
- Armando E Giuliano
  - Department of Surgery, Cedars-Sinai Medical Center, West Hollywood, CA, USA
  - Department of Surgery, Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, West Hollywood, CA, USA
- Hisashi Tanaka
  - Department of Surgery, Cedars-Sinai Medical Center, West Hollywood, CA, USA
  - Department of Surgery, Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, West Hollywood, CA, USA
  - Departments of Surgery and Biomedical Sciences, Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, West Hollywood, CA, USA
3
Merhi G, Koweyes J, Salloum T, Khoury CA, Haidar S, Tokajian S. SARS-CoV-2 genomic epidemiology: data and sequencing infrastructure. Future Microbiol 2022; 17:1001-1007. [PMID: 35899481 PMCID: PMC9332909 DOI: 10.2217/fmb-2021-0207] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022] Open
Abstract
Background: Genomic surveillance of SARS-CoV-2 is critical in monitoring viral lineages. Available data reveal a significant gap between low- and middle-income countries and the rest of the world. Methods: The SARS-CoV-2 sequencing costs using the Oxford Nanopore MinION device and hardware prices for data computation in Lebanon were estimated and compared with those in developed countries. SARS-CoV-2 genomes deposited on the Global Initiative on Sharing All Influenza Data per 1000 COVID-19 cases were determined per country. Results: Sequencing costs in Lebanon were significantly higher compared with those in developed countries. Low- and middle-income countries showed limited sequencing capabilities linked to the lack of support, high prices, long delivery delays and limited availability of trained personnel. Conclusion: The authors recommend the mobilization of funds to develop whole-genome sequencing-based surveillance platforms and the implementation of genomic epidemiology to better identify and track outbreaks, leading to appropriate and mindful interventions. Lebanon and other low- and middle-income countries have limited sequencing capabilities. Sequencing costs using MinION in Lebanon were higher than the approximate sequencing costs in developed countries. The challenges faced by low- and middle-income countries include lack of support, few established sequencing facilities, high prices, long delivery delays and the limited availability of trained personnel. There is a need to focus on the development of whole-genome sequencing-based surveillance platforms and the implementation of genomic epidemiology to improve sequencing efforts in many resource-limited settings and to contain and prevent future pandemic-level outbreaks. Sequencing costs of #SARS-CoV-2 in Lebanon are higher than those in developed countries. #LMICs have limited #sequencing capabilities. 
Whole-genome sequencing-based surveillance platforms and the implementation of genomic epidemiology could improve sequencing efforts.
Affiliation(s)
- Georgi Merhi
  - Department of Natural Sciences, School of Arts & Sciences, Lebanese American University, Byblos, Lebanon
- Jad Koweyes
  - Department of Natural Sciences, School of Arts & Sciences, Lebanese American University, Byblos, Lebanon
- Tamara Salloum
  - Department of Natural Sciences, School of Arts & Sciences, Lebanese American University, Byblos, Lebanon
- Charbel Al Khoury
  - Department of Natural Sciences, School of Arts & Sciences, Lebanese American University, Byblos, Lebanon
- Siwar Haidar
  - Department of Natural Sciences, School of Arts & Sciences, Lebanese American University, Byblos, Lebanon
- Sima Tokajian
  - Department of Natural Sciences, School of Arts & Sciences, Lebanese American University, Byblos, Lebanon
4
Adams DC, Collyer ML. Consilience of methods for phylogenetic analysis of variance. Evolution 2022; 76:1406-1419. [PMID: 35522593 PMCID: PMC9544334 DOI: 10.1111/evo.14512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2021] [Accepted: 03/22/2022] [Indexed: 01/21/2023]
Abstract
Simulation-based and permutation-based inferential methods are commonplace in phylogenetic comparative methods, especially as evolutionary data have become more complex and parametric methods more limited for their analysis. Both approaches simulate many random outcomes from a null model to empirically generate sampling distributions of statistics. Although simulation-based and permutation-based methods seem commensurate in purpose, results from analysis of variance (ANOVA) based on the distributions of random F-statistics produced by these methods can be quite different in practice. Differences could be from either the null-model process that generates variation across many simulations or random permutations of the data, or different estimation methods for linear model coefficients and statistics. Unfortunately, because the null-model process and coefficient estimation are intrinsically linked in phylogenetic ANOVA methods, the precise reason for methodological differences has not been fully considered. Here we show that the null-model processes of phylogenetic simulation and randomizing residuals in a permutation procedure are indeed commensurate, and that both also produce results consistent with parametric ANOVA, for cases where parametric ANOVA is possible. We also provide results that caution against using ordinary least-squares estimation along with phylogenetic simulation; a typical phylogenetic ANOVA implementation.
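The idea of building an empirical sampling distribution of F-statistics by randomizing residuals from a null model can be sketched as follows. This is a deliberately simplified, non-phylogenetic illustration: it uses an intercept-only null model and ordinary one-way ANOVA, ignores phylogenetic covariance entirely, and is not the authors' implementation. The function names are hypothetical.

```python
import random
from statistics import fmean


def f_statistic(y, groups):
    """One-way ANOVA F-statistic for responses y and group labels."""
    labels = sorted(set(groups))
    grand = fmean(y)
    ss_between = 0.0
    ss_within = 0.0
    for g in labels:
        vals = [v for v, lab in zip(y, groups) if lab == g]
        m = fmean(vals)
        ss_between += len(vals) * (m - grand) ** 2
        ss_within += sum((v - m) ** 2 for v in vals)
    df_between = len(labels) - 1
    df_within = len(y) - len(labels)
    return (ss_between / df_between) / (ss_within / df_within)


def permutation_anova(y, groups, n_perm=999, seed=0):
    """Permute null-model residuals to obtain an empirical p-value for F."""
    rng = random.Random(seed)
    observed = f_statistic(y, groups)
    fitted = fmean(y)                       # intercept-only null model
    residuals = [v - fitted for v in y]
    hits = 0
    for _ in range(n_perm):
        shuffled = residuals[:]
        rng.shuffle(shuffled)
        y_star = [fitted + r for r in shuffled]
        if f_statistic(y_star, groups) >= observed:
            hits += 1
    # The observed value counts as one member of the random distribution.
    return observed, (1 + hits) / (n_perm + 1)
```

With clearly separated groups the observed F is large and the empirical p-value approaches its minimum of 1 / (n_perm + 1).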
Affiliation(s)
- Dean C. Adams
  - Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, Iowa, USA
5
Shao D, Kellogg GD, Nematbakhsh A, Kuntala PK, Mahony S, Pugh BF, Lai WKM. PEGR: a flexible management platform for reproducible epigenomic and genomic research. Genome Biol 2022; 23:99. [PMID: 35440038 PMCID: PMC9016988 DOI: 10.1186/s13059-022-02671-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2021] [Accepted: 04/07/2022] [Indexed: 11/27/2022] Open
Abstract
Reproducibility is a significant challenge in (epi)genomic research due to the complexity of experiments composed of traditional biochemistry and informatics. Recent advances have exacerbated this as high-throughput sequencing data is generated at an unprecedented pace. Here, we report the development of a Platform for Epi-Genomic Research (PEGR), a web-based project management platform that tracks and quality controls experiments from conception to publication-ready figures, compatible with multiple assays and bioinformatic pipelines. It supports rigor and reproducibility for biochemists working at the bench, while fully supporting reproducibility and reliability for bioinformaticians through integration with the Galaxy platform.
Affiliation(s)
- Danying Shao
  - Institute for Computational and Data Sciences, Pennsylvania State University, University Park, PA, 16802, USA
- Gretta D Kellogg
  - Cornell Institute of Biotechnology, Cornell University, Ithaca, NY, 14850, USA
- Ali Nematbakhsh
  - Cornell Institute of Biotechnology, Cornell University, Ithaca, NY, 14850, USA
- Prashant K Kuntala
  - Department of Biochemistry & Molecular Biology, Pennsylvania State University, University Park, PA, 16802, USA
- Shaun Mahony
  - Department of Biochemistry & Molecular Biology, Pennsylvania State University, University Park, PA, 16802, USA
- B Franklin Pugh
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, 14850, USA
- William K M Lai
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, 14850, USA
  - Department of Computational Biology, Cornell University, Ithaca, NY, 14850, USA
6
Jain S, Saxena A, Hesarur S, Bhadhadhara K, Bharti N, Kasibhatla SM, Sonavane U, Joshi R. GenoVault: a cloud based genomics repository. BioData Min 2021; 14:36. [PMID: 34325724 PMCID: PMC8319889 DOI: 10.1186/s13040-021-00268-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2020] [Accepted: 07/02/2021] [Indexed: 11/15/2022] Open
Abstract
GenoVault is a cloud-based repository for handling Next Generation Sequencing (NGS) data. It is developed using OpenStack-based private cloud with various services like keystone for authentication, cinder for block storage, neutron for networking and nova for managing compute instances for the Cloud. GenoVault uses object-based storage, which enables data to be stored as objects instead of files or blocks for faster retrieval from different distributed object nodes. Along with a web-based interface, a JavaFX-based desktop client has also been developed to meet the requirements of large file uploads that are usually seen in NGS datasets. Users can store files in their respective object-based storage areas and the metadata provided by the user during file uploads is used for querying the database. GenoVault repository is designed taking into account future needs and hence can scale both vertically and horizontally using OpenStack-based cloud features. Users have an option to make the data shareable to the public or restrict the access as private. Data security is ensured as every container is a separate entity in object-based storage architecture which is also supported by Secure File Transfer Protocol (SFTP) for data upload and download. The data is uploaded by the user in individual containers that include raw read files (fastq), processed alignment files (bam, sam, bed) and the output of variation detection (vcf). GenoVault architecture allows verification of the data in terms of integrity and authentication before making it available to collaborators as per the user’s permissions. GenoVault is useful for maintaining the organization-wide NGS data generated in various labs which is not yet published and submitted to public repositories like NCBI. GenoVault also provides support to share NGS data among the collaborating institutions. GenoVault can thus manage vast volumes of NGS data on any OpenStack-based private cloud.
Affiliation(s)
- Sankalp Jain
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
- Amit Saxena
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
- Suprit Hesarur
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
- Kirti Bhadhadhara
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
- Neeraj Bharti
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
- Uddhavesh Sonavane
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
- Rajendra Joshi
  - HPC-M&BA Group, Centre for Development of Advanced Computing (C-DAC), Pune, MH, 411008, India
7
Gutiérrez-Sacristán A, De Niz C, Kothari C, Kong SW, Mandl KD, Avillach P. GenoPheno: cataloging large-scale phenotypic and next-generation sequencing data within human datasets. Brief Bioinform 2021; 22:55-65. [PMID: 32249310 PMCID: PMC7820848 DOI: 10.1093/bib/bbaa033] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 12/10/2019] [Revised: 01/31/2020] [Indexed: 12/17/2022] Open
Abstract
Precision medicine promises to revolutionize treatment, shifting therapeutic approaches from the classical one-size-fits-all to those more tailored to the patient's individual genomic profile, lifestyle and environmental exposures. Yet, to advance precision medicine's main objective (ensuring the optimum diagnosis, treatment and prognosis for each individual), investigators need access to large-scale clinical and genomic data repositories. Despite the vast proliferation of these datasets, locating and obtaining access to many remains a challenge. We sought to provide an overview of available patient-level datasets that contain both genotypic data, obtained by next-generation sequencing, and phenotypic data, and to create a dynamic, online catalog for consultation, contribution and revision by the research community. Datasets included in this review conform to six specific inclusion criteria: they (i) contain data from more than 500 human subjects; (ii) contain both genotypic and phenotypic data from the same subjects; (iii) include whole genome sequencing or whole exome sequencing data; (iv) include at least 100 recorded phenotypic variables per subject; (v) are accessible through a website or collaboration with investigators; and (vi) make access information available in English. Using these criteria, we identified 30 datasets, reviewed them and provided results in the release version of a catalog, which is publicly available through a dynamic Web application and on GitHub. Users can review as well as contribute new datasets for inclusion (Web: https://avillachlab.shinyapps.io/genophenocatalog/; GitHub: https://github.com/hms-dbmi/GenoPheno-CatalogShiny).
Affiliation(s)
- Carlos De Niz
  - Department of Biomedical Informatics, Harvard Medical School
- Cartik Kothari
  - Department of Biomedical Informatics, Harvard Medical School
- Sek Won Kong
  - Department of Biomedical Informatics, Harvard Medical School; Computational Health Informatics Program, Boston Children's Hospital
- Kenneth D Mandl
  - Department of Biomedical Informatics, Harvard Medical School; Computational Health Informatics Program, Boston Children's Hospital
- Paul Avillach
  - Department of Biomedical Informatics, Harvard Medical School; Computational Health Informatics Program, Boston Children's Hospital
8
Kautsar SA, van der Hooft JJJ, de Ridder D, Medema MH. BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters. Gigascience 2021; 10:giaa154. [PMID: 33438731 PMCID: PMC7804863 DOI: 10.1093/gigascience/giaa154] [Citation(s) in RCA: 76] [Impact Index Per Article: 25.3] [Received: 08/26/2020] [Revised: 10/29/2020] [Accepted: 11/29/2020] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Genome mining for biosynthetic gene clusters (BGCs) has become an integral part of natural product discovery. The >200,000 microbial genomes now publicly available hold information on abundant novel chemistry. One way to navigate this vast genomic diversity is through comparative analysis of homologous BGCs, which allows identification of cross-species patterns that can be matched to the presence of metabolites or biological activities. However, current tools are hindered by a bottleneck caused by the expensive network-based approach used to group these BGCs into gene cluster families (GCFs). RESULTS Here, we introduce BiG-SLiCE, a tool designed to cluster massive numbers of BGCs. By representing them in Euclidean space, BiG-SLiCE can group BGCs into GCFs in a non-pairwise, near-linear fashion. We used BiG-SLiCE to analyze 1,225,071 BGCs collected from 209,206 publicly available microbial genomes and metagenome-assembled genomes within 10 days on a typical 36-core CPU server. We demonstrate the utility of such analyses by reconstructing a global map of secondary metabolic diversity across taxonomy to identify uncharted biosynthetic potential. BiG-SLiCE also provides a "query mode" that can efficiently place newly sequenced BGCs into previously computed GCFs, plus a powerful output visualization engine that facilitates user-friendly data exploration. CONCLUSIONS BiG-SLiCE opens up new possibilities to accelerate natural product discovery and offers a first step towards constructing a global and searchable interconnected network of BGCs. As more genomes are sequenced from understudied taxa, more information can be mined to highlight their potentially novel chemistry. BiG-SLiCE is available via https://github.com/medema-group/bigslice.
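To illustrate why grouping items that live in Euclidean space can avoid all-against-all comparison, the sketch below uses a simple single-pass, centroid-based ("leader") clustering: each vector is compared only with the current cluster centroids, not with every other vector. This is a stand-in technique chosen for brevity and should not be read as BiG-SLiCE's actual algorithm; the function name, the threshold value, and the toy vectors are hypothetical.

```python
import math


def assign_to_centroids(vectors, threshold):
    """Single-pass clustering: compare each vector only to current centroids."""
    centroids = []  # running mean vector per cluster
    counts = []     # number of members per cluster
    labels = []     # cluster index assigned to each input vector
    for vec in vectors:
        best, best_dist = None, math.inf
        for idx, centroid in enumerate(centroids):
            d = math.dist(vec, centroid)
            if d < best_dist:
                best, best_dist = idx, d
        if best is None or best_dist > threshold:
            # No centroid is close enough: start a new cluster.
            centroids.append(list(vec))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the running mean of the chosen cluster.
            counts[best] += 1
            n = counts[best]
            centroids[best] = [c + (x - c) / n for c, x in zip(centroids[best], vec)]
            labels.append(best)
    return labels, centroids
```

The cost grows with the number of vectors times the number of clusters rather than with the square of the number of vectors. Placing a newly observed vector into an existing family, in the spirit of the "query mode" mentioned above, amounts to a nearest-centroid lookup against the stored centroids.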
Affiliation(s)
- Satria A Kautsar
  - Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
- Justin J J van der Hooft
  - Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
- Dick de Ridder
  - Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
- Marnix H Medema
  - Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
9
Fahy S, O'Connor J, O'Brien D, Fitzpatrick L, O'Connor M, Crowley J, Bernard M, Sleator R, Lucey B. Carbapenemase screening in an Irish tertiary referral hospital: Best practice, or can we do better? Infect Prev Pract 2020; 2:100100. [PMID: 34368728 PMCID: PMC8335925 DOI: 10.1016/j.infpip.2020.100100] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/17/2020] [Accepted: 10/26/2020] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND Carbapenems are a family of end line antibiotics with increasing levels of resistance that are a cause for concern. AIM To ascertain whether the CPE screening programme employed in an acute tertiary hospital is fit for purpose. METHOD We outlined the current working algorithm employed using a universal screening programme over a 26-month screening period. Rectal swabs are cultured on arrival. Those with suspicious growth are further investigated using NG-Carba 5 lateral flow tests and Vitek 2.0 sensitivity cards. These practices were compared with NHS guidelines. FINDINGS & CONCLUSIONS In all, 53 true positives were detected from 45 patients since the screening was implemented in early 2018 (46 OXA-48, 6 KPC, 1 NDM). As the rate of screening increased, the number of positive screens decreased over time. There were a lot of similarities between the HSE guidelines and the published NHS CPE toolkit. It was evident that there is no standard practice being employed across all hospitals. Comparing the MUH to national guidelines it appears to be quicker and more effective with universal screening in place at reducing the potential contacts and identifying carriers. Cost analysis indicates that the need to confirm all positive strains in a reference lab is costly, unnecessary and time consuming. There are adequate confirmatory tests available in-house for routine positive screens. It was concluded that infection prevention and control are key to identifying and controlling possible outbreaks in a hospital setting.
Affiliation(s)
- S. Fahy
  - Department of Clinical Microbiology, Mercy University Hospital, Cork, Ireland
  - Department of Biological Sciences, Cork Institute of Technology, Bishopstown, Cork, Ireland
- J.A. O'Connor
  - Department of Clinical Microbiology, Mercy University Hospital, Cork, Ireland
- D. O'Brien
  - Department of Clinical Microbiology, Mercy University Hospital, Cork, Ireland
- L. Fitzpatrick
  - Department of Clinical Microbiology, Mercy University Hospital, Cork, Ireland
- M. O'Connor
  - Infection Prevention & Control Department, Mercy University Hospital, Cork, Ireland
- J. Crowley
  - Infection Prevention & Control Department, Mercy University Hospital, Cork, Ireland
- M. Bernard
  - Infection Prevention & Control Department, Mercy University Hospital, Cork, Ireland
- R.D. Sleator
  - Department of Biological Sciences, Cork Institute of Technology, Bishopstown, Cork, Ireland
- B. Lucey
  - Department of Biological Sciences, Cork Institute of Technology, Bishopstown, Cork, Ireland
10
Vlachakis D, Papageorgiou L, Papadaki A, Georga M, Kossida S, Eliopoulos E. An updated evolutionary study of the Notch family reveals a new ancient origin and novel invariable motifs as potential pharmacological targets. PeerJ 2020; 8:e10334. [PMID: 33194454 PMCID: PMC7649014 DOI: 10.7717/peerj.10334] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 06/05/2020] [Accepted: 10/19/2020] [Indexed: 01/02/2023] Open
Abstract
Notch family proteins play a key role in a variety of developmental processes by controlling cell fate decisions and operating in a great number of biological processes in several organ systems, such as hematopoiesis, somatogenesis, vasculogenesis, neurogenesis and homeostasis. The Notch signaling pathway is crucial for the majority of developmental programs and regulates multiple pathogenic processes. Activation of Notch family receptors has been largely related to its multiple effects in sustaining oncogenesis. The Notch signaling pathway constitutes an ancient and conserved mechanism for cell-to-cell communication. Much of what is known about the function of Notch family proteins comes from studies done in Caenorhabditis elegans and Drosophila melanogaster. Although human Notch homologs have also been identified, the molecular mechanisms that modulate the Notch signaling pathway remain substantially unknown. In this study, an updated evolutionary analysis of the Notch family members among 603 different organisms of all kingdoms, from bacteria to humans, was performed in order to discover key regions that have been conserved throughout evolution and play a major role in the Notch signaling pathway. The major goal of this study is the presentation of a novel updated phylogenetic tree for the Notch family as a reliable phylogeny "map", in order to correlate information of the closely related members and identify new possible pharmacological targets that can be used in pathogenic cases, including cancer.
Affiliation(s)
- Dimitrios Vlachakis
  - Laboratory of Genetics, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, Athens, Greece
  - University Research Institute of Maternal and Child Health & Precision Medicine, and UNESCO Chair on Adolescent Health Care, “Aghia Sophia” Children’s Hospital, National and Kapodistrian University of Athens, Athens, Greece
  - Division of Endocrinology and Metabolism, Center of Clinical, Experimental Surgery and Translational Research, Biomedical Research Foundation of the Academy of Athens, Athens, Greece
- Louis Papageorgiou
  - Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens, Greece
- Ariadne Papadaki
  - Laboratory of Genetics, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, Athens, Greece
- Maria Georga
  - Laboratory of Genetics, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, Athens, Greece
- Sofia Kossida
  - IMGT, The International ImMunoGeneTics Information System, Université de Montpellier, Laboratoire d’ImmunoGénétique Moléculaire and Institut de Génétique Humaine, University of Montpellier, Montpellier, France
- Elias Eliopoulos
  - Laboratory of Genetics, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, Athens, Greece
11
Shao D, Kellogg G, Mahony S, Lai W, Pugh BF. PEGR: a management platform for ChIP-based next generation sequencing pipelines. PEARC20 : PRACTICE AND EXPERIENCE IN ADVANCED RESEARCH COMPUTING 2020 : CATCH THE WAVE : JULY 27-31, 2020, PORTLAND, OR VIRTUAL CONFERENCE. PRACTICE AND EXPERIENCE IN ADVANCED RESEARCH COMPUTING (CONFERENCE) (2020 : ONLINE) 2020; 2020:285-292. [PMID: 35662897 PMCID: PMC9161112 DOI: 10.1145/3311790.3396621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/28/2023]
Abstract
There has been a rapid development in genome sequencing, including high-throughput next generation sequencing (NGS) technologies, automation in biological experiments, new bioinformatics tools and utilization of high-performance computing and cloud computing. ChIP-based NGS technologies, e.g. ChIP-seq and ChIP-exo, are widely used to detect the binding sites of DNA-interacting proteins in the genome and help us to have a deeper mechanistic understanding of genomic regulation. As sequencing data is generated at an unprecedented pace from the ChIP-based NGS pipelines, there is an urgent need for a metadata management system. To meet this need, we developed the Platform for Eukaryotic Genomic Regulation (PEGR), a web service platform that logs metadata for samples and sequencing experiments, manages the data processing workflows, and provides reporting and visualization. PEGR links together people, samples, protocols, DNA sequencers and bioinformatics computation. With the help of PEGR, scientists can have a more integrated understanding of the sequencing data and better understand the scientific mechanisms of genomic regulation. In this paper, we present the architecture and the major functionalities of PEGR. We also share our experience in developing this application and discuss the future directions.
Affiliation(s)
- Danying Shao
  - Pennsylvania State University, University Park, Pennsylvania
- Gretta Kellogg
  - Pennsylvania State University, University Park, Pennsylvania
- Shaun Mahony
  - Pennsylvania State University, University Park, Pennsylvania
- William Lai
  - Pennsylvania State University, University Park, Pennsylvania
- B Franklin Pugh
  - Pennsylvania State University, University Park, Pennsylvania
12
Chen T, Tyagi S. Integrative computational epigenomics to build data-driven gene regulation hypotheses. Gigascience 2020; 9:giaa064. [PMID: 32543653 PMCID: PMC7297091 DOI: 10.1093/gigascience/giaa064] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/25/2020] [Revised: 05/25/2020] [Accepted: 05/26/2020] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Diseases are complex phenotypes often arising as an emergent property of a non-linear network of genetic and epigenetic interactions. To translate this resulting state into a causal relationship with a subset of regulatory features, many experiments deploy an array of laboratory assays from multiple modalities. Often, each of these resulting datasets is large, heterogeneous, and noisy. Thus, it is non-trivial to unify these complex datasets into an interpretable phenotype. Although recent methods address this problem with varying degrees of success, they are constrained by their scopes or limitations. Therefore, an important gap in the field is the lack of a universal data harmonizer with the capability to arbitrarily integrate multi-modal datasets. RESULTS In this review, we perform a critical analysis of methods with the explicit aim of harmonizing data, as opposed to case-specific integration. This revealed that matrix factorization, latent variable analysis, and deep learning are potent strategies. Finally, we describe the properties of an ideal universal data harmonization framework. CONCLUSIONS A sufficiently advanced universal harmonizer has major medical implications: (1) identifying dysregulated biological pathways responsible for a disease is a powerful diagnostic tool; (2) investigating these pathways further allows the biological community to better understand a disease's mechanisms; and (3) precision medicine also benefits from developments in this area, particularly in the context of the growing field of selective epigenome editing, which can suppress or induce a desired phenotype.
Affiliation(s)
- Tyrone Chen
- 25 Rainforest Walk, School of Biological Sciences, Monash University, Clayton, VIC 3800, Australia
- Sonika Tyagi
- 25 Rainforest Walk, School of Biological Sciences, Monash University, Clayton, VIC 3800, Australia
13
Hubbard A, Bomhoff M, Schmidt CJ. fRNAkenseq: a fully powered-by-CyVerse cloud integrated RNA-sequencing analysis tool. PeerJ 2020; 8:e8592. [PMID: 32461821 PMCID: PMC7231498 DOI: 10.7717/peerj.8592] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/11/2019] [Accepted: 01/18/2020] [Indexed: 11/20/2022] Open
Abstract
Background Decreasing costs make RNA sequencing technologies increasingly affordable for biologists. However, many researchers who can now afford sequencing lack access to resources necessary for downstream analysis. This means that even as algorithms to process RNA-Seq data improve, many biologists still struggle to manage the sheer volume of data produced by next generation sequencing (NGS) technologies. Scalable bioinformatics tools that exploit multiple platforms are needed to democratize bioinformatics resources in the sequencing era. This is essential for equipping many research groups in the life sciences with the tools to process the increasingly unwieldy datasets they produce. Methods One strategy to address this challenge is to develop a modern generation of sequence analysis tools capable of seamless data sharing and communication. Such tools will provide interoperability through offerings of interlinked resources. Systems of interlinked, scalable resources, which often incorporate cloud data storage, are broadly referred to as cyberinfrastructure. Cyberinfrastructure integrated tools will help researchers to robustly analyze large scale datasets by efficiently sharing data burdens across a distributed architecture. Additionally, interoperability will allow emerging tools to cross-adapt features of existing tools. It is important that these tools are designed to be easy to use for biologists. Results We introduce fRNAkenseq, a powered-by-CyVerse RNA sequencing analysis tool that exhibits interoperability with other resources and meets the needs of biologists for comprehensive, easy to use RNA sequencing analysis. fRNAkenseq leverages a complex set of Application Programming Interfaces (APIs) associated with the NSF-funded cyberinfrastructure project, CyVerse, to execute FASTQ-to-differential expression RNA-Seq analyses. 
Integrating across bioinformatics platforms, fRNAkenseq also exploits cloud integration and cross-talk with another CyVerse-associated tool, CoGe. fRNAkenseq offers novel features for the biologist, such as more robust and comprehensive pipelines for enrichment than those currently available by default in a single tool, whether cloud-based or locally installed. Importantly, cross-talk with CoGe allows fRNAkenseq users to execute RNA-Seq pipelines on an inventory of 47,000 archived genomes stored in CoGe or to upload their own draft genome.
Affiliation(s)
- Allen Hubbard
- Donald Danforth Plant Science Center, Saint Louis, MO, USA
- Matthew Bomhoff
- Department of Plant and Soil Sciences, University of Arizona, Tucson, AZ, USA
- Carl J Schmidt
- Department of Animal and Food Sciences, University of Delaware, Newark, DE, USA
14
Goh WWB, Wong L. The Birth of Bio-data Science: Trends, Expectations, and Applications. Genomics, Proteomics & Bioinformatics 2020; 18:5-15. [PMID: 32428604 PMCID: PMC7393550 DOI: 10.1016/j.gpb.2020.01.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/31/2019] [Revised: 12/02/2019] [Accepted: 02/26/2020] [Indexed: 12/23/2022]
Affiliation(s)
- Wilson Wen Bin Goh
- School of Biological Sciences, Nanyang Technological University, Singapore 637551, Singapore
- Limsoon Wong
- Department of Computer Science, National University of Singapore, Singapore 117417, Singapore
15
Shi L, Wang Z. Computational Strategies for Scalable Genomics Analysis. Genes (Basel) 2019; 10:E1017. [PMID: 31817630 PMCID: PMC6947637 DOI: 10.3390/genes10121017] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 10/10/2019] [Revised: 12/01/2019] [Accepted: 12/03/2019] [Indexed: 12/14/2022] Open
Abstract
The revolution in next-generation DNA sequencing technologies is leading to explosive data growth in genomics, posing a significant challenge to the computing infrastructure and software algorithms for genomics analysis. Various big data technologies have been explored to scale up/out current bioinformatics solutions to mine the big genomics data. In this review, we survey some of these exciting developments in the applications of parallel distributed computing and special hardware to genomics. We comment on the pros and cons of each strategy in the context of ease of development, robustness, scalability, and efficiency. Although this review is written for an audience from the genomics and bioinformatics fields, it may also be informative for the audience of computer science with interests in genomics applications.
Affiliation(s)
- Lizhen Shi
- Department of Computer Science, Florida State University, Tallahassee, FL 32304, USA
- Zhong Wang
- US Department of Energy, Joint Genome Institute, Walnut Creek, CA 94598, USA
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- School of Natural Sciences, University of California at Merced, Merced, CA 95343, USA
16
Suranova TG, Suvorov GN. [Storage, access and protection of full genome sequencing data in Russia and foreign countries: practical aspect.]. Klin Lab Diagn 2019; 64:578-584. [PMID: 31610112 DOI: 10.18821/0869-2084-2019-64-9-578-584] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 09/15/2019] [Accepted: 09/20/2019] [Indexed: 11/17/2022]
Abstract
This topic is relevant because of the need to resolve legal problems concerning the observance of human and civil rights and freedoms when storing, accessing, and protecting full genome sequencing data. The purpose of this study is to formulate conceptual criteria on which a new model of legal regulation for this sphere of public relations can be built. To this end, the regulatory legal acts in force in Russia and a number of foreign countries were studied, using general scientific, specific scientific, and special methods of scientific knowledge (system-structural, formal-legal). To formulate conceptual criteria of practical importance for storing, accessing, and protecting genome-wide sequencing data in Russia and foreign countries, the authors propose developing clarifying characteristics, or a gradation, of human and civil rights and freedoms in the context of realizing public and state interests. It is also necessary to unify the conceptual apparatus of normative acts, taking into account the peculiarities of genetic information, to work out the procedure for accessing data, and to provide for a system for its depersonalization. For the first time, the authors substantiate the need to transform the content of the state-declared human rights to life, freedom, personal and family secrets, and others in light of the development of new technologies in the field of DNA scanning. The basic criteria of practical importance for the storage, access, and protection of genome-wide sequencing data indicate the need to improve normative concepts, to establish categories of persons with the right to access such data, to fix in law the conditions for conducting an anonymous survey and for declining to learn the results, and to develop mechanisms for the depersonalization of the obtained genetic information.
Affiliation(s)
- T G Suranova
- Federal Research and Clinical Center of the Federal Medical-Biological Agency, 125371, Moscow, Russia
- G N Suvorov
- Federal Research and Clinical Center of the Federal Medical-Biological Agency, 125371, Moscow, Russia