1
Zhang T, Yin Z, Xu X, Yan L, Zhu F, Duan X, Schmidt B, Liu W. RabbitSketch: a high-performance sketching library for genome analysis. Bioinformatics 2025; 41:btaf249. [PMID: 40286290 PMCID: PMC12054975 DOI: 10.1093/bioinformatics/btaf249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2025] [Revised: 03/31/2025] [Accepted: 04/24/2025] [Indexed: 04/29/2025]
Abstract
SUMMARY: We present RabbitSketch, a highly optimized library of sketching algorithms, including MinHash, OrderMinHash, and HyperLogLog, that exploits the power of modern multi-core CPUs. It provides speedups of 2.30× to 49.55× over existing implementations, together with flexible and easy-to-use interfaces for both Python and C++. As a result, similarity analysis of 455 GB of genomic data can be completed in only 5 minutes using RabbitSketch with merely 20 lines of Python code. As a case study, we enhanced RabbitTClust by integrating RabbitSketch's Kssd algorithm, resulting in a 1.54× speedup with no loss in accuracy. AVAILABILITY AND IMPLEMENTATION: RabbitSketch is available at https://github.com/RabbitBio/RabbitSketch with an archived version at Zenodo: https://doi.org/10.5281/zenodo.14903962. Detailed API documentation is available at https://rabbitsketch.readthedocs.io/en/latest.
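To make the sketching idea concrete, here is a minimal, generic bottom-s MinHash sketch in plain Python. It is not RabbitSketch's API or implementation: the function names (`kmers`, `hash64`, `minhash_sketch`, `jaccard_estimate`), the k-mer length, the sketch size, and the use of BLAKE2b as the hash function are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k=5, size=64):
    """Bottom-s sketch: the `size` smallest hash values over all k-mers."""
    return sorted(hash64(km) for km in kmers(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=64):
    """Estimate the Jaccard index of two k-mer sets from their sketches.

    Take the `size` smallest hashes of the union and count how many of
    them occur in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    a = minhash_sketch("ACGTACGTGACCTGA")
    b = minhash_sketch("ACGTACGTGACCTGG")
    print(jaccard_estimate(a, a))  # identical inputs give 1.0
    print(jaccard_estimate(a, b))  # a value between 0.0 and 1.0
```

Comparing two fixed-size sketches costs time proportional to the sketch size rather than to the genome length, which is what makes sketch-based similarity analysis feasible on large collections.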
Affiliation(s)
- Tong Zhang
  - School of Software, Shandong University, Jinan 250101, China
- Zekun Yin
  - School of Software, Shandong University, Jinan 250101, China
- Xiaoming Xu
  - School of Software, Shandong University, Jinan 250101, China
- Lifeng Yan
  - School of Software, Shandong University, Jinan 250101, China
- Fangjin Zhu
  - School of Software, Shandong University, Jinan 250101, China
- Xiaohui Duan
  - School of Software, Shandong University, Jinan 250101, China
- Bertil Schmidt
  - Institute for Computer Science, Johannes Gutenberg University, Mainz 55128, Germany
- Weiguo Liu
  - School of Software, Shandong University, Jinan 250101, China
2
Wang A. Conceptual breakthroughs of the long noncoding RNA functional system and its endogenous regulatory role in the cancerous regime. Exploration of Targeted Anti-tumor Therapy 2024; 5:170-186. [PMID: 38464381 PMCID: PMC10918237 DOI: 10.37349/etat.2024.00211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Accepted: 12/18/2023] [Indexed: 03/12/2024]
Abstract
Long noncoding RNAs (lncRNAs) derived from noncoding regions of the human genome were once regarded as junk with no biological significance, but recent studies have shown that these molecules are highly functional, prompting an explosion of studies on their biology. However, these efforts have only begun to recognize the biological significance of a small fraction (< 1%) of lncRNAs. The basic concept of lncRNA function remains controversial. This controversy arises primarily from conventional, biased observations based on limited datasets. Fortunately, emerging big data offers a promising path around this bias toward an unbiased big picture of lncRNA biology and its fundamental principles. This review focuses on big data studies that have advanced the critical concepts of the lncRNA functional system and its endogenous regulatory roles in all cancers. lncRNAs have unique functional systems distinct from those of proteins, such as transcriptional initiation and regulation, and they abundantly interact with mitochondria and consume less energy. lncRNAs, rather than proteins as traditionally thought, function as the most critical endogenous regulators of all cancers. lncRNAs regulate the cancer regulatory regime by governing the endogenous regulatory network of all cancers. This is accomplished by dominating the regulatory network module and serving as key hubs and top inducers. These conceptual breakthroughs lay a blueprint for a comprehensive functional picture of the human genome and for combating human diseases that are regulated by lncRNAs.
Affiliation(s)
- Anyou Wang
  - Feinstone Center for Genomic Research, University of Memphis, Memphis, TN 38152, USA
3
Correia K, Walker R, Pittenger C, Fields C. A comparison of machine learning methods for quantifying self-grooming behavior in mice. Front Behav Neurosci 2024; 18:1340357. [PMID: 38347909 PMCID: PMC10859524 DOI: 10.3389/fnbeh.2024.1340357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Accepted: 01/10/2024] [Indexed: 02/15/2024]
Abstract
Background: As machine learning technology continues to advance and the need for standardized behavioral quantification grows, commercial and open-source automated behavioral analysis tools are gaining prominence in behavioral neuroscience. We compare three behavioral analysis pipelines, namely DeepLabCut (DLC) with Simple Behavioral Analysis (SimBA), HomeCageScan (HCS), and manual scoring, in measuring repetitive self-grooming in mice. Methods: Grooming behavior of mice was recorded at baseline and after water spray or restraint treatments. Videos were processed and analyzed in parallel using the three methods (DLC/SimBA, HCS, and manual scoring), quantifying both the total number of grooming bouts and the total grooming duration. Results: Both treatment conditions (water spray and restraint) resulted in significant elevation of both total grooming duration and number of grooming bouts. HCS measures of grooming duration were significantly elevated relative to those derived from manual scoring; specifically, HCS tended to overestimate duration at low levels of grooming. DLC/SimBA duration measurements were not significantly different from those derived from manual scoring. However, both SimBA and HCS measures of the number of grooming bouts were significantly different from those derived from manual scoring; the magnitude and direction of the difference depended on treatment condition. Conclusion: DLC/SimBA provides a high-throughput pipeline for quantifying grooming duration that correlates well with manual scoring. However, grooming bout data derived from both DLC/SimBA and HCS did not reliably estimate measures obtained via manual scoring.
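The two outcome measures named in the abstract, number of bouts and total duration, can be illustrated with a short sketch. It assumes a per-frame binary label sequence (1 = grooming) and a known frame rate. The function name `bouts_and_duration` and this input format are assumptions for illustration and do not reproduce the DLC/SimBA, HCS, or manual scoring pipelines.

```python
def bouts_and_duration(frames, fps):
    """Summarize per-frame grooming labels.

    frames: sequence of 0/1 labels, one per video frame (1 = grooming).
    fps: frames per second of the recording.
    Returns (number_of_bouts, total_duration_in_seconds).
    """
    bouts = 0
    previous = 0
    for label in frames:
        # A bout starts at every 0 -> 1 transition.
        if label and not previous:
            bouts += 1
        previous = label
    return bouts, sum(frames) / fps


if __name__ == "__main__":
    print(bouts_and_duration([1, 1, 1, 1, 1], fps=10))  # (1, 0.5)
    # Flipping a single frame splits the bout but barely changes duration.
    print(bouts_and_duration([1, 1, 0, 1, 1], fps=10))  # (2, 0.4)
```

In this toy example a single mislabeled frame doubles the bout count while the duration changes by one frame. This is a general property of the two metrics and is offered only as one possible intuition, not as a finding stated in the abstract.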
Affiliation(s)
- Kassi Correia
  - Department of Psychiatry, Yale School of Medicine, Yale University, New Haven, CT, United States
- Raegan Walker
  - Department of Psychiatry, Yale School of Medicine, Yale University, New Haven, CT, United States
- Christopher Fields
  - Department of Psychiatry, Yale School of Medicine, New Haven, CT, United States
4
Aslam I, Shah S, Jabeen S, ELAffendi M, A Abdel Latif A, Ul Haq N, Ali G. A CNN based m5c RNA methylation predictor. Sci Rep 2023; 13:21885. [PMID: 38081880 PMCID: PMC10713599 DOI: 10.1038/s41598-023-48751-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2023] [Accepted: 11/29/2023] [Indexed: 12/18/2023]
Abstract
Post-transcriptional modifications of RNA play a key role in a variety of biological processes, such as stability and immune tolerance, RNA splicing, protein translation, and RNA degradation. One of these modifications, m5c, participates in cellular functions such as RNA structural stability and translation efficiency and has become popular among biologists. Detecting RNA m5c methylation sites through biological experiments requires considerable effort, time, and money. Most researchers use pre-processed RNA sequences of 41 nucleotides in which the methylated cytosine is at the center. It is therefore possible that some of the information around these motifs is lost. Conventional methods cannot process the RNA sequence directly because of its high dimensionality and thus need optimized techniques for better feature extraction. To address these challenges, this study employs an end-to-end, 1D CNN-based model to classify and interpret m5c methylated sites. Moreover, we aim to analyze the sequence in its full length, where the methylated cytosine may not be at the center. The evaluation of the proposed architecture showed promising results, outperforming state-of-the-art techniques in terms of sensitivity and accuracy. Our model achieves 96.70% sensitivity and 96.21% accuracy for 41-nucleotide sequences and 96.10% accuracy for full-length sequences.
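The 41-nucleotide input format described above (a cytosine with 20 flanking bases on each side) and a typical numeric encoding for a 1D CNN can be sketched as follows. The helper names `centered_window` and `one_hot`, the `N` padding symbol, and the A/C/G/U channel order are assumptions for illustration. The abstract does not specify the paper's preprocessing or encoding.

```python
ALPHABET = "ACGU"  # channel order is an arbitrary, illustrative choice


def centered_window(seq, center, flank=20, pad="N"):
    """Return the 2*flank+1 window around seq[center], padded at the ends."""
    left = center - flank
    right = center + flank + 1
    prefix = pad * max(0, -left)
    suffix = pad * max(0, right - len(seq))
    return prefix + seq[max(0, left):right] + suffix


def one_hot(seq):
    """Encode each base as a 4-element 0/1 vector; unknown bases map to zeros."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


if __name__ == "__main__":
    rna = "AUGGCUACGUACGGAUCCGAUUACGCAUGCAUGGCUAGCUAGGCAUCGAUCG"
    site = rna.index("C")           # first cytosine, for demonstration only
    window = centered_window(rna, site)
    encoded = one_hot(window)       # 41 rows x 4 columns, ready for a 1D CNN
    print(len(window), window[20], len(encoded), len(encoded[0]))
```

A full-length model, as the abstract proposes, would skip the windowing step and encode the whole sequence, so the cytosine of interest need not sit at index 20.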
Affiliation(s)
- Irum Aslam
  - Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, 22060, KPK, Pakistan
- Sajid Shah
  - EIAS Data Science Lab, College of Computer and Information Sciences, Prince Sultan University, Rafha, Riyadh, 12435, Saudi Arabia
- Saima Jabeen
  - College of Engineering, AI Research Center, Alfaisal University, Riyadh, 50927, Saudi Arabia
- Mohammed ELAffendi
  - EIAS Data Science Lab, College of Computer and Information Sciences, Prince Sultan University, Rafha, Riyadh, 12435, Saudi Arabia
- Asmaa A Abdel Latif
  - Public Health and Community Medicine Department (Industrial medicine and occupational health specialty), Faculty of Medicine, Menoufia University, Shibîn el Kôm, Egypt
- Nuhman Ul Haq
  - Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, 22060, KPK, Pakistan
- Gauhar Ali
  - EIAS Data Science Lab, College of Computer and Information Sciences, Prince Sultan University, Rafha, Riyadh, 12435, Saudi Arabia
5
Kowal K, Tkaczyk-Wlizło A, Jusiak M, Grzybowska-Szatkowska L, Ślaska B. Canis MitoSNP database: a functional tool useful for comparative analyses of human and canine mitochondrial genomes. J Appl Genet 2023; 64:515-520. [PMID: 37351774 PMCID: PMC10457218 DOI: 10.1007/s13353-023-00764-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2023] [Revised: 04/21/2023] [Accepted: 04/22/2023] [Indexed: 06/24/2023]
Abstract
Canis MitoSNP is a tool that assigns to each mitochondrial genomic position a corresponding position in the mitochondrial gene and in the structure of the tRNA, rRNA, or protein. The main aim of this bioinformatic tool was to use data from other bioinformatic tools (TMHMM, SOPMA, tRNA-SCAN, RNAfold, ConSurf) for dog and human mitochondrial genes in order to shorten the time needed for whole-genome single nucleotide polymorphism (SNP) analysis as well as amino acid and protein analyses. Each position in the canine mitochondrial genome is assigned a position in genes, in codons, an amino acid position in proteins, or a position in tRNA or rRNA molecules. Therefore, a user analysing changes in the canine and human mitochondrial genomes does not need to extract the sequences of individual genes from the mitochondrial genome, and there is no need to translate them into amino acid sequences to assess whether a change is synonymous or nonsynonymous. Canis MitoSNP also allows comparison between the human and canine mitochondrial genomes. A Clustal W alignment of the dog and human mitochondrial DNA reference sequences for each gene obtained from GenBank (NC_002008.4 dog, NC_012920.1 human) was performed in order to determine which position in the canine mitochondrial genome corresponds to which position in the human mitochondrial genome. This function may be useful for comparative analyses. The tool is available at: https://canismitosnp.pl.
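The kind of lookup the tool automates, from a genome position to a position in a gene, a codon, and an amino acid, reduces to simple arithmetic for a forward-strand protein-coding gene. The sketch below is a hypothetical illustration: the function name `locate_in_gene` and the coordinates in the example are invented and are not taken from Canis MitoSNP or from the GenBank records. It uses 1-based inclusive coordinates and ignores reverse-strand genes and incomplete stop codons.

```python
def locate_in_gene(genome_pos, gene_start, gene_end):
    """Map a 1-based genome position to gene, codon, and amino acid positions.

    Assumes a forward-strand protein-coding gene spanning
    gene_start..gene_end (inclusive) whose first base starts codon 1.
    """
    if not gene_start <= genome_pos <= gene_end:
        raise ValueError("position lies outside the gene")
    offset = genome_pos - gene_start          # 0-based offset within the gene
    return {
        "gene_pos": offset + 1,               # 1-based position in the gene
        "codon": offset // 3 + 1,             # 1-based codon number
        "codon_pos": offset % 3 + 1,          # 1, 2, or 3 within the codon
        "amino_acid": offset // 3 + 1,        # same index as the codon
    }


if __name__ == "__main__":
    # Hypothetical gene occupying genome positions 100..400.
    print(locate_in_gene(105, gene_start=100, gene_end=400))
    # {'gene_pos': 6, 'codon': 2, 'codon_pos': 3, 'amino_acid': 2}
```

Deciding whether a change is synonymous additionally requires translating the reference and variant codons with the vertebrate mitochondrial genetic code, which is omitted here.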
Affiliation(s)
- Krzysztof Kowal
  - Institute of Biological Bases of Animal Production, University of Life Sciences in Lublin, Akademicka 13 St., 20-950, Lublin, Poland
- Angelika Tkaczyk-Wlizło
  - Institute of Biological Bases of Animal Production, University of Life Sciences in Lublin, Akademicka 13 St., 20-950, Lublin, Poland
- Marcin Jusiak
  - Institute of Biological Bases of Animal Production, University of Life Sciences in Lublin, Akademicka 13 St., 20-950, Lublin, Poland
- Brygida Ślaska
  - Institute of Biological Bases of Animal Production, University of Life Sciences in Lublin, Akademicka 13 St., 20-950, Lublin, Poland
6
Mayo KR, Basford MA, Carroll RJ, Dillon M, Fullen H, Leung J, Master H, Rura S, Sulieman L, Kennedy N, Banks E, Bernick D, Gauchan A, Lichtenstein L, Mapes BM, Marginean K, Nyemba SL, Ramirez A, Rotundo C, Wolfe K, Xia W, Azuine RE, Cronin RM, Denny JC, Kho A, Lunt C, Malin B, Natarajan K, Wilkins CH, Xu H, Hripcsak G, Roden DM, Philippakis AA, Glazer D, Harris PA. The All of Us Data and Research Center: Creating a Secure, Scalable, and Sustainable Ecosystem for Biomedical Research. Annu Rev Biomed Data Sci 2023; 6:443-464. [PMID: 37561600 PMCID: PMC11157478 DOI: 10.1146/annurev-biodatasci-122120-104825] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Indexed: 08/12/2023]
Abstract
The All of Us Research Program's Data and Research Center (DRC) was established to help acquire, curate, and provide access to one of the world's largest and most diverse datasets for precision medicine research. Already, over 500,000 participants are enrolled in All of Us, 80% of whom are underrepresented in biomedical research, and data are being analyzed by a community of over 2,300 researchers. The DRC created this thriving data ecosystem by collaborating with engaged participants, innovative program partners, and empowered researchers. In this review, we first describe how the DRC is organized to meet the needs of this broad group of stakeholders. We then outline guiding principles, common challenges, and innovative approaches used to build the All of Us data ecosystem. Finally, we share lessons learned to help others navigate important decisions and trade-offs in building a modern biomedical data platform.
Affiliation(s)
- Kelsey R Mayo
  - Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Melissa A Basford
  - Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Robert J Carroll
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Moira Dillon
  - Verily Life Sciences, South San Francisco, California, USA
- Heather Fullen
  - Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Jesse Leung
  - Verily Life Sciences, South San Francisco, California, USA
- Hiral Master
  - Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Shimon Rura
  - Verily Life Sciences, South San Francisco, California, USA
- Lina Sulieman
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Nan Kennedy
  - Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Eric Banks
  - Data Sciences Platform, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
- David Bernick
  - Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
- Asmita Gauchan
  - Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Lee Lichtenstein
  - Data Sciences Platform, Broad Institute of Harvard and MIT, Cambridge, Massachusetts, USA
- Brandy M Mapes
  - Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Kayla Marginean
  - Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Steve L Nyemba
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Andrea Ramirez
  - The All of Us Research Program, National Institutes of Health, Bethesda, Maryland, USA
- Charissa Rotundo
  - Vanderbilt University Medical Center Enterprise Cybersecurity, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Keri Wolfe
  - Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Weiyi Xia
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Romuladus E Azuine
  - The All of Us Research Program, National Institutes of Health, Bethesda, Maryland, USA
- Robert M Cronin
  - Department of Internal Medicine, The Ohio State University, Columbus, Ohio, USA
- Joshua C Denny
  - The All of Us Research Program, National Institutes of Health, Bethesda, Maryland, USA
- Abel Kho
  - Department of Medicine and Institute for Augmented Intelligence in Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
- Christopher Lunt
  - The All of Us Research Program, National Institutes of Health, Bethesda, Maryland, USA
- Bradley Malin
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Karthik Natarajan
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Consuelo H Wilkins
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Hua Xu
  - Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, Connecticut, USA
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Dan M Roden
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, USA
  - Department of Pharmacology, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- David Glazer
  - Verily Life Sciences, South San Francisco, California, USA
- Paul A Harris
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
7
Yan L, Yin Z, Zhang H, Zhao Z, Wang M, Müller A, Kallenborn F, Wichmann A, Wei Y, Niu B, Schmidt B, Liu W. RabbitQCPlus 2.0: More efficient and versatile quality control for sequencing data. Methods 2023; 216:39-50. [PMID: 37330158 DOI: 10.1016/j.ymeth.2023.06.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2023] [Revised: 05/26/2023] [Accepted: 06/12/2023] [Indexed: 06/19/2023]
Abstract
Assessing the quality of sequencing data plays a crucial role in downstream data analysis. However, existing tools often achieve sub-optimal efficiency, especially when dealing with compressed files or performing complicated quality control operations such as over-representation analysis and error correction. We present RabbitQCPlus, an ultra-efficient quality control tool for modern multi-core systems. RabbitQCPlus uses vectorization, memory copy reduction, parallel (de)compression, and optimized data structures to achieve substantial performance gains. It is 1.1 to 5.4 times faster when performing basic quality control operations compared to state-of-the-art applications yet requires fewer compute resources. Moreover, RabbitQCPlus is at least 4 times faster than other applications when processing gzip-compressed FASTQ files and 1.3 times faster with the error correction module turned on. Furthermore, it takes less than 4 minutes to process 280 GB of plain FASTQ sequencing data, while other applications take at least 22 minutes on a 48-core server when enabling the per-read over-representation analysis. C++ sources are available at https://github.com/RabbitBio/RabbitQCPlus.
Affiliation(s)
- Lifeng Yan
  - School of Software, Shandong University, Jinan, China
- Zekun Yin
  - School of Software, Shandong University, Jinan, China
- Hao Zhang
  - School of Software, Shandong University, Jinan, China
- Zhan Zhao
  - School of Software, Shandong University, Jinan, China
- Mingkai Wang
  - School of Software, Shandong University, Jinan, China
- André Müller
  - Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
- Felix Kallenborn
  - Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
- Alexander Wichmann
  - Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
- Yanjie Wei
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Beifang Niu
  - Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
- Bertil Schmidt
  - Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
- Weiguo Liu
  - School of Software, Shandong University, Jinan, China
8
Dall'Alba G, Casa PL, Abreu FPD, Notari DL, de Avila E Silva S. A Survey of Biological Data in a Big Data Perspective. Big Data 2022; 10:279-297. [PMID: 35394342 DOI: 10.1089/big.2020.0383] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/14/2023]
Abstract
The amount of available data is continuously growing. This phenomenon has given rise to a new concept, named big data. The key technologies related to big data are cloud computing (infrastructure) and Not Only SQL (NoSQL; data storage). In addition, for data analysis, machine learning algorithms such as decision trees, support vector machines, artificial neural networks, and clustering techniques present promising results. In a biological context, big data has many applications due to the large number of biological databases available. Some limitations of biological big data are related to the inherent features of these data, such as high degrees of complexity and heterogeneity, since biological systems provide information from the atomic level to interactions between organisms or their environment. Such characteristics make most bioinformatics-based applications difficult to build, configure, and maintain. Although the rise of big data is relatively recent, it has contributed to a better understanding of the underlying mechanisms of life. The main goal of this article is to provide a concise and reliable survey of the application of big data-related technologies in biology. As such, some fundamental concepts of information technology, including storage resources, analysis, and data sharing, are described along with their relation to biological data.
Affiliation(s)
- Gabriel Dall'Alba
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
  - Genome Science and Technology Program, Faculty of Science, The University of British Columbia, Vancouver, Canada
- Pedro Lenz Casa
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Fernanda Pessi de Abreu
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Daniel Luis Notari
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
- Scheila de Avila E Silva
  - Computational Biology and Bioinformatics Laboratory, Biotechnology Institute, Department of Life Sciences, University of Caxias do Sul, Caxias do Sul, Brazil
9
Gudur VY, Maheshwari S, Bhardwaj S, Acharyya A, Shafik R. Hardware-Algorithm Codesign for Fast and Energy Efficient Approximate String Matching on FPGA for Computational Biology. Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference 2022; 2022:87-90. [PMID: 36086088 DOI: 10.1109/embc48229.2022.9870924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Myers bit-vector algorithm for approximate string matching (ASM) is a dynamic programming based approach that takes advantage of bit-parallel operations. It is one of the fastest algorithms to find the edit distance between two strings. In computational biology, ASM is used at various stages of the computational pipeline, including proteomics and genomics. The computationally intensive nature of the underlying algorithms for ASM operating on the large volume of data necessitates the acceleration of these algorithms. In this paper, we propose a novel ASM architecture based on Myers bit-vector algorithm for parallel searching of multiple query patterns in the biological databases. The proposed parallel architecture uses multiple processing engines and hardware/software codesign for an accelerated and energy-efficient design of ASM algorithm on hardware. In comparison with related literature, the proposed design achieves 22× better performance with a demonstrative energy efficiency of ∼500 × 10^9 cell updates per joule.
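For readers unfamiliar with the bit-parallel idea, the following is a software sketch of the Myers bit-vector recurrence, in the global edit distance form later described by Hyyrö. It is not the FPGA architecture proposed in the paper. The function name `myers_edit_distance` is an assumption, and Python's arbitrary-precision integers stand in for the fixed-width machine words a hardware or C implementation would use.

```python
def myers_edit_distance(pattern, text):
    """Levenshtein distance via the Myers bit-vector recurrence.

    Each column of the dynamic programming matrix is encoded as two bit
    vectors of vertical differences (+1 in vp, -1 in vn), so a whole
    column is updated with a constant number of word operations.
    """
    m = len(pattern)
    if m == 0:
        return len(text)

    # peq[c] has bit i set when pattern[i] == c.
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    mask = (1 << m) - 1      # keeps every vector m bits wide
    last = 1 << (m - 1)      # bit for the last pattern row
    vp, vn = mask, 0         # first column is 0, 1, 2, ..., m
    score = m

    for ch in text:
        eq = peq.get(ch, 0)
        d0 = ((((eq & vp) + vp) ^ vp) | eq | vn) & mask   # diagonal zeros
        hp = (vn | ~(d0 | vp)) & mask                      # horizontal +1
        hn = d0 & vp                                       # horizontal -1
        if hp & last:
            score += 1
        elif hn & last:
            score -= 1
        hp = ((hp << 1) | 1) & mask   # top row grows by 1 per text character
        hn = (hn << 1) & mask
        vp = (hn | ~(d0 | hp)) & mask
        vn = hp & d0
    return score


if __name__ == "__main__":
    print(myers_edit_distance("kitten", "sitting"))  # 3
```

The loop body touches each text character once and uses only shifts, adds, and bitwise logic, which is why the recurrence maps naturally onto parallel processing engines of the kind the abstract describes.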
10
Rocha M, Massarani L, Souza SJD, Vasconcelos ATRD. The past, present and future of genomics and bioinformatics: A survey of Brazilian scientists. Genet Mol Biol 2022; 45:e20210354. [PMID: 35671453 PMCID: PMC9169998 DOI: 10.1590/1678-4685-gmb-2021-0354] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/15/2021] [Accepted: 03/05/2022] [Indexed: 11/22/2022]
Abstract
Brazil has one of the highest rates of scientific production, occupying the ninth position among countries with genome-sequencing projects. Considering the rapid development of this research area and the diversity of professionals involved, the present study aims to understand the expectations, past experiences and the current scenario of Brazilian research in bioinformatics and genomics. The present research was carried out by analyzing the perceptions of 576 researchers in genomics and bioinformatics in Brazil through content and sentiment analysis techniques. This group of participants is equivalent to 48% of the members of the research community. The results suggest that most researchers have a positive perception of the potential of this research area. However, there is concern about the lack of funding for investing in equipment and professional training. As part of a wish list for the future, researchers highlighted the need for higher funding, formal education, and collaboration among research networks. When asked about genomics and bioinformatics in other countries, the participants recognize that sequencing technologies and infrastructure are more accessible, allowing better data volume expansion.
Affiliation(s)
- Sandro José de Souza
  - Universidade Federal do Rio Grande do Norte, Brazil; Sichuan University, China
11
Nunes IJG, Recamonde-Mendoza M, Feltes BC. Gene Expression Analysis Platform (GEAP): A highly customizable, fast, versatile and ready-to-use microarray analysis platform. Genet Mol Biol 2021; 45:e20210077. [PMID: 34927664 PMCID: PMC8754388 DOI: 10.1590/1678-4685-gmb-2021-0077] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/08/2021] [Accepted: 11/01/2021] [Indexed: 12/17/2022]
Abstract
There are still numerous challenges to be overcome in microarray data analysis because advanced, state-of-the-art analyses are restricted to users who can program. Here we present the Gene Expression Analysis Platform (GEAP), versatile, customizable, optimized, and portable software developed for microarray analysis. GEAP was developed in C# for the graphical user interface, data querying, storage, results filtering, and dynamic plotting, and in R for data processing, quality analysis, and differential expression. Through a new automated system that identifies microarray file formats, retrieves contents, detects file corruption, and solves dependencies, GEAP deals with datasets independently of platform. GEAP covers 32 statistical options and supports quality assessment, differential expression from single- and dual-channel experiments, and gene ontology. Users can explore results through different plots and filtering options. Finally, all data can be saved and organized through storage features optimized for memory and data retrieval, with faster performance than R. These features, along with other new options, are not yet present in any microarray analysis software. GEAP accomplishes data analysis in a faster, more straightforward, and friendlier way than other similar software, while keeping the flexibility for sophisticated procedures. By developing optimizations, unique customizations, and new features, GEAP is intended for both advanced and non-programming users.
Affiliation(s)
- Mariana Recamonde-Mendoza
  - Universidade Federal do Rio Grande do Sul, Instituto de Informática, Porto Alegre, RS, Brazil; Hospital de Clínicas de Porto Alegre (HCPA), Núcleo de Bioinformática, Porto Alegre, RS, Brazil
- Bruno César Feltes
  - Universidade Federal do Rio Grande do Sul, Instituto de Informática, Porto Alegre, RS, Brazil; Universidade Federal do Rio Grande do Sul, Instituto de Biociências, Departamento de Genética, Porto Alegre, RS, Brazil; Universidade Federal do Rio Grande do Sul, Instituto de Biociências, Departamento de Biofísica, Porto Alegre, RS, Brazil
12
Ahmed H, Alarabi L, El-Sappagh S, Soliman H, Elmogy M. Genetic variations analysis for complex brain disease diagnosis using machine learning techniques: opportunities and hurdles. PeerJ Comput Sci 2021; 7:e697. [PMID: 34616886 PMCID: PMC8459785 DOI: 10.7717/peerj-cs.697] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/12/2021] [Accepted: 08/05/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND AND OBJECTIVES This paper presents an in-depth review of state-of-the-art genetic variation analysis for discovering genes associated with complex genetic disorders of the brain. We first introduce the genetic analysis of complex brain diseases, genetic variation, and DNA microarrays. The review then focuses on the machine learning methods available for complex brain disease classification, covering the datasets, preprocessing, feature selection and extraction, and classification strategies. In particular, we concentrate on single nucleotide polymorphisms (SNPs), which offer the highest resolution for genomic fingerprinting when tracking disease genes. Subsequently, the study provides an overview of applications to specific diseases, including autism spectrum disorder, brain cancer, and Alzheimer's disease (AD). The study argues that, despite significant recent developments in the analysis and treatment of genetic disorders, elucidating causative mutations remains a considerable challenge, especially when implementing genetic analysis in clinical practice. Finally, the review critically discusses the applicability of genetic variation analysis to complex brain disease identification and highlights future challenges. METHODS We followed a literature survey methodology to obtain data from academic databases, with defined inclusion and exclusion criteria. Article selection proceeded in three stages, and the principal machine learning methods for disease classification are presented in more detail at each stage. RESULTS SNP-based machine learning was found to be widely used to address problems of genetic variation in complex gene-related diseases.
CONCLUSIONS Despite significant developments in the diagnosis and treatment of genetic diseases over the past two decades, the causative mutation still cannot be determined in a large percentage of cases, and a final genetic diagnosis remains elusive. Variations in the genes related to brain disorders therefore need to be detected in the early stages of disease.
Affiliation(s)
- Hala Ahmed
- Information Technology Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Louai Alarabi
- Department of Computer Science, Umm Al-Qura University, Makkah, Saudi Arabia
| | - Shaker El-Sappagh
- Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Santiago de Compostela, Spain
- Information Systems Department, Faculty of Computers and Artificial Intelligence, Benha University, Benha, Egypt
| | - Hassan Soliman
- Information Technology Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
| | - Mohammed Elmogy
- Information Technology Department, Faculty of Computers and Information, Mansoura University, Mansoura, Egypt
|
13
|
Subrahmanya SVG, Shetty DK, Patil V, Hameed BMZ, Paul R, Smriti K, Naik N, Somani BK. The role of data science in healthcare advancements: applications, benefits, and future prospects. Ir J Med Sci 2021; 191:1473-1483. [PMID: 34398394 PMCID: PMC9308575 DOI: 10.1007/s11845-021-02730-z] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 05/09/2021] [Accepted: 07/28/2021] [Indexed: 11/27/2022]
Abstract
Data science is an interdisciplinary field that extracts knowledge and insights from structured and unstructured data, using scientific methods, data mining techniques, machine-learning algorithms, and big data. The healthcare industry generates large datasets of useful information on patient demography, treatment plans, results of medical examinations, insurance, etc. Data collected from Internet of Things (IoT) devices are attracting the attention of data scientists. Data science helps to process, manage, analyze, and assimilate the large quantities of fragmented, structured, and unstructured data created by healthcare systems. These data require effective management and analysis to yield factual results. The article reviews and discusses the data cleansing, data mining, data preparation, and data analysis processes used in healthcare applications. It provides insight into the status and prospects of big data analytics in healthcare, highlights the advantages, describes the frameworks and techniques used, outlines the challenges currently faced, and discusses viable solutions. Data science and big data analytics can provide practical insights and aid strategic decision-making concerning the health system. They help build a comprehensive view of patients, consumers, and clinicians. Data-driven decision-making opens up new possibilities to boost healthcare quality.
Affiliation(s)
- Sri Venkat Gunturi Subrahmanya
- Department of Electrical and Electronics Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, India
| | - Dasharathraj K Shetty
- Department of Humanities and Management, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, India
| | - Vathsala Patil
- Department of Oral Medicine and Radiology, Manipal College of Dental Sciences, Manipal, Manipal Academy of Higher Education, Manipal, Karnataka, India.
| | - B M Zeeshan Hameed
- Department of Urology, Father Muller Medical College, Mangalore, Karnataka, India
| | - Rahul Paul
- Department of Radiation Oncology, Massachusetts General Hospital, Boston, MA, USA
| | - Komal Smriti
- Department of Oral Medicine and Radiology, Manipal College of Dental Sciences, Manipal, Manipal Academy of Higher Education, Manipal, Karnataka, India
| | - Nithesh Naik
- Department of Mechanical and Manufacturing Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, India
| | - Bhaskar K Somani
- Department of Urology, University Hospital Southampton NHS Trust, Southampton, UK
|
14
|
Xiang S, Li J, Shen J, Zhao Y, Wu X, Li M, Yang X, Kaboli PJ, Du F, Zheng Y, Wen Q, Cho CH, Yi T, Xiao Z. Identification of Prognostic Genes in the Tumor Microenvironment of Hepatocellular Carcinoma. Front Immunol 2021; 12:653836. [PMID: 33897701 PMCID: PMC8059369 DOI: 10.3389/fimmu.2021.653836] [Citation(s) in RCA: 59] [Impact Index Per Article: 14.8] [Received: 01/15/2021] [Accepted: 02/10/2021] [Indexed: 12/12/2022] Open
Abstract
Background: Hepatocellular carcinoma (HCC) is one of the most common malignant tumors in the world. The efficacy of immunotherapy usually depends on the interaction of immunomodulation in the tumor microenvironment (TME). This study aimed to explore the potential stromal-immune score-based prognostic genes related to immunotherapy in HCC through bioinformatics analysis. Methods: The ESTIMATE algorithm was applied to calculate the immune/stromal/Estimate scores and tumor purity of HCC using the Cancer Genome Atlas (TCGA) transcriptome data. Functional enrichment analysis of differentially expressed genes (DEGs) was performed with the Database for Annotation, Visualization, and Integrated Discovery (DAVID). Univariate and multivariate Cox regression analysis and least absolute shrinkage and selection operator (LASSO) regression analysis were performed for prognostic gene screening. The expression and prognostic value of these genes were further verified using the KM-plotter database and the Human Protein Atlas (HPA) database. The correlation between the selected genes and immune cell infiltration was analyzed with the single sample gene set enrichment analysis (ssGSEA) algorithm and the Tumor Immune Estimation Resource (TIMER). Results: Data analysis revealed that higher immune/stromal/Estimate scores were significantly associated with better survival benefits in HCC within 7 years, while tumor purity showed the reverse trend. DEGs based on both immune and stromal scores primarily affected the cytokine–cytokine receptor interaction signaling pathway. Among the DEGs, three genes (CASKIN1, EMR3, and GBP5) were found to be most significantly associated with survival. Moreover, the expression levels of CASKIN1, EMR3, and GBP5 were significantly correlated with immune/stromal/Estimate scores or tumor purity and with the infiltration of multiple immune cell types. Among them, GBP5 was highly related to immune infiltration.
Conclusion: This study identified three key genes that are related to the TME and have prognostic significance in HCC, and that may be promising markers for predicting immunotherapy outcomes.
Affiliation(s)
- Shixin Xiang
- Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China.,South Sichuan Institute of Translational Medicine, Luzhou, China
| | - Jing Li
- Department of Oncology and Hematology, Hospital (T.C.M) Affiliated to Southwest Medical University, Luzhou, China
| | - Jing Shen
- Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China.,South Sichuan Institute of Translational Medicine, Luzhou, China
| | - Yueshui Zhao
- Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China.,South Sichuan Institute of Translational Medicine, Luzhou, China
| | - Xu Wu
- Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China.,South Sichuan Institute of Translational Medicine, Luzhou, China
| | - Mingxing Li
- Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China.,South Sichuan Institute of Translational Medicine, Luzhou, China
| | - Xiao Yang
- Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China
| | - Parham Jabbarzadeh Kaboli
- Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China.,South Sichuan Institute of Translational Medicine, Luzhou, China
| | - Fukuan Du
- Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China.,South Sichuan Institute of Translational Medicine, Luzhou, China
| | - Yuan Zheng
- Neijiang Health and Health Vocational College, Neijiang, China
| | - Qinglian Wen
- Department of Oncology, Affiliated Hospital of Southwest Medical University, Luzhou, China
| | - Chi Hin Cho
- Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China.,South Sichuan Institute of Translational Medicine, Luzhou, China.,Faculty of Medicine, School of Biomedical Sciences, The Chinese University of Hong Kong, Hong Kong, China
| | - Tao Yi
- School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China
| | - Zhangang Xiao
- Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China.,Department of Pharmacy, The Affiliated Hospital of Southwest Medical University, Luzhou, China
|
15
|
Zou Y, Zhu Y, Li Y, Wu FX, Wang J. Parallel computing for genome sequence processing. Brief Bioinform 2021; 22:6210355. [PMID: 33822883 DOI: 10.1093/bib/bbab070] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/04/2020] [Revised: 01/26/2021] [Accepted: 02/10/2021] [Indexed: 01/08/2023] Open
Abstract
The rapid increase of genome data brought by gene sequencing technologies poses a massive challenge to data processing. To solve the problems caused by enormous data and complex computing requirements, researchers have proposed many methods and tools, which can be divided into three types: big data storage, efficient algorithm design and parallel computing. The purpose of this review is to investigate popular parallel programming technologies for genome sequence processing. Three common parallel computing models are introduced according to their hardware architectures; each is classified into two or three types and analyzed in terms of its features. Then, parallel computing for genome sequence processing is discussed with four common applications: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing, and pattern detection and searching. For each kind of application, its background is first introduced, and then a list of tools or algorithms is summarized in terms of principle, hardware platform and computing efficiency. The programming model of each hardware platform and application provides a reference for researchers choosing high-performance computing tools. Finally, we discuss the limitations and future trends of parallel computing technologies.
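As a rough illustration of the data-parallel pattern that such tools build on, the sketch below splits a set of reads across workers, counts k-mers in each read independently, and merges the partial counts. It is a generic map-and-merge example rather than a method taken from the review; the function names and the toy reads are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_kmers(sequence, k):
    """Count every overlapping k-mer in one sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def parallel_kmer_count(sequences, k, workers=4):
    """Map count_kmers over the sequences with a worker pool, then merge."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(lambda seq: count_kmers(seq, k), sequences)
        total = Counter()
        for counts in partial_counts:
            total.update(counts)  # reduce step: add the per-read counts
    return total


# Toy reads for illustration only (hypothetical data).
reads = ["ACGTAC", "GTACGT", "TTACGA"]
kmer_counts = parallel_kmer_count(reads, k=3)
print(kmer_counts.most_common(2))
```

In standard CPython, threads do not speed up CPU-bound work like this because of the global interpreter lock, so the sketch only shows how the work decomposes. A production tool would typically use processes, native code, or accelerators for the per-chunk step.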
Affiliation(s)
- You Zou
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yuejie Zhu
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yaohang Li
- Computer Science at Old Dominion University, USA
| | - Fang-Xiang Wu
- College of Engineering and the Department of Computer Science at the University of Saskatchewan, Saskatoon, Canada
| | - Jianxin Wang
- School of Computer Science and Engineering at Central South University, Changsha, Hunan, China
|
16
|
Integrative Analysis of MAPK14 as a Potential Biomarker for Cardioembolic Stroke. BIOMED RESEARCH INTERNATIONAL 2020; 2020:9502820. [PMID: 32879891 PMCID: PMC7448239 DOI: 10.1155/2020/9502820] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/29/2020] [Revised: 07/09/2020] [Accepted: 07/15/2020] [Indexed: 01/22/2023]
Abstract
The aim of this study was to identify candidate genes and biomarkers that are significantly related to cardioembolic stroke (CS) by applying bioinformatics analysis. Weighted gene coexpression network analysis (WGCNA) of the GSE58294 dataset identified 11 CS-related coexpression network modules. Correlation analysis showed that the black and pink modules are significantly associated with CS. A total of 18 core genes in the black module and one core gene in the pink module were determined. We then identified differentially expressed genes (DEGs) of CS at 3 h, 5 h, and 24 h post-onset. Intersecting these sets showed that 311 genes were differentially expressed at all three time points. These genes were mainly enriched in positive regulation of transferase activity and regulation of peptidase activity. The shared DEGs were subjected to protein-protein interaction analysis and subnetwork module analysis. Subsequently, we used cytoHubba to obtain 11 key genes from the DEGs. The intersection of the core genes screened from WGCNA and the key genes selected from the DEGs yielded the MAPK14 gene. Receiver operating characteristic (ROC) analysis of MAPK14 expression in CS at 3 h, 5 h, and 24 h gave areas under the ROC curve (AUC) of 0.923, 0.934, and 0.941, respectively. In summary, MAPK14, identified using WGCNA, showed differential expression in CS. We conclude that MAPK14 can be used as a potential biological marker of CS and has potential to predict the physiopathological condition of CS patients.
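The AUC figures in this abstract summarize how well a single expression value separates CS samples from controls. As a minimal sketch, and not the authors' pipeline, AUC can be computed as the fraction of (case, control) pairs in which the case has the higher value, with ties counted as one half. The function name and the expression values below are invented for illustration.

```python
from itertools import product


def roc_auc(case_values, control_values):
    """Return the area under the ROC curve by pairwise comparison.

    AUC equals the fraction of (case, control) pairs in which the case
    scores higher than the control, counting ties as one half.
    """
    if not case_values or not control_values:
        raise ValueError("both groups need at least one value")
    wins = 0.0
    for case, control in product(case_values, control_values):
        if case > control:
            wins += 1.0
        elif case == control:
            wins += 0.5
    return wins / (len(case_values) * len(control_values))


# Hypothetical expression levels (arbitrary units), not data from the study.
cs_expression = [8.1, 7.9, 8.4, 7.2, 8.8]
control_expression = [7.0, 7.3, 6.8, 7.5, 6.9]

auc = roc_auc(cs_expression, control_expression)
print(f"AUC = {auc:.2f}")
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch; rank-based formulas give the same value more efficiently on large cohorts.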
|
17
|
Garcia-Fossa F, Gaal V, de Jesus MB. PyScratch: An ease of use tool for analysis of scratch assays. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2020; 193:105476. [PMID: 32302889 DOI: 10.1016/j.cmpb.2020.105476] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 09/13/2019] [Revised: 02/28/2020] [Accepted: 03/20/2020] [Indexed: 06/11/2023]
Abstract
BACKGROUND AND OBJECTIVE Image acquisition has greatly benefited from the automation of microscopes, which has been followed by an increase in the amount and complexity of the data acquired. Here, we present PyScratch, a new tool for processing spatial and temporal information from scratch assays. PyScratch is open-source software implemented in Python that analyses the migration area in an automated fashion. METHODS The software was developed in Python. Wound healing assays were used to validate its performance. The images were acquired using Cytation 5™ over 60 h. Data were analyzed using one-way ANOVA. RESULTS PyScratch performed a robust analysis of confluent cells, showing that high plating density affects cell migration. Additionally, PyScratch was approximately six times faster than a semi-automated analysis. CONCLUSIONS PyScratch offers a user-friendly interface that allows researchers with little or no programming skills to perform quantitative analysis of in vitro scratch assays.
Affiliation(s)
- Fernanda Garcia-Fossa
- Department of Biochemistry and Tissue Biology, Institute of Biology, University of Campinas, Campinas, São Paulo, Brazil.
| | - Vladimir Gaal
- Applied Physics Department, University of Campinas, Campinas, São Paulo, Brazil.
| | - Marcelo Bispo de Jesus
- Department of Biochemistry and Tissue Biology, Institute of Biology, University of Campinas, Campinas, São Paulo, Brazil.
|
18
|
Zhou L, Huang W, Yu HF, Feng YJ, Teng X. Exploring TCGA database for identification of potential prognostic genes in stomach adenocarcinoma. Cancer Cell Int 2020; 20:264. [PMID: 32581654 PMCID: PMC7310509 DOI: 10.1186/s12935-020-01351-3] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 03/29/2020] [Accepted: 06/15/2020] [Indexed: 02/15/2023] Open
Abstract
Background Stomach adenocarcinoma (STAD) is the fifth most prevalent cancer in the world and ranks third among cancer-related deaths worldwide. The tumour microenvironment (TME) plays an important role in tumorigenesis, development, and metastasis. Hence, we calculated the immune and stromal scores to find potential prognosis-related genes in STAD using bioinformatics analysis. Methods The ESTIMATE algorithm was used to calculate the immune/stromal scores of the STAD samples. Functional enrichment analysis, protein–protein interaction (PPI) network analysis, and overall survival analysis were then performed on the differentially expressed genes. We validated these genes using data from the Gene Expression Omnibus database. Finally, we used the Human Protein Atlas (HPA) database to verify these genes at the protein level by immunohistochemistry (IHC). Results Data analysis revealed a correlation between stromal/immune scores and the TNM staging system. The top 10 core genes extracted from the PPI network were primarily involved in immune responses, extracellular matrix, and cell adhesion. Thirty-one genes were validated as associated with poor prognosis, and 16 genes were upregulated in tumour tissues compared with normal tissues at the protein level. Conclusions In summary, we identified genes associated with the tumour microenvironment with prognostic implications in STAD, which may become potential therapeutic markers leading to better clinical outcomes.
Affiliation(s)
- Lin Zhou
- School of Information Science and Technology, University of Science and Technology of China, Hefei, 230026 Anhui China
| | - Wei Huang
- Beijing Key Laboratory for Tumor Invasion and Metastasis, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Capital Medical University, Beijing, 100069 China
| | - He-Fen Yu
- Beijing Key Laboratory for Tumor Invasion and Metastasis, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Capital Medical University, Beijing, 100069 China
| | - Ya-Juan Feng
- School of Information Science and Technology, University of Science and Technology of China, Hefei, 230026 Anhui China
| | - Xu Teng
- Beijing Key Laboratory for Tumor Invasion and Metastasis, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Capital Medical University, Beijing, 100069 China
|
19
|
Hansen AW, Murugan M, Li H, Khayat MM, Wang L, Rosenfeld J, Andrews BK, Jhangiani SN, Coban Akdemir ZH, Sedlazeck FJ, Ashley-Koch AE, Liu P, Muzny DM, Davis EE, Katsanis N, Sabo A, Posey JE, Yang Y, Wangler MF, Eng CM, Sutton VR, Lupski JR, Boerwinkle E, Gibbs RA. A Genocentric Approach to Discovery of Mendelian Disorders. Am J Hum Genet 2019; 105:974-986. [PMID: 31668702 PMCID: PMC6849092 DOI: 10.1016/j.ajhg.2019.09.027] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 06/06/2019] [Accepted: 09/27/2019] [Indexed: 12/20/2022] Open
Abstract
The advent of inexpensive, clinical exome sequencing (ES) has led to the accumulation of genetic data from thousands of samples from individuals affected with a wide range of diseases, but for whom the underlying genetic and molecular etiology of their clinical phenotype remains unknown. In many cases, detailed phenotypes are unavailable or poorly recorded and there is little family history to guide study. To accelerate discovery, we integrated ES data from 18,696 individuals referred for suspected Mendelian disease, together with relatives, in an Apache Hadoop data lake (Hadoop Architecture Lake of Exomes [HARLEE]) and implemented a genocentric analysis that rapidly identified 154 genes harboring variants suspected to cause Mendelian disorders. The approach did not rely on case-specific phenotypic classifications but was driven by optimization of gene- and variant-level filter parameters utilizing historical Mendelian disease-gene association discovery data. Variants in 19 of the 154 candidate genes were subsequently reported as causative of a Mendelian trait and additional data support the association of all other candidate genes with disease endpoints.
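To make the idea of a gene-first (rather than phenotype-first) screen concrete, the sketch below applies a variant-level filter and then a gene-level filter to a handful of records. It is only a toy illustration: the `Variant` fields, the thresholds `max_af` and `min_samples`, and the gene names are all hypothetical and are not the filter parameters optimized in the study.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    sample_id: str
    gene: str
    allele_frequency: float  # population frequency of the variant
    is_damaging: bool        # e.g. predicted loss of function


def candidate_genes(variants, max_af=0.001, min_samples=2):
    """Return genes carrying rare damaging variants in enough individuals."""
    samples_by_gene = defaultdict(set)
    for v in variants:
        # Variant-level filter: keep rare, predicted-damaging variants only.
        if v.is_damaging and v.allele_frequency <= max_af:
            samples_by_gene[v.gene].add(v.sample_id)
    # Gene-level filter: require support from several distinct individuals.
    return sorted(g for g, s in samples_by_gene.items() if len(s) >= min_samples)


# Hypothetical records, not data from the study.
variants = [
    Variant("S1", "GENE_A", 0.0001, True),
    Variant("S2", "GENE_A", 0.0002, True),
    Variant("S3", "GENE_B", 0.0500, True),   # too common
    Variant("S4", "GENE_C", 0.0001, False),  # not predicted damaging
    Variant("S5", "GENE_D", 0.0001, True),   # only one individual
]
print(candidate_genes(variants))
```

Note that no phenotype label appears anywhere in the filter, which is the point of the gene-first framing; on a cohort of this scale the same two-step filter would run inside a distributed query engine rather than in a Python loop.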
Affiliation(s)
- Adam W Hansen
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Mullai Murugan
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - He Li
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Michael M Khayat
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Liwen Wang
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Jill Rosenfeld
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - B Kim Andrews
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Shalini N Jhangiani
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Zeynep H Coban Akdemir
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Allison E Ashley-Koch
- Duke Molecular Physiology Institute, Duke University Medical Center, Durham, NC 27710, USA; Department of Medicine, Duke University Medical Center, Durham, NC 27710, USA
| | - Pengfei Liu
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Donna M Muzny
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Erica E Davis
- Pediatric Genetic and translational Medicine Center (P-GeM), Stanley Manne Children's Research Institute, Chicago, IL 60611, USA; Department of Pediatrics, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - Nicholas Katsanis
- Pediatric Genetic and translational Medicine Center (P-GeM), Stanley Manne Children's Research Institute, Chicago, IL 60611, USA; Department of Pediatrics, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - Aniko Sabo
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Jennifer E Posey
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Yaping Yang
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Michael F Wangler
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Christine M Eng
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - V Reid Sutton
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Texas Children's Hospital, Houston, TX 77030, USA
| | - James R Lupski
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA; Texas Children's Hospital, Houston, TX 77030, USA; Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Eric Boerwinkle
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA; School of Public Health, UTHealth, Houston, TX 77030, USA
| | - Richard A Gibbs
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA.
|
20
|
Wong YKE, Lam KW, Ho KY, Yu CSA, Cho WCS, Tsang HF, Chu MKM, Ng PWL, Tai CSW, Chan LWC, Wong EYL, Wong SCC. The applications of big data in molecular diagnostics. Expert Rev Mol Diagn 2019; 19:905-917. [PMID: 31422710 DOI: 10.1080/14737159.2019.1657834] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 02/16/2019] [Accepted: 08/16/2019] [Indexed: 12/30/2022]
Abstract
Introduction: Big Data technologies have instilled an informational perspective in our understanding of the world. However, fundamental issues such as the management and storage of data can create privacy concerns. Heterogeneous types of data pose challenges in reproducibility and standardization. There is now an opportunity to help health-care professionals, educators, and policy-makers understand the impact of Big Data, and to steer the development roadmap so that it positively impacts the molecular diagnostic industry. Areas covered: In this review, we discuss the latest trends in applying Big Data to several key areas of molecular diagnostics: metagenomics, Mendelian disease screening, personalized medicine, and metabolomics. The limitations of utilizing bioinformatics and Big Data analytic tools are also summarized. We further propose an action plan on how to prepare a new generation of health-care professionals to step into the age of Big Data through a tailor-made bioinformatics training program. Expert opinion: In order to cope with the development of these powerful technologies, issues of ethics, regulation, and data format standardization urgently need to be addressed. In addition, long-term planning to train medical scientists, pathologists, and specialists in bioinformatics is necessary. It is an appropriate time to review all these issues before implementing these tests for patients' diagnosis, prognosis and treatment efficacy.
Affiliation(s)
- Yin Kwan Evelyn Wong
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | - Ka Wai Lam
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | - Ka Yi Ho
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | | | - William Chi-Shing Cho
- Department of Clinical Oncology, Queen Elizabeth Hospital, Hong Kong Special Administrative Region
| | - Hin Fung Tsang
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | - Man Kee Maggie Chu
- Department of Life Science, The Hong Kong University of Science and Technology, Hong Kong Special Administrative Region
| | - Po Wah Lawrence Ng
- Department of Pathology, Queen Elizabeth Hospital, Hong Kong Special Administrative Region
| | - Chi Shing William Tai
- Department of Applied Biology and Chemical Technology, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | - Lawrence Wing Chi Chan
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | - Elaine Yue Ling Wong
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | - Sze Chuen Cesar Wong
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
|
21
|
Na JC, Lee I, Rhee JK, Shin SY. Fast single individual haplotyping method using GPGPU. Comput Biol Med 2019; 113:103421. [PMID: 31499396 DOI: 10.1016/j.compbiomed.2019.103421] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 03/04/2019] [Revised: 08/28/2019] [Accepted: 08/28/2019] [Indexed: 11/27/2022]
Abstract
BACKGROUND Most bioinformatic tools for next generation sequencing (NGS) data are computationally intensive, requiring a large amount of computational power for processing and analysis. Here the utility of graphic processing units (GPUs) for NGS data computation is assessed. METHOD In a previous study, we developed a probabilistic evolutionary algorithm with toggling for haplotyping (PEATH) method based on the estimation of distribution algorithm and toggling heuristic. Here, we parallelized the PEATH method (PEATH/G) using general-purpose computing on GPU (GPGPU). RESULTS The PEATH/G runs approximately 46.8 times and 25.4 times faster than PEATH on the NA12878 fosmid-sequencing dataset and the HuRef dataset, respectively, with an NVIDIA GeForce GTX 1660Ti. Moreover, the PEATH/G is approximately 13.3 times faster on the fosmid-sequencing dataset, even with an inexpensive conventional GPGPU (NVIDIA GeForce GTX 950). CONCLUSIONS PEATH/G can be a practical single individual haplotyping tool in terms of both its accuracy and speed. GPGPU can help reduce the running time of NGS analysis tools.
Affiliation(s)
- Joong Chae Na
- Department of Computer Science and Engineering, Sejong University, Seoul, 05006, South Korea
| | - Inbok Lee
- Department of Software, Korea Aerospace University, Goyang, 10540, South Korea
| | - Je-Keun Rhee
- School of Systems Biomedical Science, Soongsil University, Seoul, 06978, South Korea.
| | - Soo-Yong Shin
- Department of Digital Health, SAIHST, Sungkyunkwan University, Seoul, 06351, South Korea; Big Data Research Center, Samsung Medical Center, Seoul, 06351, South Korea.
|
22
|
Jung H, Winefield C, Bombarely A, Prentis P, Waterhouse P. Tools and Strategies for Long-Read Sequencing and De Novo Assembly of Plant Genomes. TRENDS IN PLANT SCIENCE 2019; 24:700-724. [PMID: 31208890 DOI: 10.1016/j.tplants.2019.05.003] [Citation(s) in RCA: 49] [Impact Index Per Article: 8.2] [Received: 01/06/2019] [Revised: 05/01/2019] [Accepted: 05/10/2019] [Indexed: 05/16/2023]
Abstract
The commercial release of third-generation sequencing technologies (TGSTs), giving long and ultra-long sequencing reads, has stimulated the development of new tools for assembling highly contiguous genome sequences with unprecedented accuracy across complex repeat regions. We survey here a wide range of emerging sequencing platforms and analytical tools for de novo assembly, provide background information for each of their steps, and discuss the spectrum of available options. Our decision tree recommends workflows for the generation of a high-quality genome assembly when used in combination with the specific needs and resources of a project.
Affiliation(s)
- Hyungtaek Jung
- Centre for Tropical Crops and Biocommodities, Queensland University of Technology, Brisbane, QLD 4001, Australia.
| | - Christopher Winefield
- Department of Wine, Food, and Molecular Biosciences, Lincoln University, 7647 Christchurch, New Zealand
| | - Aureliano Bombarely
- Department of Bioscience, University of Milan, Milan 20133, Italy; School of Plants and Environmental Sciences, Virginia Tech, Blacksburg, VA 24061, USA
| | - Peter Prentis
- School of Earth, Environmental, and Biological Sciences, Queensland University of Technology, Brisbane, QLD, 4001, Australia
| | - Peter Waterhouse
- Centre for Tropical Crops and Biocommodities, Queensland University of Technology, Brisbane, QLD 4001, Australia; School of Biological Sciences, University of Sydney, Sydney, NSW 2006, Australia.
| |
|
23
|
Roberts AD, Finnigan W, Wolde-Michael E, Kelly P, Blaker JJ, Hay S, Breitling R, Takano E, Scrutton NS. Synthetic biology for fibres, adhesives and active camouflage materials in protection and aerospace. MRS Communications 2019; 9:486-504. [PMID: 31281737 PMCID: PMC6609449 DOI: 10.1557/mrc.2019.35] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 11/30/2018] [Accepted: 03/12/2019] [Indexed: 05/03/2023]
Abstract
Synthetic biology has huge potential to produce the next generation of advanced materials by accessing previously unreachable (bio)chemical space. In this prospective review, we take a snapshot of current activity in this rapidly developing area, focussing on prominent examples for high-performance applications such as those required for protective materials and the aerospace sector. The continued growth of this emerging field will be facilitated by the convergence of expertise from a range of diverse disciplines, including molecular biology, polymer chemistry, materials science and process engineering. This review highlights the most significant recent advances and addresses the cross-disciplinary challenges currently being faced.
Affiliation(s)
- Aled D. Roberts
- Manchester Institute of Biotechnology, Manchester Synthetic Biology Research Centre SYBIOCHEM, School of Chemistry, The University of Manchester, Manchester, UK, M1 7DN
- Bio-Active Materials Group, School of Materials, The University of Manchester, Manchester, UK, M13 9PL
| | - William Finnigan
- Manchester Institute of Biotechnology, Manchester Synthetic Biology Research Centre SYBIOCHEM, School of Chemistry, The University of Manchester, Manchester, UK, M1 7DN
| | - Emmanuel Wolde-Michael
- Manchester Institute of Biotechnology, Manchester Synthetic Biology Research Centre SYBIOCHEM, School of Chemistry, The University of Manchester, Manchester, UK, M1 7DN
| | - Paul Kelly
- Manchester Institute of Biotechnology, Manchester Synthetic Biology Research Centre SYBIOCHEM, School of Chemistry, The University of Manchester, Manchester, UK, M1 7DN
| | - Jonny J. Blaker
- Bio-Active Materials Group, School of Materials, The University of Manchester, Manchester, UK, M13 9PL
| | - Sam Hay
- Manchester Institute of Biotechnology, Manchester Synthetic Biology Research Centre SYBIOCHEM, School of Chemistry, The University of Manchester, Manchester, UK, M1 7DN
| | - Rainer Breitling
- Manchester Institute of Biotechnology, Manchester Synthetic Biology Research Centre SYBIOCHEM, School of Chemistry, The University of Manchester, Manchester, UK, M1 7DN
| | - Eriko Takano
- Manchester Institute of Biotechnology, Manchester Synthetic Biology Research Centre SYBIOCHEM, School of Chemistry, The University of Manchester, Manchester, UK, M1 7DN
| | - Nigel S. Scrutton
- Manchester Institute of Biotechnology, Manchester Synthetic Biology Research Centre SYBIOCHEM, School of Chemistry, The University of Manchester, Manchester, UK, M1 7DN
| |
|
24
|
D'Argenio V. The High-Throughput Analyses Era: Are We Ready for the Data Struggle? High Throughput 2018; 7:E8. [PMID: 29498666 PMCID: PMC5876534 DOI: 10.3390/ht7010008] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Received: 12/28/2017] [Revised: 02/16/2018] [Accepted: 02/27/2018] [Indexed: 12/23/2022] Open
Abstract
Recent and rapid technological advances in molecular sciences have dramatically increased the ability to carry out high-throughput studies characterized by big data production. This, in turn, has had the negative effect of exposing a gap between data yield and data analysis. Indeed, big data management is becoming an increasingly important aspect of many fields of molecular research, including the study of human diseases. Now, the challenge is to identify, within the huge amount of data obtained, that which is of clinical relevance. In this context, issues related to data interpretation, sharing and storage need to be assessed and standardized. Once this is achieved, the integration of data from different -omic approaches will improve the diagnosis, monitoring and therapy of diseases by allowing the identification of novel, potentially actionable biomarkers in view of personalized medicine.
Affiliation(s)
- Valeria D'Argenio
- CEINGE-Biotecnologie Avanzate, via G. Salvatore 486, 80145 Naples, Italy.
- Department of Molecular Medicine and Medical Biotechnologies, University of Naples Federico II, via Pansini 5, 80131 Naples, Italy.
| |
|
25
|
Di Donato A, Filippone E, Ercolano MR, Frusciante L. Genome Sequencing of Ancient Plant Remains: Findings, Uses and Potential Applications for the Study and Improvement of Modern Crops. Frontiers in Plant Science 2018; 9:441. [PMID: 29719544 PMCID: PMC5914272 DOI: 10.3389/fpls.2018.00441] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 01/12/2018] [Accepted: 03/21/2018] [Indexed: 05/08/2023]
Abstract
The advent of new sequencing technologies is revolutionizing the study of ancient DNA (aDNA). In the last 30 years, DNA extracted from the ancient remains of several plant species has been explored in small-scale studies, contributing to the understanding of the adaptation and migration patterns of important crops. More recently, next-generation sequencing (NGS) technologies applied to aDNA have opened up new avenues of research, allowing investigation of the domestication process on the whole-genome scale. Genomic approaches based on genome-wide and targeted sequencing have been shown to provide important information on crop evolution and on the history of agriculture. Huge amounts of NGS data offer various solutions to overcome problems related to the origin of the material, such as degradation, fragmentation of polynucleotides, and external contamination. Recent advances made in several crop domestication studies have boosted interest in this research area. Remains of any nature are potential candidates for aDNA recovery, and almost all the analyses that can be made on fresh DNA can also be performed on aDNA. The analysis performed on aDNA can shed light on many phylogenetic questions concerning evolution, domestication, and improvement of plant species. It is a powerful instrument to reconstruct patterns of crop adaptation and migration. Information gathered can also be used in many fields of modern agriculture such as classical breeding, genome editing, pest management, and product promotion. Whilst unlocking the hidden genome of ancient crops offers great potential, the onus is now on the research community to use such information to gain new insight into agriculture.
|