1
Zhou X, Wei Z, Lu H, He J, Gao Y, Hu X, Wang C, Dong Y, Liu H. Large-Scale Molecular Dynamics Simulation Based on Heterogeneous Many-Core Architecture. J Chem Inf Model 2024; 64:851-861. [PMID: 38299978 DOI: 10.1021/acs.jcim.3c01254] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/02/2024]
Abstract
As the application of molecular dynamics (MD) simulations continues to evolve, the demand for accelerating large-scale simulation systems and handling of enormous simulation tasks is steadily increasing. We propose a parallel acceleration method for large-scale MD simulations based on Sunway heterogeneous many-core processors. This method integrates task scheduling, simulation calculations, and data storage, effectively tackling issues related to large-scale simulations and numerous simulation tasks. The task scheduling strategy flexibly handles tasks on various scales and enables parallel execution of multiple tasks. During the simulation calculations, we ported GROMACS to the Sunway architecture and accelerated the calculation of short-range forces through a heterogeneous processor. Our method achieves approximately 10-fold acceleration and 90% scalability when executing a single simulation task. When handling numerous simulation tasks, our method achieves parallel execution of all of the tasks with 90% scalability. By employing our method, we carried out 50 ns simulations on over 3000 distinct conotoxin structures individually within just 5 h. Additionally, we evaluated more than 200 protein-ligand complexes, and the simulation efficiency significantly exceeded that of midsized to small GPU clusters.
Affiliation(s)
- Xu Zhou
  - College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
- Zhiqiang Wei
  - College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
- Hao Lu
  - College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
- Jiaqi He
  - College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
- Yuan Gao
  - College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
- Xiaotong Hu
  - College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
- Cunji Wang
  - College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
- Yujie Dong
  - College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
- Hao Liu
  - College of Computer Science and Technology, Ocean University of China, Qingdao 266100, China
2
Zulfiqar M, Singh V, Steinbeck C, Sorokina M. Review on computer-assisted biosynthetic capacities elucidation to assess metabolic interactions and communication within microbial communities. Crit Rev Microbiol 2024:1-40. [PMID: 38270170 DOI: 10.1080/1040841x.2024.2306465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Accepted: 01/12/2024] [Indexed: 01/26/2024]
Abstract
Microbial communities thrive through interactions and communication, which are challenging to study as most microorganisms are not cultivable. To address this challenge, researchers focus on the extracellular space where communication events occur. Exometabolomics and interactome analysis provide insights into the molecules involved in communication and the dynamics of their interactions. Advances in sequencing technologies and computational methods enable the reconstruction of taxonomic and functional profiles of microbial communities using high-throughput multi-omics data. Network-based approaches, including community flux balance analysis, aim to model molecular interactions within and between communities. Despite these advances, challenges remain in computer-assisted biosynthetic capacities elucidation, requiring continued innovation and collaboration among diverse scientists. This review provides insights into the current state and future directions of computer-assisted biosynthetic capacities elucidation in studying microbial communities.
Affiliation(s)
- Mahnoor Zulfiqar
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University, Jena, Germany
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, Jena, Germany
- Vinay Singh
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University, Jena, Germany
- Christoph Steinbeck
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University, Jena, Germany
  - Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, Jena, Germany
- Maria Sorokina
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University, Jena, Germany
  - Data Science and Artificial Intelligence, Research and Development, Pharmaceuticals, Bayer, Berlin, Germany
3
Garzón W, Benavides L, Gaignard A, Redon R, Südholt M. A taxonomy of tools and approaches for distributed genomic analyses. Informatics in Medicine Unlocked 2022. [DOI: 10.1016/j.imu.2022.101024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022]
4
Mu T, Hu H, Ma Y, Wen H, Yang C, Feng X, Wen W, Zhang J, Gu Y. Identifying key genes in milk fat metabolism by weighted gene co-expression network analysis. Sci Rep 2022; 12:6836. [PMID: 35477736 PMCID: PMC9046402 DOI: 10.1038/s41598-022-10435-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 11/19/2021] [Accepted: 03/21/2022] [Indexed: 12/13/2022]
Abstract
Milk fat is the most important and energy-rich substance in milk, and its content and composition are important reference elements in the evaluation of milk quality. However, the current identification of valuable candidate genes affecting milk fat is limited. In this study, Illumina PE150 was used to sequence bovine mammary epithelial cells (BMECs) with high and low milk fat rates (MFP), and weighted gene co-expression network analysis (WGCNA) was used to analyze the mRNA expression profile data. A total of 10,310 genes were used to construct the WGCNA network, and the genes were classified into 18 modules. Among them, the violet (r = 0.74), yellow (r = 0.75) and darkolivegreen (r = −0.79) modules were significantly associated with MFP, and 39, 181, and 75 hub genes were identified, respectively. Combining enrichment analysis and differentially expressed genes (DEs), we screened five key candidate DEs related to lipid metabolism, namely PI4K2A, SLC16A1, ATP8A2, VEGFD and ID1. Relative to the small intestine, liver, kidney, heart, ovary and uterus, the expression of PI4K2A is highest in the mammary gland, and the gene is significantly enriched in GO terms and pathways related to milk fat metabolism, such as monocarboxylic acid transport, phospholipid transport, phosphatidylinositol signaling system, inositol phosphate metabolism and MAPK signaling pathway. This study uses WGCNA to form an overall view of MFP, providing a theoretical basis for identifying potential pathways and hub genes that may be involved in milk fat synthesis.
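The module-trait values quoted above (r = 0.74, 0.75, −0.79) are correlation coefficients. As a rough illustration only, the sketch below computes a Pearson correlation between a per-sample module summary profile and a trait. WGCNA itself summarizes a module by its eigengene; the numbers, the names `module_profile` and `milk_fat_pct`, and the helper `pearson_r` are hypothetical and not taken from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    Assumes neither sequence is constant (zero variance would divide by zero).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one module summary value and one trait value per sample.
module_profile = [0.2, 0.5, 0.1, 0.9, 0.7, 0.4]
milk_fat_pct = [3.1, 3.8, 3.0, 4.6, 4.1, 3.5]

r = pearson_r(module_profile, milk_fat_pct)
print(f"module-trait correlation: r = {r:.2f}")
```

A positive r means the module's expression tends to rise with the trait, and a negative r (as reported for the darkolivegreen module) means it tends to fall.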
Affiliation(s)
- Tong Mu
  - School of Agriculture, Ningxia University, Yinchuan, 750021, China
- Honghong Hu
  - School of Agriculture, Ningxia University, Yinchuan, 750021, China
- Yanfen Ma
  - School of Agriculture, Ningxia University, Yinchuan, 750021, China
  - Key Laboratory of Ruminant Molecular and Cellular Breeding, Ningxia Hui Autonomous Region, Ningxia University, Yinchuan, 750021, China
- Huiyu Wen
  - Maosheng Pasture of He Lanshan in Ningxia State Farm, Yinchuan, 750001, China
- Chaoyun Yang
  - School of Agriculture, Ningxia University, Yinchuan, 750021, China
- Xiaofang Feng
  - School of Agriculture, Ningxia University, Yinchuan, 750021, China
- Wan Wen
  - Animal Husbandry Extension Station, Yinchuan, 750001, China
- Juan Zhang
  - School of Agriculture, Ningxia University, Yinchuan, 750021, China
- Yaling Gu
  - School of Agriculture, Ningxia University, Yinchuan, 750021, China
5
Decap D, de Schaetzen van Brienen L, Larmuseau M, Costanza P, Herzeel C, Wuyts R, Marchal K, Fostier J. Halvade somatic: Somatic variant calling with Apache Spark. Gigascience 2022; 11:6505120. [PMID: 35022699 PMCID: PMC8756192 DOI: 10.1093/gigascience/giab094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2021] [Revised: 10/27/2021] [Accepted: 12/09/2021] [Indexed: 12/02/2022]
Abstract
Background: The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this leads to large volumes of raw sequencing data to process and hence, large computational requirements. For example, calling the somatic variants according to the GATK best practices guidelines requires days of computing time for a typical whole-genome sequencing sample.
Findings: We introduce Halvade Somatic, a framework for somatic variant calling from DNA sequencing data that takes advantage of multi-node and/or multi-core compute platforms to reduce runtime. It relies on Apache Spark to provide scalable I/O and to create and manage data streams that are processed on different CPU cores in parallel. Halvade Somatic contains all required steps to process the tumor and matched normal sample according to the GATK best practices recommendations: read alignment (BWA), sorting of reads, preprocessing steps such as marking duplicate reads and base quality score recalibration (GATK), and, finally, calling the somatic variants (Mutect2). Our approach reduces the runtime on a single 36-core node to 19.5 h compared to a runtime of 84.5 h for the original pipeline, a speedup of 4.3 times. Runtime can be further decreased by scaling to multiple nodes, e.g., we observe a runtime of 1.36 h using 16 nodes, an additional speedup of 14.4 times. Halvade Somatic supports variant calling from both whole-genome sequencing and whole-exome sequencing data and also supports Strelka2 as an alternative or complementary variant calling tool. We provide a Docker image to facilitate single-node deployment. Halvade Somatic can be executed on a variety of compute platforms, including Amazon EC2 and Google Cloud.
Conclusions: To our knowledge, Halvade Somatic is the first somatic variant calling pipeline that leverages Big Data processing platforms and provides reliable, scalable performance. Source code is freely available.
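The runtime figures in this abstract can be related with simple arithmetic. The sketch below recomputes the speedups from the rounded runtimes quoted above and adds a parallel-efficiency figure (speedup divided by node count), which is an interpretation added here and not a number reported in the abstract. Because the quoted runtimes are rounded, the multi-node ratio comes out near 14.3 and not exactly at the reported 14.4.

```python
def speedup(baseline_hours, improved_hours):
    """Ratio of baseline runtime to improved runtime."""
    return baseline_hours / improved_hours


# Runtimes quoted in the abstract (hours).
original_h = 84.5        # original pipeline
single_node_h = 19.5     # Halvade Somatic, one 36-core node
sixteen_nodes_h = 1.36   # Halvade Somatic, 16 nodes
nodes = 16

single_node_speedup = speedup(original_h, single_node_h)
multi_node_speedup = speedup(single_node_h, sixteen_nodes_h)
overall_speedup = speedup(original_h, sixteen_nodes_h)

# Fraction of ideal linear scaling achieved when going from 1 to 16 nodes.
parallel_efficiency = multi_node_speedup / nodes

print(f"single node vs original: {single_node_speedup:.1f}x")
print(f"16 nodes vs single node: {multi_node_speedup:.1f}x")
print(f"16 nodes vs original:    {overall_speedup:.1f}x")
print(f"parallel efficiency:     {parallel_efficiency:.0%}")
```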
Affiliation(s)
- Dries Decap
  - IDLab, Ghent University - imec, Technologiepark 126, B-9052 Ghent, Belgium
- Maarten Larmuseau
  - IDLab, Ghent University - imec, Technologiepark 126, B-9052 Ghent, Belgium
- Roel Wuyts
  - imec, Kapeldreef 75, B-3001 Leuven, Belgium
- Kathleen Marchal
  - IDLab, Ghent University - imec, Technologiepark 126, B-9052 Ghent, Belgium
- Jan Fostier
  - IDLab, Ghent University - imec, Technologiepark 126, B-9052 Ghent, Belgium
6
Bauer DC, Wilson LOW, Twine NA. Artificial Intelligence in Medicine: Applications, Limitations and Future Directions. Artif Intell Med 2022. [DOI: 10.1007/978-981-19-1223-8_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022]
7
Clustering and Classification Based on Distributed Automatic Feature Engineering for Customer Segmentation. Symmetry (Basel) 2021. [DOI: 10.3390/sym13091557] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/16/2022]
Abstract
To beat competition and obtain valuable information, decision-makers must conduct in-depth machine learning or data mining for data analytics. Traditionally, clustering and classification are two common methods used in machine mining. For clustering, data are divided into various groups according to their similarity or common features. Classification, on the other hand, refers to building a model from given training data, with which the target class or label is predicted for the test data. In recent years, many researchers have focused on hybrids of clustering and classification. These techniques have achieved admirable results, but there is still room to improve performance, for example through distributed processing. Therefore, we propose clustering and classification based on distributed automatic feature engineering (AFE) for customer segmentation in this paper. In the proposed algorithm, AFE uses artificial bee colony (ABC) to select valuable features of the input data, and then RFM provides the basic data analytics. AFE first initializes the number of clusters k. The clustering methods of k-means, Wald method, and fuzzy c-means (FCM) are then applied to cluster the examples into different groups. Finally, the classification method of an improved fuzzy decision tree classifies the target data and generates decision rules that explain the detailed situations. AFE also determines the value of the split number in the improved fuzzy decision tree to increase classification accuracy. The proposed clustering and classification based on automatic feature engineering is distributed and runs on the Apache Spark platform. The topic of this paper is solving the problem of clustering and classification for machine learning. The results show that the corresponding classification accuracy outperforms other approaches. Moreover, we also provide useful strategies and decision rules from data analytics for decision-makers.
8
Koppad S, B A, Gkoutos GV, Acharjee A. Cloud Computing Enabled Big Multi-Omics Data Analytics. Bioinform Biol Insights 2021; 15:11779322211035921. [PMID: 34376975 PMCID: PMC8323418 DOI: 10.1177/11779322211035921] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 05/01/2021] [Accepted: 07/12/2021] [Indexed: 12/27/2022]
Abstract
High-throughput experiments enable researchers to explore complex multifactorial diseases through large-scale analysis of omics data. Challenges for such high-dimensional data sets include storage, analyses, and sharing. Recent innovations in computational technologies and approaches, especially in cloud computing, offer a promising, low-cost, and highly flexible solution in the bioinformatics domain. Cloud computing is rapidly proving increasingly useful in molecular modeling, omics data analytics (eg, RNA sequencing, metabolomics, or proteomics data sets), and for the integration, analysis, and interpretation of phenotypic data. We review the adoption of advanced cloud-based and big data technologies for processing and analyzing omics data and provide insights into state-of-the-art cloud bioinformatics applications.
Affiliation(s)
- Saraswati Koppad
  - Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India
- Annappa B
  - Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India
- Georgios V Gkoutos
  - Institute of Cancer and Genomic Sciences and Centre for Computational Biology, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
  - Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
  - NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospitals Birmingham, Birmingham, UK
  - MRC Health Data Research UK (HDR UK), London, UK
  - NIHR Experimental Cancer Medicine Centre, Birmingham, UK
  - NIHR Biomedical Research Centre, University Hospitals Birmingham, Birmingham, UK
- Animesh Acharjee
  - Institute of Cancer and Genomic Sciences and Centre for Computational Biology, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
  - Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
  - NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospitals Birmingham, Birmingham, UK
9
Parallel Delay Multiply and Sum Algorithm for Microwave Medical Imaging Using Spark Big Data Framework. Algorithms 2021. [DOI: 10.3390/a14050157] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 02/02/2023]
Abstract
Microwave imaging systems are currently being investigated for breast cancer, brain stroke and neurodegenerative disease detection due to their low cost, portable and wearable nature. At present, commonly used radar-based algorithms for microwave imaging are based on the delay and sum algorithm. These algorithms use ultra-wideband signals to reconstruct a 2D image of the targeted object or region. Delay multiply and sum is an extended version of the delay and sum algorithm. However, it is computationally expensive and time-consuming. In this paper, the delay multiply and sum algorithm is parallelised using a big data framework. The algorithm uses the Spark MapReduce programming model to improve its efficiency. The most computational part of the algorithm is pixel value calculation, where signals need to be multiplied in pairs and summed. The proposed algorithm broadcasts the input data and executes it in parallel in a distributed manner. The Spark-based parallel algorithm is compared with sequential and Python multiprocessing library implementation. The experimental results on both a standalone machine and a high-performance cluster show that Spark significantly accelerates the image reconstruction process without affecting its accuracy.
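The pixel value calculation described above, in which delayed signals are multiplied in pairs and summed, can be sketched in a few lines. This is a minimal single-pixel illustration under stated assumptions, not the paper's implementation: the function name `dmas_pixel`, the integer sample delays, and the simple summation over a short window are all hypothetical choices, and published variants may differ in windowing or in further processing of the summed products.

```python
from itertools import combinations


def dmas_pixel(signals, delays, window=1):
    """Delay-multiply-and-sum value for one pixel.

    signals: list of per-antenna sample lists.
    delays:  non-negative integer sample delay per antenna for this pixel
             (assumed precomputed from geometry; hypothetical here).
    window:  number of samples taken after each delay.
    """
    # Delay step: pick the samples that correspond to this pixel.
    aligned = [sig[d:d + window] for sig, d in zip(signals, delays)]
    if any(len(a) != window for a in aligned):
        raise ValueError("delay + window exceeds signal length")

    value = 0.0
    for t in range(window):
        samples = [a[t] for a in aligned]
        # Multiply every unordered pair of channels and sum the products.
        value += sum(x * y for x, y in combinations(samples, 2))
    return value


signals = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(dmas_pixel(signals, delays=[0, 1, 2]))  # 1*5 + 1*9 + 5*9 = 59.0
```

Each pixel depends only on the shared input signals and its own delays, so pixels can be computed independently. That independence is what makes a broadcast-then-map style of parallelization, as the abstract describes for Spark, a natural fit.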
10
Adil A, Kumar V, Jan AT, Asger M. Single-Cell Transcriptomics: Current Methods and Challenges in Data Acquisition and Analysis. Front Neurosci 2021; 15:591122. [PMID: 33967674 PMCID: PMC8100238 DOI: 10.3389/fnins.2021.591122] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Received: 08/03/2020] [Accepted: 03/19/2021] [Indexed: 11/17/2022]
Abstract
Rapid cost drops and advancements in next-generation sequencing have made profiling of cells at individual level a conventional practice in scientific laboratories worldwide. Single-cell transcriptomics [single-cell RNA sequencing (SC-RNA-seq)] has an immense potential of uncovering the novel basis of human life. The well-known heterogeneity of cells at the individual level can be better studied by single-cell transcriptomics. Proper downstream analysis of this data will provide new insights into the scientific communities. However, due to low starting materials, the SC-RNA-seq data face various computational challenges: normalization, differential gene expression analysis, dimensionality reduction, etc. Additionally, new methods like 10× Chromium can profile millions of cells in parallel, which creates a considerable amount of data. Thus, single-cell data handling is another big challenge. This paper reviews the single-cell sequencing methods, library preparation, and data generation. We highlight some of the main computational challenges that require to be addressed by introducing new bioinformatics algorithms and tools for analysis. We also show single-cell transcriptomics data as a big data problem.
Affiliation(s)
- Asif Adil
  - Department of Computer Sciences, Baba Ghulam Shah Badshah University, Rajouri, India
- Vijay Kumar
  - Department of Biotechnology, Yeungnam University, Gyeongsan, South Korea
- Arif Tasleem Jan
  - School of Biosciences and Biotechnology, Baba Ghulam Shah Badshah University, Rajouri, India
- Mohammed Asger
  - Department of Computer Sciences, Baba Ghulam Shah Badshah University, Rajouri, India
11
Marcon Y, Bishop T, Avraam D, Escriba-Montagut X, Ryser-Welch P, Wheater S, Burton P, González JR. Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD. PLoS Comput Biol 2021; 17:e1008880. [PMID: 33784300 PMCID: PMC8034722 DOI: 10.1371/journal.pcbi.1008880] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 07/23/2020] [Revised: 04/09/2021] [Accepted: 03/17/2021] [Indexed: 01/31/2023]
Abstract
Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a collection of R packages that reduce the challenges of these methods. These include ethico-legal constraints which limit researchers' ability to physically bring data together and the analytical inflexibility associated with conventional approaches to sharing results. The key feature of DataSHIELD is that data from research studies stay on a server at each of the institutions that are responsible for the data. Each institution has control over who can access their data. The platform allows an analyst to pass commands to each server and the analyst receives results that do not disclose the individual-level data of any study participants. DataSHIELD uses Opal which is a data integration system used by epidemiological studies and developed by the OBiBa open source project in the domain of bioinformatics. However, until now the analysis of big data with DataSHIELD has been limited by the storage formats available in Opal and the analysis capabilities available in the DataSHIELD R packages. We present a new architecture ("resources") for DataSHIELD and Opal to allow large, complex datasets to be used at their original location, in their original format and with external computing facilities. We provide some real big data analysis examples in genomics and geospatial projects. For genomic data analyses, we also illustrate how to extend the resources concept to address specific big data infrastructures such as GA4GH or EGA, and make use of shell commands. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects. To help researchers use this framework, we describe selected packages and present an online book (https://isglobal-brge.github.io/resource_bookdown).
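The general idea that data stay at each institution while the analyst receives only non-disclosive results can be illustrated with a pooled mean. The sketch below is plain Python and is not the DataSHIELD or Opal API: `local_summary`, `pooled_mean`, the `min_count` threshold, and the site data are all hypothetical, chosen only to show each site returning aggregates in place of individual records.

```python
def local_summary(values, min_count=5):
    """Runs at a data-holding site and returns only aggregate statistics.

    min_count is a hypothetical disclosure threshold: the site refuses to
    release an aggregate computed from too few records.
    """
    if len(values) < min_count:
        raise ValueError("too few records to release an aggregate")
    return {"n": len(values), "total": sum(values)}


def pooled_mean(summaries):
    """Runs at the analyst's side and sees only per-site counts and totals."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / n


# Hypothetical individual-level data that never leave their sites.
site_a = [5.1, 4.9, 5.6, 5.0, 5.4]
site_b = [6.2, 5.8, 6.0, 6.1, 5.9, 6.3]

mean_all = pooled_mean([local_summary(site_a), local_summary(site_b)])
print(f"pooled mean across sites: {mean_all:.3f}")
```

The pooled result equals the mean that would be obtained by physically combining the records, yet the analyst only ever handles two numbers per site.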
Affiliation(s)
- Tom Bishop
  - MRC Epidemiology Unit, University of Cambridge, Cambridge, United Kingdom
- Demetris Avraam
  - Population Health Sciences Institute, Newcastle University, Newcastle, United Kingdom
- Xavier Escriba-Montagut
  - Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Patricia Ryser-Welch
  - Population Health Sciences Institute, Newcastle University, Newcastle, United Kingdom
- Paul Burton
  - Population Health Sciences Institute, Newcastle University, Newcastle, United Kingdom
- Juan R. González
  - Barcelona Institute for Global Health (ISGlobal), Barcelona, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - Centro de Investigación Biomédica en Red en Epidemiología y Salud Pública (CIBERESP), Barcelona, Spain
  - Dept. of Mathematics, Universitat Autònoma de Barcelona (UAB), Bellaterra (Barcelona), Spain
12
Lactation Associated Genes Revealed in Holstein Dairy Cows by Weighted Gene Co-Expression Network Analysis (WGCNA). Animals (Basel) 2021; 11:ani11020314. [PMID: 33513831 PMCID: PMC7911360 DOI: 10.3390/ani11020314] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/04/2021] [Accepted: 01/23/2021] [Indexed: 02/07/2023]
Abstract
Simple Summary: Weighted gene coexpression network analysis (WGCNA) is a novel approach that can quickly analyze the relationships between genes and traits. In the past few years, studies on the gene expression changes of dairy cow mammary glands were only based on transcriptome comparisons between two lactation stages. Few studies focused on the relationships between gene expression of the dairy mammary gland and lactation stage or milk composition in a lactation cycle. In this study, we detected milk yield and composition in a lactation cycle. For the first time, we constructed a gene coexpression network using WGCNA on the basis of 18 gene expression profiles during six stages of a lactation cycle by transcriptome sequencing, generating 10 specific modules. Genes in each module were performed with gene ontology (GO) annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. Module–trait relationship analysis showed a series of potential candidates related to milk yield and composition. The current study provides an important theoretical basis for the further molecular breeding of dairy cows.
Abstract: Weighted gene coexpression network analysis (WGCNA) is a novel approach that can quickly analyze the relationships between genes and traits. In this study, the milk yield, lactose, fat, and protein of Holstein dairy cows were detected in a lactation cycle. Meanwhile, a total of 18 gene expression profiles were detected using mammary glands from six lactation stages (day 7 to calving, −7 d; day 30 post-calving, 30 d; day 90 post-calving, 90 d; day 180 post-calving, 180 d; day 270 post-calving, 270 d; day 315 post-calving, 315 d). On the basis of the 18 profiles, WGCNA identified for the first time 10 significant modules that may be related to lactation stage, milk yield, and the main milk composition content. Genes in the 10 significant modules were examined with gene ontology (GO) annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. The results revealed that the galactose metabolism pathway was a potential candidate for milk yield and milk lactose synthesis. In −7 d, ion transportation was more frequent and cell proliferation related terms became active. In late lactation, the suppressor of cytokine signaling 3 (SOCS3) might play a role in apoptosis. The sphingolipid signaling pathway was a potential candidate for milk fat synthesis. Dairy cows at 315 d were in a period of cell proliferation. Another notable phenomenon was that nonlactating dairy cows had a more regular circadian rhythm after a cycle of lactation. The results provide an important theoretical basis for the further molecular breeding of dairy cows.
13
Kounelis F, Kanavos A, Mylonas P. Improving the Run-Time of Space-Efficient n-Gram Data Structures Using Apache Spark. Advances in Experimental Medicine and Biology 2021; 1338:165-173. [DOI: 10.1007/978-3-030-78775-2_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
14
Abuín JM, Lopes N, Ferreira L, Pena TF, Schmidt B. Big Data in metagenomics: Apache Spark vs MPI. PLoS One 2020; 15:e0239741. [PMID: 33022000 PMCID: PMC7537910 DOI: 10.1371/journal.pone.0239741] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/06/2020] [Accepted: 09/14/2020] [Indexed: 11/23/2022]
Abstract
The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem while ignoring low-level details such as data distribution schemes or communication patterns among processing nodes. However, performance and scalability are also of high importance when dealing with increasing problem sizes, which makes High Performance Computing (HPC) technologies such as the message passing interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark-based software for detection and quantification of species composition in food samples, has been proposed. This tool can be used to analyze high-throughput sequencing data sets of metagenomic DNA and can handle large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory-efficient solution for computing clusters which is based on MPI instead of Apache Spark. To evaluate its performance, we compare the original single-CPU version of MetaCache, the Spark version, and the MPI version we are introducing. Results show that for 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark for building a metagenomics database. For querying this database, also with 32 processes, the MPI version is 3.11× faster, while using 55.56% of the memory used by Spark. We conclude that the new MetaCache-MPI version is faster in both building and querying the database and uses less RAM than MetaCacheSpark, while keeping the accuracy of the original implementation.
Affiliation(s)
- José M. Abuín
  - 2Ai—School of Technology, IPCA, Barcelos, Portugal
  - CiTIUS, Universidade de Santiago de Compostela, Santiago de Compostela, Spain
- Nuno Lopes
  - 2Ai—School of Technology, IPCA, Barcelos, Portugal
- Tomás F. Pena
  - CiTIUS, Universidade de Santiago de Compostela, Santiago de Compostela, Spain
- Bertil Schmidt
  - Department of Computer Science, Johannes Gutenberg University, Mainz, Germany
15
Chen W, Yao C, Guo Y, Wang Y, Xue Z. pmTM-align: scalable pairwise and multiple structure alignment with Apache Spark and OpenMP. BMC Bioinformatics 2020; 21:426. [PMID: 32993484 PMCID: PMC7526426 DOI: 10.1186/s12859-020-03757-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/27/2019] [Accepted: 09/16/2020] [Indexed: 12/18/2022]
Abstract
BACKGROUND Structure comparison can provide useful information to identify functional and evolutionary relationship between proteins. With the dramatic increase of protein structure data in the Protein Data Bank, computation time quickly becomes the bottleneck for large scale structure comparisons. To more efficiently deal with informative multiple structure alignment tasks, we propose pmTM-align, a parallel protein structure alignment approach based on mTM-align/TM-align. pmTM-align contains two stages to handle pairwise structure alignments with Spark and the phylogenetic tree-based multiple structure alignment task on a single computer with OpenMP. RESULTS Experiments with the SABmark dataset showed that parallelization along with data structure optimization provided considerable speedup for mTM-align. The Spark-based structure alignments achieved near ideal scalability with large datasets, and the OpenMP-based construction of the phylogenetic tree accelerated the incremental alignment of multiple structures and metrics computation by a factor of about 2-5. CONCLUSIONS pmTM-align enables scalable pairwise and multiple structure alignment computing and offers more timely responses for medium to large-sized input data than existing alignment tools such as mTM-align.
Affiliation(s)
- Weiya Chen
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China
| | - Chun Yao
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China
| | - Yingzhong Guo
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China
| | - Yan Wang
- School of Life Science, Huazhong University of Science and Technology, Wuhan, China
| | - Zhidong Xue
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China
| |
|
16
|
Abstract
Motivation: Bacterial metagenomics profiling for metagenomic whole genome sequencing (mWGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes, and do not scale very well to larger reference genome collections. However, large reference genome collections are capable of providing a more complete and accurate profile of the bacterial population in a metagenomics dataset. In this paper, we discuss a scalable, efficient and affordable approach to this problem, bringing big data solutions within the reach of laboratories with modest resources. Results: We developed Flint, a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark’s built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43 552 bacterial genomes from Ensembl. Flint runs on Amazon’s Elastic MapReduce service, and is able to profile 1 million Illumina paired-end reads against over 40 K genomes on 64 machines in 67 s, an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the necessity of storing large quantities of intermediate alignments. Availability and implementation: Flint is open source software, available under the MIT License (MIT). Source code is available at https://github.com/camilo-v/flint. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Camilo Valdes
- Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA
| | - Vitalii Stebliankin
- Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA
| | - Giri Narasimhan
- Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, Miami, FL, USA; Biomolecular Sciences Institute, Florida International University, Miami, FL, USA
| |
|
17
|
Capuccini M, Dahlö M, Toor S, Spjuth O. MaRe: Processing Big Data with application containers on Apache Spark. Gigascience 2020; 9:giaa042. [PMID: 32369166 PMCID: PMC7199472 DOI: 10.1093/gigascience/giaa042] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/09/2019] [Revised: 02/10/2020] [Accepted: 04/07/2020] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND: Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has been proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks do not have native support for application containers, which are becoming popular in scientific data processing. RESULTS: Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are the MapReduce framework and container engine that have collected the largest open source community; thus, MaRe provides interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on 2 data-intensive applications in life science, showing ease of use and scalability. CONCLUSIONS: MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. When compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.
Affiliation(s)
- Marco Capuccini
- Department of Information Technology, Uppsala University, Box 337, 75105, Uppsala, Sweden
- Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 751 24, Uppsala, Sweden
| | - Martin Dahlö
- Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 751 24, Uppsala, Sweden
- Science for Life Laboratory, Uppsala University, Box 591, 751 24, Uppsala, Sweden
- Uppsala Multidisciplinary Center for Advanced Computational Science, Uppsala University, Box 337, 75105, Uppsala, Sweden
| | - Salman Toor
- Department of Information Technology, Uppsala University, Box 337, 75105, Uppsala, Sweden
| | - Ola Spjuth
- Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 751 24, Uppsala, Sweden
| |
|
18
|
Lee CY, Chattopadhyay A, Chiang LM, Juang JMJ, Lai LC, Tsai MH, Lu TP, Chuang EY. VariED: the first integrated database of gene annotation and expression profiles for variants related to human diseases. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020; 2019:5533239. [PMID: 31317185 PMCID: PMC6637258 DOI: 10.1093/database/baz075] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/20/2018] [Revised: 05/15/2019] [Accepted: 05/17/2019] [Indexed: 12/18/2022]
Abstract
Integrated analysis of DNA variants and gene expression profiles may facilitate precise identification of gene regulatory networks involved in disease mechanisms. Despite the widespread availability of public resources, we lack databases that are capable of simultaneously providing gene expression profiles, variant annotations, functional prediction scores and pathogenic analyses. VariED is the first web-based querying system that integrates an annotation database and expression profiles for genetic variants. The database offers a user-friendly platform and locates gene/variant names in the literature by connecting to established online querying tools, biological annotation tools and records from free-text literature. VariED acts as a central hub for organized genome information consisting of gene annotation, variant allele frequency, functional prediction, clinical interpretation and gene expression profiles in three species: human, mouse and zebrafish. VariED also provides a novel scoring scheme to predict the functional impact of a DNA variant. With one single entry, all results regarding queried DNA variants can be downloaded. VariED can potentially serve as an efficient way to obtain comprehensive variant knowledge for clinicians and scientists around the world working on important drug discoveries and precision treatments.
Affiliation(s)
- Chien-Yueh Lee
- Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
| | - Amrita Chattopadhyay
- Bioinformatics and Biostatistics Core, Center of Genomic Medicine, National Taiwan University, Taipei, Taiwan
| | - Li-Mei Chiang
- Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
| | - Jyh-Ming Jimmy Juang
- Cardiovascular Center and Division of Cardiology, Department of Internal Medicine, National Taiwan University Hospital, Taipei, Taiwan; College of Medicine, National Taiwan University, Taipei, Taiwan
| | - Liang-Chuan Lai
- Graduate Institute of Physiology, National Taiwan University, Taipei, Taiwan
| | - Mong-Hsun Tsai
- Bioinformatics and Biostatistics Core, Center of Genomic Medicine, National Taiwan University, Taipei, Taiwan; Institute of Biotechnology, National Taiwan University, Taipei, Taiwan; Center for Biotechnology, National Taiwan University, Taipei, Taiwan
| | - Tzu-Pin Lu
- Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan; Department of Surgery, National Taiwan University Hospital, Taipei, Taiwan
| | - Eric Y Chuang
- Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan; Bioinformatics and Biostatistics Core, Center of Genomic Medicine, National Taiwan University, Taipei, Taiwan
| |
|
19
|
Shi L, Wang Z. Computational Strategies for Scalable Genomics Analysis. Genes (Basel) 2019; 10:E1017. [PMID: 31817630 PMCID: PMC6947637 DOI: 10.3390/genes10121017] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 10/10/2019] [Revised: 12/01/2019] [Accepted: 12/03/2019] [Indexed: 12/14/2022] Open
Abstract
The revolution in next-generation DNA sequencing technologies is leading to explosive data growth in genomics, posing a significant challenge to the computing infrastructure and software algorithms for genomics analysis. Various big data technologies have been explored to scale up/out current bioinformatics solutions to mine the big genomics data. In this review, we survey some of these exciting developments in the applications of parallel distributed computing and special hardware to genomics. We comment on the pros and cons of each strategy in the context of ease of development, robustness, scalability, and efficiency. Although this review is written for an audience from the genomics and bioinformatics fields, it may also be informative for the audience of computer science with interests in genomics applications.
Affiliation(s)
- Lizhen Shi
- Department of Computer Science, Florida State University, Tallahassee, FL 32304, USA
| | - Zhong Wang
- US Department of Energy, Joint Genome Institute, Walnut Creek, CA 94598, USA
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- School of Natural Sciences, University of California at Merced, Merced, CA 95343, USA
| |
|
20
|
Dirmeier S, Emmenlauer M, Dehio C, Beerenwinkel N. PyBDA: a command line tool for automated analysis of big biological data sets. BMC Bioinformatics 2019; 20:564. [PMID: 31718539 PMCID: PMC6849186 DOI: 10.1186/s12859-019-3087-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/29/2019] [Accepted: 09/09/2019] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND: Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to the lack of accessible tools that scale to hundreds of millions of data points. RESULTS: We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake in order to automatically schedule jobs to a high-performance computing cluster. We demonstrate the utility of the software by analyzing image-based RNA interference data of 150 million single cells. CONCLUSION: PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely with simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.
Affiliation(s)
- Simon Dirmeier
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Mario Emmenlauer
- Biozentrum, University of Basel, Basel, Switzerland
- BioDataAnalysis GmbH, Munich, 81669 Germany
| | | | - Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| |
|
21
|
Nanni L, Pinoli P, Canakoglu A, Ceri S. PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets. BMC Bioinformatics 2019; 20:560. [PMID: 31703553 PMCID: PMC6842186 DOI: 10.1186/s12859-019-3159-9] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 06/13/2019] [Accepted: 10/14/2019] [Indexed: 11/26/2022] Open
Abstract
Background: With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged in performing efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation. Results: We present PyGMQL, a novel software for the manipulation of region-based genomic files and their relative metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. PyGMQL provides data interoperability, distribution transparency and query outsourcing. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine. Conclusions: PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.
Affiliation(s)
- Luca Nanni
- Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
| | - Pietro Pinoli
- Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
| | - Arif Canakoglu
- Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
| | - Stefano Ceri
- Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
| |
|
22
|
Linderman MD, Chia D, Wallace F, Nothaft FA. DECA: scalable XHMM exome copy-number variant calling with ADAM and Apache Spark. BMC Bioinformatics 2019; 20:493. [PMID: 31604420 PMCID: PMC6787990 DOI: 10.1186/s12859-019-3108-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/24/2019] [Accepted: 09/20/2019] [Indexed: 11/16/2022] Open
Abstract
Background: XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results. Results: DECA is a horizontally scalable implementation of the XHMM algorithm using the ADAM framework and Apache Spark that incorporates novel algorithmic optimizations to eliminate unneeded computation. DECA parallelizes XHMM on both multi-core shared memory computers and large shared-nothing Spark clusters. We performed CNV discovery from the read-depth matrix in 2535 exomes in 9.3 min on a 16-core workstation (35.3× speedup vs. XHMM), 12.7 min using 10 executor cores on a Spark cluster (18.8× speedup vs. XHMM), and 9.8 min using 32 executor cores on Amazon AWS’ Elastic MapReduce. We performed CNV discovery from the original BAM files in 292 min using 640 executor cores on a Spark cluster. Conclusions: We describe DECA’s performance, our algorithmic and implementation enhancements to XHMM to obtain that performance, and our lessons learned porting a complex genome analysis application to ADAM and Spark. ADAM and Apache Spark are a performant and productive platform for implementing large-scale genome analyses, but efficiently utilizing large clusters can require algorithmic optimizations and careful attention to Spark’s configuration parameters.
Affiliation(s)
- Michael D Linderman
- Department of Computer Science, Middlebury College, 75 Shannon St, Middlebury, VT, 05753, USA
| | - Davin Chia
- Department of Computer Science, Middlebury College, 75 Shannon St, Middlebury, VT, 05753, USA
| | - Forrest Wallace
- Department of Computer Science, Middlebury College, 75 Shannon St, Middlebury, VT, 05753, USA
| | - Frank A Nothaft
- AMPLab, University of California, Berkeley, Berkeley, CA, USA; Databricks, Inc., San Francisco, CA, USA
| |
|
23
|
Wong YKE, Lam KW, Ho KY, Yu CSA, Cho CSW, Tsang HF, Chu MKM, Ng PWL, Tai CSW, Chan LWC, Wong EYL, Wong SCC. The applications of big data in molecular diagnostics. Expert Rev Mol Diagn 2019; 19:905-917. [PMID: 31422710 DOI: 10.1080/14737159.2019.1657834] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 12/30/2022]
Affiliation(s)
- Yin Kwan Evelyn Wong
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | - Ka Wai Lam
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | - Ka Yi Ho
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | | | - Chi Shing William Cho
- Department of Clinical Oncology, Queen Elizabeth Hospital, Hong Kong Special Administrative Region
| | - Hin Fung Tsang
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | - Man Kee Maggie Chu
- Department of Life Science, The Hong Kong University of Science and Technology, Hong Kong Special Administrative Region
| | - Po Wah Lawrence Ng
- Department of Pathology, Queen Elizabeth Hospital, Hong Kong Special Administrative Region
| | - Chi Shing William Tai
- Department of Applied Biology and Chemical Technology, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | - Lawrence Wing Chi Chan
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | - Elaine Yue Ling Wong
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| | - Sze Chuen Cesar Wong
- Department of Health Technology and Informatics, Hong Kong Polytechnic University, Hong Kong Special Administrative Region
| |
|
24
|
Xuan P, Sun C, Zhang T, Ye Y, Shen T, Dong Y. Gradient Boosting Decision Tree-Based Method for Predicting Interactions Between Target Genes and Drugs. Front Genet 2019; 10:459. [PMID: 31214240 PMCID: PMC6555260 DOI: 10.3389/fgene.2019.00459] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Received: 02/12/2019] [Accepted: 04/30/2019] [Indexed: 02/01/2023] Open
Abstract
Determining the target genes that interact with drugs—drug–target interactions—plays an important role in drug discovery. Identification of drug–target interactions through biological experiments is time consuming, laborious, and costly. Therefore, using computational approaches to predict candidate targets is a good way to reduce the cost of wet-lab experiments. However, the known interactions (positive samples) and the unknown interactions (negative samples) display a serious class imbalance, which has an adverse effect on the accuracy of the prediction results. To mitigate the impact of class imbalance and completely exploit the negative samples, we proposed a new method, named DTIGBDT, based on gradient boosting decision trees, for predicting candidate drug–target interactions. We constructed a drug–target heterogeneous network that contains the drug similarities based on the chemical structures of drugs, the target similarities based on target sequences, and the known drug–target interactions. The topological information of the network was captured by random walks to update the similarities between drugs or targets. The paths between drugs and targets could be divided into multiple categories, and the features of each category of paths were extracted. We constructed a prediction model based on gradient boosting decision trees. The model establishes multiple decision trees with the extracted features and obtains the interaction scores between drugs and targets. DTIGBDT is a method of ensemble learning, and it effectively reduces the impact of class imbalance. The experimental results indicate that DTIGBDT outperforms several state-of-the-art methods for drug–target interaction prediction. In addition, case studies on Quetiapine, Clozapine, Olanzapine, Aripiprazole, and Ziprasidone demonstrate the ability of DTIGBDT to discover potential drug–target interactions.
Affiliation(s)
- Ping Xuan
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| | - Chang Sun
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| | - Tiangang Zhang
- School of Mathematical Science, Heilongjiang University, Harbin, China
| | - Yilin Ye
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| | - Tonghui Shen
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| | - Yihua Dong
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| |
|
25
|
Guo M, Xu E, Ai D. Inferring Bacterial Infiltration in Primary Colorectal Tumors From Host Whole Genome Sequencing Data. Front Genet 2019; 10:213. [PMID: 30930939 PMCID: PMC6428740 DOI: 10.3389/fgene.2019.00213] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 01/19/2019] [Accepted: 02/27/2019] [Indexed: 12/13/2022] Open
Abstract
Colorectal cancer is the third most common cancer worldwide with abysmal survival, thus requiring novel therapy strategies. Numerous studies have frequently observed infiltrating bacteria within the primary tumor tissues derived from patients. These studies have implicated the relative abundance of these bacteria as a contributing factor in tumor progression. Infiltrating bacteria are believed to be among the major drivers of tumorigenesis, progression, and metastasis and, hence, promising targets for new treatments. However, measuring their abundance directly remains challenging. One potential approach is to use the unmapped reads of host whole genome sequencing (hWGS) data, which previous studies have considered as contaminants and discarded. Here, we developed rigorous bioinformatics and statistical procedures to identify tumor-infiltrating bacteria associated with colorectal cancer from such whole genome sequencing data. Our approach used the reads of whole genome sequencing data of colon adenocarcinoma tissues not mapped to the human reference genome, including unmapped paired-end read pairs and single-end reads, the mates of which were mapped. We assembled the unmapped read pairs, remapped all those reads to the collection of human microbiome references, and then computed the relative abundance of microbes by maximum likelihood (ML) estimation. We analyzed and compared the relative abundance and diversity of infiltrating bacteria between primary tumor tissues and associated normal blood samples. Our results showed that primary tumor tissues contained far more diverse total infiltrating bacteria than normal blood samples. The relative abundance of Bacteroides fragilis, Bacteroides dorei, and Fusobacterium nucleatum was significantly higher in primary colorectal tumors. These three bacteria were among the top ten microbes in the primary tumor tissues, yet were rarely found in normal blood samples.
As a validation step, most of these bacteria were also closely associated with colorectal cancer in previous studies with alternative approaches. In summary, our approach provides a new analytic technique for investigating the infiltrating bacterial community within tumor tissues. Our novel cloud-based bioinformatics and statistical pipelines to analyze the infiltrating bacteria in colorectal tumors using the unmapped reads of whole genome sequences can be freely accessed from GitHub at https://github.com/gutmicrobes/UMIB.git.
Affiliation(s)
- Man Guo
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, China
| | - Er Xu
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, China
| | - Dongmei Ai
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, China
- Basic Experimental of Natural Science, University of Science and Technology Beijing, Beijing, China
| |
|
26
|
Guo M, Zou Q. Perspectives of Bioinformatics in Big Data Era. Curr Genomics 2019; 20:79-80. [PMID: 31555058 PMCID: PMC6728898] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022] Open
Affiliation(s)
| | - Quan Zou
- Address correspondence to this author at the Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China; E-mail:
| |
|
27
|
Choi I, Ponsero AJ, Bomhoff M, Youens-Clark K, Hartman JH, Hurwitz BL. Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons. Gigascience 2019; 8:5266304. [PMID: 30597002 PMCID: PMC6354030 DOI: 10.1093/gigascience/giy165] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 08/24/2018] [Accepted: 12/17/2018] [Indexed: 11/23/2022] Open
Abstract
Background: Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content. Results: We developed a tool called Libra that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra uses a scalable Hadoop framework for massive metagenome comparisons, Cosine Similarity for calculating the distance using sequence composition and abundance while normalizing for sequencing depth, and a web-based implementation in iMicrobe (http://imicrobe.us) that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community. Conclusions: A comparison of Libra to equivalent tools using both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, reveals that methods commonly implemented to reduce compute time for large datasets, such as data reduction, read count normalization, and presence/absence distance metrics, greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to any size dataset to enable global-scale analyses and link microbial signatures to biological processes.
Affiliation(s)
- Illyoung Choi
- Department of Computer Science, University of Arizona, 1040 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Alise J Ponsero
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Matthew Bomhoff
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Ken Youens-Clark
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA
| | - John H Hartman
- Department of Computer Science, University of Arizona, 1040 E. 4th Street, Tucson, Arizona, 85721, USA
| | - Bonnie L Hurwitz
- Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA; BIO5 Institute, University of Arizona, 1657 E. Helen Street, Tucson, Arizona, 85719, USA
| |
|
28
|
Alnasir JJ, Shanahan HP. The application of Hadoop in structural bioinformatics. Brief Bioinform 2018; 21:96-105. [PMID: 30462158 DOI: 10.1093/bib/bby106] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 07/27/2018] [Revised: 09/20/2018] [Accepted: 10/05/2018] [Indexed: 11/13/2022] Open
Abstract
The paper reviews the use of the Hadoop platform in structural bioinformatics applications. For structural bioinformatics, Hadoop provides a new framework to analyse large fractions of the Protein Data Bank, which is key for high-throughput studies of, for example, protein-ligand docking, clustering of protein-ligand complexes and structural alignment. Specifically, we review a number of Hadoop implementations of high-throughput analyses reported in the literature, and their scalability. We find that these deployments for the most part use known executables called from MapReduce rather than rewriting the algorithms. The scalability exhibits variable behaviour in comparison with other batch schedulers, particularly as direct comparisons on the same platform are generally not available. Direct comparisons of Hadoop with batch schedulers are absent in the literature, but we note there is some evidence that Message Passing Interface implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of the interface and of configuring a resource to use Hadoop. This will improve over time as interfaces to Hadoop (e.g. Spark) improve, usage of cloud platforms (e.g. Azure and Amazon Web Services (AWS)) increases and standardised approaches such as Workflow Languages (i.e. Workflow Definition Language, Common Workflow Language and Nextflow) are taken up.
Affiliation(s)
- Jamie J Alnasir
- Institute of Cancer Research, Old Brompton Road, London, United Kingdom
| | - Hugh P Shanahan
- Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
| |
|