1
Herazo-Álvarez J, Mora M, Cuadros-Orellana S, Vilches-Ponce K, Hernández-García R. A review of neural networks for metagenomic binning. Brief Bioinform 2025; 26:bbaf065. [PMID: 40131312 PMCID: PMC11934572 DOI: 10.1093/bib/bbaf065] [Received: 06/17/2024] [Revised: 01/02/2025] [Accepted: 03/07/2025] [Indexed: 03/26/2025] Open
Abstract
One of the main goals of metagenomic studies is to describe the taxonomic diversity of microbial communities. A crucial step in metagenomic analysis is metagenomic binning, which involves the (supervised) classification or (unsupervised) clustering of metagenomic sequences. Various machine learning models have been applied to address this task. In this review, the contributions of artificial neural networks (ANNs) in the context of metagenomic binning are detailed, addressing supervised, unsupervised, and semi-supervised approaches. Thirty-four ANN-based binning tools are systematically compared, detailing their architectures, input features, datasets, advantages, disadvantages, and other relevant aspects. The findings reveal that deep learning approaches, such as convolutional neural networks and autoencoders, achieve higher accuracy and scalability than traditional methods. Gaps in benchmarking practices are highlighted, and future directions are proposed, including standardized datasets and optimization of architectures for third-generation sequencing. This review provides support to researchers in identifying trends and selecting suitable tools for the metagenomic binning problem.
Affiliation(s)
- Jair Herazo-Álvarez
  - Doctorado en Modelamiento Matemático Aplicado, Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Marco Mora
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Departamento de Computación e Industrias, Facultad de Ciencias de la Ingeniería, Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Sara Cuadros-Orellana
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Centro de Biotecnología de los Recursos Naturales (CENBio), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Karina Vilches-Ponce
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Ruber Hernández-García
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Departamento de Computación e Industrias, Facultad de Ciencias de la Ingeniería, Universidad Católica del Maule, Talca, Maule 3480564, Chile

2
Nowicki M, Mroczek M, Mukhedkar D, Bała P, Nikolai Pimenoff V, Arroyo Mühr LS. HPV-KITE: sequence analysis software for rapid HPV genotype detection. Brief Bioinform 2025; 26:bbaf155. [PMID: 40205852 PMCID: PMC11982018 DOI: 10.1093/bib/bbaf155] [Received: 09/18/2024] [Revised: 03/04/2025] [Accepted: 03/19/2025] [Indexed: 04/11/2025] Open
Abstract
Human papillomaviruses (HPVs) are among the most diverse viral families that infect humans. Fortunately, only a small number of closely related HPV types affect human health, most notably by causing nearly all cervical cancers, as well as some oral and other anogenital cancers, particularly when infections with high-risk HPV types become persistent. Numerous viral polymerase chain reaction-based diagnostic methods as well as sequencing protocols have been developed for accurate, rapid, and efficient HPV genotyping. However, due to the large number of closely related HPV genotypes and the abundance of nonviral DNA in human-derived biological samples, it can be challenging to correctly detect HPV genotypes using high-throughput deep sequencing. Here, we introduce a novel HPV detection algorithm, HPV-KITE (HPV K-mer Index Tversky Estimator), which leverages k-mer data analysis and utilizes Tversky indexing for DNA and RNA sequence data. This method offers a rapid and sensitive alternative for detecting HPV from both metagenomic and transcriptomic datasets. We assessed HPV-KITE using three previously analyzed HPV infection-related datasets, comprising a total of 1430 sequenced human samples. For benchmarking, we compared our method's performance with standard HPV sequencing analysis algorithms, including general sequence-based mapping and k-mer-based classification methods. Parallelized runs achieved fast processing times through shingling, and scalability analysis revealed optimal performance when employing multiple nodes. Our results showed that HPV-KITE is one of the fastest, most accurate, and easiest ways to detect HPV genotypes from virtually any next-generation sequencing data. Moreover, the method is highly scalable and can be optimized for microorganisms other than HPV.
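To make the "k-mer plus Tversky index" idea in this abstract concrete, the following is a minimal sketch and not the HPV-KITE implementation. The function names, the choice of k, and the alpha/beta weights are illustrative assumptions; only the Tversky index formula itself is standard.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (shingles) of seq, upper-cased."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def tversky(x, y, alpha=1.0, beta=0.0):
    """Tversky index: |X & Y| / (|X & Y| + alpha*|X - Y| + beta*|Y - X|)."""
    shared = len(x & y)
    denom = shared + alpha * len(x - y) + beta * len(y - x)
    return shared / denom if denom else 0.0


# Toy sequences (hypothetical, not real HPV data).
reference = kmers("ACGTACGTGA", k=4)   # 6 distinct 4-mers
read = kmers("ACGTAC", k=4)            # 3 distinct 4-mers, all present in the reference

containment = tversky(read, reference, alpha=1.0, beta=0.0)  # penalizes only read-specific k-mers
jaccard = tversky(read, reference, alpha=1.0, beta=1.0)      # alpha = beta = 1 gives Jaccard
dice = tversky(read, reference, alpha=0.5, beta=0.5)         # alpha = beta = 0.5 gives Dice
```

The asymmetric setting (alpha = 1, beta = 0) scores how much of a short read is contained in a much longer reference, which is one plausible reason to prefer a Tversky index over a symmetric Jaccard score when reads and genomes differ greatly in length.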
Affiliation(s)
- Marek Nowicki
  - Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Tyniecka 15/17, PL-02-630 Warsaw, Poland
  - Faculty of Mathematics and Computer Science, Nicolaus Copernicus University in Toruń, ul. Chopina 12/18, PL-87-100 Toruń, Poland
- Magdalena Mroczek
  - Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Tyniecka 15/17, PL-02-630 Warsaw, Poland
  - Department of Biomedicine, University Hospital Basel, University of Basel, Hebelstrasse 20, CH-4031 Basel, Switzerland
- Dhananjay Mukhedkar
  - Department of Clinical Science, Intervention and Technology, Forskningsgatan 56, Karolinska University Hospital, Karolinska Institutet, SE-14186 Stockholm, Sweden
  - Hopsworks AB, Åsögatan 119, SE-116 24 Stockholm, Sweden
- Piotr Bała
  - Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Tyniecka 15/17, PL-02-630 Warsaw, Poland
- Ville Nikolai Pimenoff
  - Department of Clinical Science, Intervention and Technology, Forskningsgatan 56, Karolinska University Hospital, Karolinska Institutet, SE-14186 Stockholm, Sweden
  - Research Unit of Population Health and Borealis Biobank, Faculty of Medicine, University of Oulu, Aapistie 5 B, FI-90014 University of Oulu, Finland
- Laila Sara Arroyo Mühr
  - Department of Clinical Science, Intervention and Technology, Forskningsgatan 56, Karolinska University Hospital, Karolinska Institutet, SE-14186 Stockholm, Sweden

3
Koul M, Kaushik S, Singh K, Sharma D. VITALdb: to select the best viroinformatics tools for a desired virus or application. Brief Bioinform 2025; 26:bbaf084. [PMID: 40063348 PMCID: PMC11892104 DOI: 10.1093/bib/bbaf084] [Received: 10/14/2024] [Revised: 01/14/2025] [Accepted: 02/17/2025] [Indexed: 05/13/2025] Open
Abstract
The recent pandemics of viral diseases, COVID-19 and mpox (humans) and lumpy skin disease (cattle), have kept attention focused on viral research. These pandemics, along with the recent human metapneumovirus outbreak, have exposed the urgent need for early diagnosis of viral infections, vaccine development, and discovery of novel antiviral drugs and therapeutics. To support this, an armamentarium of virus-specific computational tools is currently available. VITALdb (VIroinformatics Tools and ALgorithms database) is a resource of ~360 viroinformatics tools encompassing all major viruses (SARS-CoV-2, influenza virus, human immunodeficiency virus, papillomavirus, herpes simplex virus, hepatitis virus, dengue virus, Ebola virus, Zika virus, etc.) and several diverse applications [structural and functional annotation, antiviral peptides development, subspecies characterization, recognition of viral recombination, inhibitors identification, phylogenetic analysis, virus-host prediction, viral metagenomics, detection of mutation(s), primer designing, etc.]. Resources, tools, and other utilities mentioned in this article will not only facilitate further developments in the realm of viroinformatics but also give a strong impetus to translating fundamental knowledge into applied research. Most importantly, VITALdb is an indispensable resource for selecting the best tool(s) to carry out a desired task and hence will prove to be a vital database for the scientific community. Database URL: https://compbio.iitr.ac.in/vitaldb.
Affiliation(s)
- Mira Koul
  - Computational Biology and Translational Bioinformatics (CBTB) Laboratory, Department of Biosciences and Bioengineering, Indian Institute of Technology Roorkee, Roorkee 247667, Uttarakhand, India
- Shalini Kaushik
  - Computational Biology and Translational Bioinformatics (CBTB) Laboratory, Department of Biosciences and Bioengineering, Indian Institute of Technology Roorkee, Roorkee 247667, Uttarakhand, India
- Kavya Singh
  - Computational Biology and Translational Bioinformatics (CBTB) Laboratory, Department of Biosciences and Bioengineering, Indian Institute of Technology Roorkee, Roorkee 247667, Uttarakhand, India
- Deepak Sharma
  - Computational Biology and Translational Bioinformatics (CBTB) Laboratory, Department of Biosciences and Bioengineering, Indian Institute of Technology Roorkee, Roorkee 247667, Uttarakhand, India

4
Mehedi ST, Abdulrazak LF, Ahmed K, Uddin MS, Bui FM, Chen L, Moni MA, Al-Zahrani FA. A privacy-preserving dependable deep federated learning model for identifying new infections from genome sequences. Sci Rep 2025; 15:7291. [PMID: 40025035 PMCID: PMC11873272 DOI: 10.1038/s41598-025-89612-x] [Received: 09/24/2024] [Accepted: 02/06/2025] [Indexed: 03/04/2025] Open
Abstract
The traditional molecular-based identification (TMID) technique for identifying new infections from genome sequences (GSs) has made significant contributions so far. However, because medical data are sensitive, the TMID practice of transferring patients' data to a central machine or server may create severe privacy and security issues. In recent years, the progress of deep federated learning (DFL) and its remarkable success in many domains have made it a potential solution in this field. Therefore, we propose a dependable and privacy-preserving DFL-based model for identifying new infections from GSs. The unique contributions include automatic and effective feature selection suited to identifying new infections, the design of a dependable and privacy-preserving DFL-based LeNet model, and evaluation on real-world data. To this end, a comprehensive experimental performance evaluation has been conducted. Our proposed model has an overall accuracy of 99.12% after independently and identically distributing the dataset among six clients. Moreover, the proposed model has a precision of 98.23%, recall of 98.04%, F1-score of 96.24%, Cohen's kappa of 83.94%, and ROC AUC of 98.24% for the same configuration, which is a noticeable improvement when compared to the other benchmark models. The proposed dependable model, along with the empirical results, is encouraging enough to be considered as an alternative for identifying new infections from other virus strains while ensuring the privacy and security of patients' data.
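The abstract does not say how the client models are aggregated. As a rough illustration of the general federated idea (clients train locally and share only parameters, never patient sequences), the sketch below assumes the common FedAvg rule, a size-weighted average of client parameters. The function name, the flat parameter lists, and the toy values are all hypothetical.

```python
def fed_avg(client_weights, client_sizes):
    """Size-weighted average of per-client parameter vectors (FedAvg-style aggregation)."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Six hypothetical clients with equally sized shards, mirroring the six-client IID split
# mentioned in the abstract; the parameter values are toy numbers, not trained weights.
clients = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0], [5.0, 1.0]]
sizes = [100] * 6
global_weights = fed_avg(clients, sizes)
```

With equal shard sizes the rule reduces to a plain mean. With unequal shards, clients holding more data pull the global model further toward their local parameters.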
Affiliation(s)
- Sk Tanzir Mehedi
  - Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh
- Lway Faisal Abdulrazak
  - Electrical Engineering Technical College, Middle Technical University, Baghdad, Iraq
  - Department of Computer Science, Cihan University Sulaimaniya, Sulaimaniya, Kurdistan Region, 46001, Iraq
- Kawsar Ahmed
  - Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
  - Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh
  - Group of Bio-Photomatiχ, Information and Communication Technology, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh
- Muhammad Shahin Uddin
  - Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Santosh, Tangail, 1902, Bangladesh
- Francis M Bui
  - Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
- Li Chen
  - Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
- Mohammad Ali Moni
  - AI and Digital Health Technology, Artificial Intelligence and Cyber Future Institute, Charles Sturt University, Bathurst, NSW, 2795, Australia
  - AI and Digital Health Technology, Rural Health Research Institute, Charles Sturt University, Orange, NSW, 2800, Australia

5
Nawaz MS, Nawaz MZ, Junyi Z, Fournier-Viger P, Qu JF. Exploiting the sequential nature of genomic data for improved analysis and identification. Comput Biol Med 2024; 183:109307. [PMID: 39488052 DOI: 10.1016/j.compbiomed.2024.109307] [Received: 04/13/2024] [Revised: 09/18/2024] [Accepted: 10/18/2024] [Indexed: 11/04/2024]
Abstract
Genomic data is growing exponentially, posing new challenges for sequence analysis and classification, particularly for managing and understanding harmful new viruses that may later cause pandemics. Recent genome sequence classification models yield promising performance. However, the majority of them do not consider the sequential arrangement of nucleotides and amino acids, a critical aspect for uncovering their inherent structure and function. To overcome this, we introduce GenoAnaCla, a novel approach for analyzing and classifying genome sequences, based on sequential pattern mining (SPM). The proposed approach first constructs and preprocesses datasets comprising RNA virus genome sequences in three formats: nucleotide, coding region, and protein. Then, to capture sequential features for the analysis and classification of viruses, GenoAnaCla extracts frequent sequential patterns and rules in three forms and in codons. Eight classifiers are utilized, and their effectiveness is assessed by employing a variety of evaluation metrics. A performance comparison demonstrates that the suggested approach surpasses the current state-of-the-art genome sequence classification and detection techniques with a 3.18% performance increase in accuracy on average.
Affiliation(s)
- M Saqib Nawaz
  - College of Computer Science and Software Engineering, Shenzhen University, China
- M Zohaib Nawaz
  - College of Computer Science and Software Engineering, Shenzhen University, China
  - Faculty of Computing and Information Technology, Department of Computer Science, University of Sargodha, Pakistan
- Zhang Junyi
  - College of Computer Science and Software Engineering, Shenzhen University, China
- Jun-Feng Qu
  - School of Computer Engineering, Hubei University of Arts and Science, Xiangyang, Hubei, China

6
Zárate A, Díaz-González L, Taboada B. VirDetect-AI: a residual and convolutional neural network-based metagenomic tool for eukaryotic viral protein identification. Brief Bioinform 2024; 26:bbaf001. [PMID: 39808116 PMCID: PMC11729733 DOI: 10.1093/bib/bbaf001] [Received: 05/28/2024] [Revised: 11/12/2024] [Accepted: 08/01/2025] [Indexed: 01/16/2025] Open
Abstract
This study addresses the challenging task of identifying viruses within metagenomic data, which encompasses a broad array of biological samples, including animal reservoirs, environmental sources, and the human body. Traditional methods for virus identification often face limitations due to the diversity and rapid evolution of viral genomes. In response, recent efforts have focused on leveraging artificial intelligence (AI) techniques to enhance accuracy and efficiency in virus detection. However, existing AI-based approaches are primarily binary classifiers that lack specificity in identifying viral types and rely on nucleotide sequences. To address these limitations, VirDetect-AI, a novel tool specifically designed for the identification of eukaryotic viruses within metagenomic datasets, is introduced. The VirDetect-AI model employs a combination of convolutional neural networks and residual neural networks to effectively extract hierarchical features and detailed patterns from complex amino acid sequence data. The model performed strongly on all reported metrics, with a sensitivity of 0.97, a precision of 0.98, and an F1-score of 0.98. VirDetect-AI improves our comprehension of viral ecology and can accurately classify metagenomic sequences into 980 viral protein classes, hence enabling the identification of new viruses. These classes encompass an extensive array of viral genera and families, as well as protein functions and hosts.
Affiliation(s)
- Alida Zárate
  - Doctorado en Ciencias, Instituto de Investigación en Ciencias Básicas Aplicadas (IICBA), Universidad Autónoma del Estado de Morelos, Cuernavaca, Morelos 62210, México
- Lorena Díaz-González
  - Centro de Investigación en Ciencias, Universidad Autónoma del Estado de Morelos, Cuernavaca, Morelos 62210, México
- Blanca Taboada
  - Departamento de Genética del Desarrollo y Fisiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México

7
Yang M, Wang Z, Yan Z, Wang W, Zhu Q, Jin C. DNASimCLR: a contrastive learning-based deep learning approach for gene sequence data classification. BMC Bioinformatics 2024; 25:328. [PMID: 39402441 PMCID: PMC11476100 DOI: 10.1186/s12859-024-05955-8] [Received: 02/29/2024] [Accepted: 10/09/2024] [Indexed: 10/19/2024] Open
Abstract
BACKGROUND The rapid advancements in deep neural network models have significantly enhanced the ability to extract features from microbial sequence data, which is critical for addressing biological challenges. However, the scarcity and complexity of labeled microbial data pose substantial difficulties for supervised learning approaches. To address these issues, we propose DNASimCLR, an unsupervised framework designed for efficient gene sequence data feature extraction. RESULTS DNASimCLR leverages convolutional neural networks and the SimCLR framework, based on contrastive learning, to extract intricate features from diverse microbial gene sequences. Pre-training was conducted on two classic large-scale unlabelled datasets encompassing metagenomes and viral gene sequences. Subsequent classification tasks were performed by fine-tuning the pretrained model. Our experiments demonstrate that DNASimCLR is at least comparable to state-of-the-art techniques for gene sequence classification. Among convolutional neural network-based approaches, DNASimCLR surpasses the latest existing methods, establishing its superiority over state-of-the-art CNN-based feature extraction techniques. Furthermore, the model exhibits superior performance across diverse tasks in analyzing biological sequence data, showcasing its robust adaptability. CONCLUSIONS DNASimCLR represents a robust and database-agnostic solution for gene sequence classification. Its versatility allows it to perform well in scenarios involving novel or previously unseen gene sequences, making it a valuable tool for diverse applications in genomics.
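The abstract names the SimCLR framework but not the training objective. The original SimCLR formulation uses the NT-Xent (normalized temperature-scaled cross-entropy) loss, so the sketch below assumes that loss. It is a plain-Python illustration on toy embedding vectors, not the DNASimCLR code. The function names, the temperature value, and the embeddings are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent(views_a, views_b, temperature=0.5):
    """NT-Xent loss where views_a[i] and views_b[i] are two views of the same sequence."""
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the other view of the same sequence (the positive)
        # Scaled similarities to every other embedding; the anchor itself is excluded.
        logits = {k: cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += log_denom - logits[j]
    return total / (2 * n)


# Toy 2-D embeddings (hypothetical): correctly paired views versus deliberately swapped views.
aligned = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = nt_xent([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is lower when the two views of each sequence are embedded close together and far from other sequences. That is the signal that lets an encoder be pre-trained on unlabelled data before fine-tuning.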
Affiliation(s)
- Minghao Yang
  - Shandong University, Weihai, People's Republic of China
  - Beijing Research Institute of Automation for Machinery Industry, Beijing, People's Republic of China
- Zehua Wang
  - Beijing Research Institute of Automation for Machinery Industry, Beijing, People's Republic of China
- Zizhuo Yan
  - Beijing Research Institute of Automation for Machinery Industry, Beijing, People's Republic of China
- Wenxiang Wang
  - Beijing Research Institute of Automation for Machinery Industry, Beijing, People's Republic of China
- Qian Zhu
  - Shandong University, Weihai, People's Republic of China
- Changlong Jin
  - Shandong University, Weihai, People's Republic of China

8
Azevedo KS, de Souza LC, Coutinho MGF, de M Barbosa R, Fernandes MAC. Deepvirusclassifier: a deep learning tool for classifying SARS-CoV-2 based on viral subtypes within the coronaviridae family. BMC Bioinformatics 2024; 25:231. [PMID: 38969970 PMCID: PMC11225326 DOI: 10.1186/s12859-024-05754-1] [Received: 08/23/2023] [Accepted: 03/19/2024] [Indexed: 07/07/2024] Open
Abstract
PURPOSE In this study, we present DeepVirusClassifier, a tool capable of accurately classifying Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) viral sequences among other subtypes of the Coronaviridae family. This classification is achieved through a deep neural network model that relies on convolutional neural networks (CNNs). Since viruses within the same family share similar genetic and structural characteristics, the classification process becomes more challenging, necessitating more robust models. With the rapid evolution of viral genomes and the increasing need for timely classification, we aimed to provide a robust and efficient tool that could increase the accuracy of viral identification and classification processes, contribute to advancing research in viral genomics, and assist in surveilling emerging viral strains. METHODS Based on a one-dimensional deep CNN, the proposed tool is capable of training and testing on the Coronaviridae family, including SARS-CoV-2. Our model's performance was assessed using various metrics, including F1-score and AUROC. Additionally, artificial mutation tests were conducted to evaluate the model's generalization ability across sequence variations. We also used the BLAST algorithm and conducted comprehensive processing time analyses for comparison. RESULTS DeepVirusClassifier demonstrated exceptional performance across several evaluation metrics in the training and testing phases, indicating its robust learning capacity. Notably, during testing on more than 10,000 viral sequences, the model exhibited a sensitivity of more than 99% for sequences with fewer than 2000 mutations. The tool achieves superior accuracy and significantly reduced processing times compared to the Basic Local Alignment Search Tool (BLAST) algorithm. Furthermore, the results appear more reliable than the work discussed in the text, indicating that the tool has great potential to revolutionize viral genomic research.
CONCLUSION DeepVirusClassifier is a powerful tool for accurately classifying viral sequences, specifically focusing on SARS-CoV-2 and other subtypes within the Coronaviridae family. The superiority of our model becomes evident through rigorous evaluation and comparison with existing methods. Introducing artificial mutations into the sequences demonstrates the tool's ability to identify variations and significantly contributes to viral classification and genomic research. As viral surveillance becomes increasingly critical, our model holds promise in aiding rapid and accurate identification of emerging viral strains.
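The abstract mentions a one-dimensional CNN and artificial mutation tests but gives neither the input encoding nor the mutation procedure. The sketch below assumes two common choices, one-hot encoding of nucleotides and random point substitutions, purely for illustration. All names and parameters are hypothetical.

```python
import random

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of four 0/1 values; unknown symbols (e.g. N) become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def mutate(seq, n_mutations, rng):
    """Substitute n_mutations distinct positions, each with a base different from the original."""
    chars = list(seq.upper())
    for pos in rng.sample(range(len(chars)), n_mutations):
        chars[pos] = rng.choice([b for b in BASES if b != chars[pos]])
    return "".join(chars)


rng = random.Random(0)        # fixed seed so the toy example is reproducible
original = "ACGTACGTAC"       # toy sequence, not a real viral genome
mutant = mutate(original, 3, rng)
encoded = one_hot(mutant)     # shape: len(mutant) rows x 4 channels, a typical 1-D CNN input
```

A robustness test in this style would classify `one_hot(mutate(seq, m, rng))` for increasing `m` and track how sensitivity changes with the number of substitutions.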
Affiliation(s)
- Karolayne S Azevedo
  - InovAI Lab, nPITI/IMD, Federal University of Rio Grande do Norte, Natal, RN, 59078-970, Brazil
- Luísa C de Souza
  - InovAI Lab, nPITI/IMD, Federal University of Rio Grande do Norte, Natal, RN, 59078-970, Brazil
- Maria G F Coutinho
  - InovAI Lab, nPITI/IMD, Federal University of Rio Grande do Norte, Natal, RN, 59078-970, Brazil
- Raquel de M Barbosa
  - InovAI Lab, nPITI/IMD, Federal University of Rio Grande do Norte, Natal, RN, 59078-970, Brazil
  - Department of Pharmacy and Pharmaceutical Technology, University of Seville, 41012, Seville, Spain
- Marcelo A C Fernandes
  - InovAI Lab, nPITI/IMD, Federal University of Rio Grande do Norte, Natal, RN, 59078-970, Brazil
  - Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal, RN, 59078-970, Brazil
  - Department of Computer Engineering and Automation (DCA), Federal University of Rio Grande do Norte, Natal, RN, 59078-970, Brazil

9
Gündüz HA, Mreches R, Moosbauer J, Robertson G, To XY, Franzosa EA, Huttenhower C, Rezaei M, McHardy AC, Bischl B, Münch PC, Binder M. Optimized model architectures for deep learning on genomic data. Commun Biol 2024; 7:516. [PMID: 38693292 PMCID: PMC11063068 DOI: 10.1038/s42003-024-06161-1] [Received: 02/08/2023] [Accepted: 04/08/2024] [Indexed: 05/03/2024] Open
Abstract
The success of deep learning in various applications depends on task-specific architecture design choices, including the types, hyperparameters, and number of layers. In computational biology, there is no consensus on the optimal architecture design, and decisions are often made using insights from more well-established fields such as computer vision. These may not consider the domain-specific characteristics of genome sequences, potentially limiting performance. Here, we present GenomeNet-Architect, a neural architecture design framework that automatically optimizes deep learning models for genome sequence data. It optimizes the overall layout of the architecture, with a search space specifically designed for genomics. Additionally, it optimizes hyperparameters of individual layers and the model training procedure. On a viral classification task, GenomeNet-Architect reduced the read-level misclassification rate by 19%, with 67% faster inference and 83% fewer parameters, and achieved similar contig-level accuracy with ~100 times fewer parameters compared to the best-performing deep learning baselines.
Affiliation(s)
- Hüseyin Anil Gündüz
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
- René Mreches
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Julia Moosbauer
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Gary Robertson
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Xiao-Yin To
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Eric A Franzosa
  - Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
- Curtis Huttenhower
  - Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
- Mina Rezaei
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Alice C McHardy
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), partner site Hannover Braunschweig, Braunschweig, Germany
- Bernd Bischl
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Philipp C Münch
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
  - Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
  - German Centre for Infection Research (DZIF), partner site Hannover Braunschweig, Braunschweig, Germany
- Martin Binder
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany

10
Liu G, Chen X, Luan Y, Li D. VirusPredictor: XGBoost-based software to predict virus-related sequences in human data. Bioinformatics 2024; 40:btae192. [PMID: 38597887 PMCID: PMC11052659 DOI: 10.1093/bioinformatics/btae192] [Received: 10/12/2023] [Revised: 02/29/2024] [Accepted: 04/08/2024] [Indexed: 04/11/2024] Open
Abstract
MOTIVATION Discovering disease causative pathogens, particularly viruses without reference genomes, poses a technical challenge as they are often unidentifiable through sequence alignment. Machine learning prediction of patient high-throughput sequences unmappable to human and pathogen genomes may reveal sequences originating from uncharacterized viruses. Currently, there is a lack of software specifically designed for accurately predicting such viral sequences in human data. RESULTS We developed a fast XGBoost method and software VirusPredictor leveraging an in-house viral genome database. Our two-step XGBoost models first classify each query sequence into one of three groups: infectious virus, endogenous retrovirus (ERV) or non-ERV human. The prediction accuracies increased as the sequences became longer, i.e. 0.76, 0.93, and 0.98 for 150-350 (Illumina short reads), 850-950 (Sanger sequencing data), and 2000-5000 bp sequences, respectively. Then, sequences predicted to be from infectious viruses are further classified into one of six virus taxonomic subgroups, and the accuracies increased from 0.92 to >0.98 when query sequences increased from 150-350 to >850 bp. The results suggest that Illumina short reads should be de novo assembled into contigs (e.g. ∼1000 bp or longer) before prediction whenever possible. We applied VirusPredictor to multiple real genomic and metagenomic datasets and obtained high accuracies. VirusPredictor, a user-friendly open-source Python software, is useful for predicting the origins of patients' unmappable sequences. This study is the first to classify ERVs in infectious viral sequence prediction. This is also the first study combining virus sub-group predictions. AVAILABILITY AND IMPLEMENTATION www.dllab.org/software/VirusPredictor.html.
Affiliation(s)
- Guangchen Liu
- Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, Vermont 05405, United States
- School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- School of Mathematics and Statistics, Ludong University, Yantai, Shandong 264025, China
| | - Xun Chen
- Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, Vermont 05405, United States
| | - Yihui Luan
- School of Mathematics, Shandong University, Jinan, Shandong 250100, China
| | - Dawei Li
- Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, Vermont 05405, United States
- Department of Immunology and Molecular Microbiology, Texas Tech University Health Sciences Center, Lubbock, Texas 79430, United States
- ICanCME Research Network, Sainte-Justine University Hospital Research Center, Montreal, Quebec H3T 1C5, Canada
|
11
|
Hegarty B, Riddell V J, Bastien E, Langenfeld K, Lindback M, Saini JS, Wing A, Zhang J, Duhaime M. Benchmarking informatics approaches for virus discovery: caution is needed when combining in silico identification methods. mSystems 2024; 9:e0110523. [PMID: 38376167 PMCID: PMC10949488 DOI: 10.1128/msystems.01105-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Accepted: 01/24/2024] [Indexed: 02/21/2024] Open
Abstract
Understanding the ecological impacts of viruses on natural and engineered ecosystems relies on the accurate identification of viral sequences from community sequencing data. To maximize viral recovery from metagenomes, researchers frequently combine viral identification tools. However, the effectiveness of this strategy is unknown. Here, we benchmarked combinations of six widely used informatics tools for viral identification and analysis (VirSorter, VirSorter2, VIBRANT, DeepVirFinder, CheckV, and Kaiju), called "rulesets." Rulesets were tested against mock metagenomes composed of taxonomically diverse sequence types and diverse aquatic metagenomes to assess the effects of the degree of viral enrichment and habitat on tool performance. We found that six rulesets achieved equivalent accuracy [Matthews Correlation Coefficient (MCC) = 0.77, Padj ≥ 0.05]. Each contained VirSorter2, and five used our "tuning removal" rule designed to remove non-viral contamination. While DeepVirFinder, VIBRANT, and VirSorter were each found once in these high-accuracy rulesets, they were not found in combination with each other: combining tools does not lead to optimal performance. Our validation suggests that the MCC plateau at 0.77 is partly caused by inaccurate labeling within reference sequence databases. In aquatic metagenomes, our highest MCC ruleset identified more viral sequences in virus-enriched (44%-46%) than in cellular metagenomes (7%-19%). While improved algorithms may lead to more accurate viral identification tools, this should be done in tandem with careful curation of sequence databases. We recommend using the VirSorter2 ruleset and our empirically derived tuning removal rule. Our analysis provides insight into methods for in silico viral identification and will enable more robust viral identification from metagenomic data sets. 
IMPORTANCE The identification of viruses from environmental metagenomes using informatics tools has offered critical insights in microbial ecology. However, it remains difficult for researchers to know which tools optimize viral recovery for their specific study. In an attempt to recover more viruses, studies are increasingly combining the outputs from multiple tools without validating this approach. After benchmarking combinations of six viral identification tools against mock metagenomes and environmental samples, we found that these tools should only be combined cautiously. Two to four tool combinations maximized viral recovery and minimized non-viral contamination compared with either the single-tool or the five- to six-tool ones. By providing a rigorous overview of the behavior of in silico viral identification strategies and a pipeline to replicate our process, our findings guide the use of existing viral identification tools and offer a blueprint for feature engineering of new tools that will lead to higher-confidence viral discovery in microbiome studies.
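The abstract above reports accuracy as the Matthews Correlation Coefficient (MCC). As a reminder of how that metric is computed from a binary confusion matrix, here is a minimal sketch; the function name and the example counts are illustrative assumptions and are not taken from the study.

```python
import math


def matthews_corrcoef_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the MCC for a binary classifier given confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    By convention, 0.0 is returned when the denominator is zero.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts: 90 viral contigs found, 10 missed,
# 15 non-viral contigs wrongly called viral, 85 correctly rejected.
mcc = matthews_corrcoef_from_counts(tp=90, tn=85, fp=15, fn=10)
print(f"MCC = {mcc:.3f}")
```

Unlike plain accuracy, MCC uses all four cells of the confusion matrix, which is presumably why it suits benchmarks where viral and non-viral sequences are unevenly represented.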
Affiliation(s)
- Bridget Hegarty
- Department of Civil and Environmental Engineering, Case Western Reserve University, Cleveland, Ohio, USA
| | - James Riddell V
- Department of Microbiology, The Ohio State University, Columbus, Ohio, USA
| | - Eric Bastien
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
| | - Kathryn Langenfeld
- Department of Civil and Environmental Engineering, Stanford University, Palo Alto, California, USA
| | - Morgan Lindback
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
| | - Jaspreet S. Saini
- Laboratory for Environmental Biotechnology, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
| | - Anthony Wing
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
| | - Jessica Zhang
- Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, Michigan, USA
| | - Melissa Duhaime
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA
|
12
|
Ming Z, Chen X, Wang S, Liu H, Yuan Z, Wu M, Xia H. HostNet: improved sequence representation in deep neural networks for virus-host prediction. BMC Bioinformatics 2023; 24:455. [PMID: 38041071 PMCID: PMC10691023 DOI: 10.1186/s12859-023-05582-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Accepted: 11/24/2023] [Indexed: 12/03/2023] Open
Abstract
BACKGROUND The increase in emerging viruses over the past decade has highlighted the need to determine their respective hosts, particularly for those that pose a potential threat to human and animal welfare. Yet the traditional means of ascertaining the host range of viruses, which involves field surveillance and laboratory experiments, is laborious and demanding. A computational tool that can reliably predict host ranges for novel viruses can provide timely responses in the prevention and control of emerging infectious diseases. Virus-host prediction is complicated by issues such as data imbalance and deficiency. Therefore, developing highly accurate computational tools capable of predicting virus-host associations is a challenging and pressing demand. RESULTS To overcome the challenges of virus-host prediction, we present HostNet, a deep learning framework that utilizes a Transformer-CNN-BiGRU architecture and two enhanced sequence representation modules. The first module, k-mer to vector, pre-trains a background vector representation of k-mers from a broad range of virus sequences to address the issue of data deficiency. The second module, an adaptive sliding window, truncates virus sequences of various lengths to create a uniform number of informative and distinct samples for each sequence to address the issue of data imbalance. We assess HostNet's performance on a benchmark dataset of "Rabies lyssavirus" and an in-house dataset of "Flavivirus". Our results show that HostNet surpasses the state-of-the-art deep learning-based method in host-prediction accuracy and F1 score. The enhanced sequence representation modules significantly improve HostNet's training generalization, performance in challenging classes, and stability. CONCLUSION HostNet is a promising framework for predicting virus hosts from genomic sequences, addressing challenges posed by sparse and varying-length virus sequence data. Our results demonstrate its potential as a valuable tool for virus-host prediction in various biological contexts. Virus-host prediction based on genomic sequences using deep neural networks is a promising approach to identifying potential hosts accurately and efficiently, with significant impacts on public health, disease prevention, and vaccine development.
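The adaptive sliding window described above is not specified in detail in the abstract. One plausible reading, sketched below purely as an assumption, is that the stride is adapted to each sequence's length so that every sequence yields the same number of fixed-length windows. The function name, the padding character, and the default sizes are hypothetical and are not HostNet's actual implementation.

```python
def adaptive_windows(seq: str, window: int = 100, n_samples: int = 8) -> list[str]:
    """Cut `seq` into exactly `n_samples` windows of length `window`.

    The stride is derived from the sequence length, so long sequences are
    sampled sparsely and short ones densely. Sequences shorter than one
    window are right-padded with 'N' (an assumption of this sketch).
    Very short sequences may yield overlapping or duplicate windows.
    """
    if window <= 0 or n_samples <= 0:
        raise ValueError("window and n_samples must be positive")
    if len(seq) < window:
        seq = seq + "N" * (window - len(seq))
    max_start = len(seq) - window
    if n_samples == 1:
        starts = [0]
    else:
        # Evenly spaced start positions from 0 to max_start inclusive.
        starts = [round(i * max_start / (n_samples - 1)) for i in range(n_samples)]
    return [seq[s:s + window] for s in starts]


# Two sequences of very different lengths give the same number of samples.
short_genome = "ACGT" * 100    # 400 bp
long_genome = "ACGT" * 2500    # 10,000 bp
print(len(adaptive_windows(short_genome)), len(adaptive_windows(long_genome)))
```

Producing the same number of samples per sequence keeps long genomes from dominating the training set, which is the imbalance problem the abstract says the module addresses.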
Affiliation(s)
- Zhaoyan Ming
- School of Computer and Computing Science, Hangzhou City University, Hangzhou, 310015, China
| | - Xiangjun Chen
- Polytechnic Institute, Zhejiang University, Hangzhou, 310058, China
| | - Shunlong Wang
- Key Laboratory of Virology and Biosafety, Wuhan Institute of Virology, Wuhan, 430071, China
- University of Chinese Academy of Sciences, Beijing, 100190, China
| | - Hong Liu
- Institute of Biomedicine, Shandong University of Technology, Zibo, 255000, China
| | - Zhiming Yuan
- Key Laboratory of Virology and Biosafety, Wuhan Institute of Virology, Wuhan, 430071, China
- University of Chinese Academy of Sciences, Beijing, 100190, China
| | - Minghui Wu
- School of Computer and Computing Science, Hangzhou City University, Hangzhou, 310015, China
| | - Han Xia
- Key Laboratory of Virology and Biosafety, Wuhan Institute of Virology, Wuhan, 430071, China
- University of Chinese Academy of Sciences, Beijing, 100190, China
- Hubei Jiangxia Laboratory, Wuhan, 430200, China
|
13
|
Marcos-Zambrano LJ, López-Molina VM, Bakir-Gungor B, Frohme M, Karaduzovic-Hadziabdic K, Klammsteiner T, Ibrahimi E, Lahti L, Loncar-Turukalo T, Dhamo X, Simeon A, Nechyporenko A, Pio G, Przymus P, Sampri A, Trajkovik V, Lacruz-Pleguezuelos B, Aasmets O, Araujo R, Anagnostopoulos I, Aydemir Ö, Berland M, Calle ML, Ceci M, Duman H, Gündoğdu A, Havulinna AS, Kaka Bra KHN, Kalluci E, Karav S, Lode D, Lopes MB, May P, Nap B, Nedyalkova M, Paciência I, Pasic L, Pujolassos M, Shigdel R, Susín A, Thiele I, Truică CO, Wilmes P, Yilmaz E, Yousef M, Claesson MJ, Truu J, Carrillo de Santa Pau E. A toolbox of machine learning software to support microbiome analysis. Front Microbiol 2023; 14:1250806. [PMID: 38075858 PMCID: PMC10704913 DOI: 10.3389/fmicb.2023.1250806] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/30/2023] [Accepted: 09/11/2023] [Indexed: 05/14/2025] Open
Abstract
The human microbiome has become an area of intense research due to its potential impact on human health. However, the analysis and interpretation of these data have proven to be challenging due to their complexity and high dimensionality. Machine learning (ML) algorithms can process vast amounts of data to uncover informative patterns and relationships within the data, even with limited prior knowledge. Therefore, there has been rapid growth in the development of software specifically designed for the analysis and interpretation of microbiome data using ML techniques. These software tools incorporate a wide range of ML algorithms for clustering, classification, regression, or feature selection, to identify microbial patterns and relationships within the data and generate predictive models. This rapid development, together with a constant need for new developments and the integration of new features, requires efforts to compile, catalog, and classify these tools in order to create infrastructures and services with easy, transparent, and trustable standards. Here we review the state of the art for ML tools applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on ML-based software and framework resources currently available for the analysis of microbiome data in humans. The aim is to help microbiologists and biomedical scientists go deeper into specialized resources that integrate ML techniques and to facilitate future benchmarking to create standards for the analysis of microbiome data. The software resources are organized based on the type of analysis they were developed for and the ML techniques they implement. A description of each software tool, with examples of usage, is provided, including comments about pitfalls and gaps in the usage of ML-based software for microbiome data that need to be considered by developers and users. This review represents an extensive compilation to date, offering valuable insights and guidance for researchers interested in leveraging ML approaches for microbiome analysis.
Affiliation(s)
- Laura Judith Marcos-Zambrano
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
| | - Víctor Manuel López-Molina
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
| | - Burcu Bakir-Gungor
- Department of Computer Engineering, Abdullah Gül University, Kayseri, Türkiye
| | - Marcus Frohme
- Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
| | | | - Thomas Klammsteiner
- Department of Microbiology and Department of Ecology, University of Innsbruck, Innsbruck, Austria
| | - Eliana Ibrahimi
- Department of Biology, University of Tirana, Tirana, Albania
| | - Leo Lahti
- Department of Computing, University of Turku, Turku, Finland
| | | | - Xhilda Dhamo
- Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
| | - Andrea Simeon
- BioSense Institute, University of Novi Sad, Novi Sad, Serbia
| | - Alina Nechyporenko
- Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
- Department of Systems Engineering, Kharkiv National University of Radioelectronics, Kharkiv, Ukraine
| | - Gianvito Pio
- Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
- Big Data Lab, National Interuniversity Consortium for Informatics, Rome, Italy
| | - Piotr Przymus
- Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
| | - Alexia Sampri
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, United Kingdom
| | - Vladimir Trajkovik
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
| | - Blanca Lacruz-Pleguezuelos
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
| | - Oliver Aasmets
- Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
- Department of Biotechnology, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
| | - Ricardo Araujo
- Nephrology and Infectious Diseases R & D Group, i3S—Instituto de Investigação e Inovação em Saúde; INEB—Instituto de Engenharia Biomédica, Universidade do Porto, Porto, Portugal
| | - Ioannis Anagnostopoulos
- Department of Informatics, University of Piraeus, Piraeus, Greece
- Computer Science and Biomedical Informatics Department, University of Thessaly, Lamia, Greece
| | - Önder Aydemir
- Department of Electrical and Electronics Engineering, Karadeniz Technical University, Trabzon, Türkiye
| | - Magali Berland
- INRAE, MetaGenoPolis, Université Paris-Saclay, Jouy-en-Josas, France
| | - M. Luz Calle
- Faculty of Sciences, Technology and Engineering, University of Vic – Central University of Catalonia, Vic, Barcelona, Spain
- IRIS-CC, Fundació Institut de Recerca i Innovació en Ciències de la Vida i la Salut a la Catalunya Central, Vic, Barcelona, Spain
| | - Michelangelo Ceci
- Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
- Big Data Lab, National Interuniversity Consortium for Informatics, Rome, Italy
| | - Hatice Duman
- Department of Molecular Biology and Genetics, Çanakkale Onsekiz Mart University, Çanakkale, Türkiye
| | - Aycan Gündoğdu
- Department of Microbiology and Clinical Microbiology, Faculty of Medicine, Erciyes University, Kayseri, Türkiye
- Metagenomics Laboratory, Genome and Stem Cell Center (GenKök), Erciyes University, Kayseri, Türkiye
| | - Aki S. Havulinna
- Finnish Institute for Health and Welfare - THL, Helsinki, Finland
- Institute for Molecular Medicine Finland, FIMM-HiLIFE, Helsinki, Finland
| | | | - Eglantina Kalluci
- Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
| | - Sercan Karav
- Department of Molecular Biology and Genetics, Çanakkale Onsekiz Mart University, Çanakkale, Türkiye
| | - Daniel Lode
- Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
| | - Marta B. Lopes
- Department of Mathematics, Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal
- UNIDEMI, Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal
| | - Patrick May
- Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
| | - Bram Nap
- School of Medicine, University of Galway, Galway, Ireland
| | - Miroslava Nedyalkova
- Department of Inorganic Chemistry, Faculty of Chemistry and Pharmacy, University of Sofia, Sofia, Bulgaria
| | - Inês Paciência
- Center for Environmental and Respiratory Health Research (CERH), Research Unit of Population Health, University of Oulu, Oulu, Finland
- Biocenter Oulu, University of Oulu, Oulu, Finland
| | - Lejla Pasic
- Sarajevo Medical School, University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
| | - Meritxell Pujolassos
- Faculty of Sciences, Technology and Engineering, University of Vic – Central University of Catalonia, Vic, Barcelona, Spain
| | - Rajesh Shigdel
- Department of Clinical Science, University of Bergen, Bergen, Norway
| | - Antonio Susín
- Mathematical Department, UPC-Barcelona Tech, Barcelona, Spain
| | - Ines Thiele
- School of Medicine, University of Galway, Galway, Ireland
- APC Microbiome Ireland, University College Cork, Cork, Ireland
| | - Ciprian-Octavian Truică
- Computer Science and Engineering Department, Faculty of Automatic Control and Computers, National University of Science and Technology Politehnica, Bucharest, Romania
| | - Paul Wilmes
- Systems Ecology Group, Luxembourg Centre for Systems Biomedicine, Esch-sur-Alzette, Luxembourg
- Department of Life Sciences and Medicine, Faculty of Science, Technology and Medicine, University of Luxembourg, Belvaux, Luxembourg
| | - Ercument Yilmaz
- Department of Computer Technologies, Karadeniz Technical University, Trabzon, Türkiye
| | - Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
| | - Marcus Joakim Claesson
- APC Microbiome Ireland, University College Cork, Cork, Ireland
- School of Microbiology, University College Cork, Cork, Ireland
| | - Jaak Truu
- Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
|
14
|
Millan Arias P, Hill KA, Kari L. iDeLUCS: a deep learning interactive tool for alignment-free clustering of DNA sequences. Bioinformatics 2023; 39:btad508. [PMID: 37589603 PMCID: PMC10483029 DOI: 10.1093/bioinformatics/btad508] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/13/2022] [Revised: 07/18/2023] [Accepted: 08/16/2023] [Indexed: 08/18/2023] Open
Abstract
SUMMARY We present an interactive Deep Learning-based software tool for Unsupervised Clustering of DNA Sequences (iDeLUCS), that detects genomic signatures and uses them to cluster DNA sequences, without the need for sequence alignment or taxonomic identifiers. iDeLUCS is scalable and user-friendly: its graphical user interface, with support for hardware acceleration, allows the practitioner to fine-tune the different hyper-parameters involved in the training process without requiring extensive knowledge of deep learning. The performance of iDeLUCS was evaluated on a diverse set of datasets: several real genomic datasets from organisms in kingdoms Animalia, Protista, Fungi, Bacteria, and Archaea, three datasets of viral genomes, a dataset of simulated metagenomic reads from microbial genomes, and multiple datasets of synthetic DNA sequences. The performance of iDeLUCS was compared to that of two classical clustering algorithms (k-means++ and GMM) and two clustering algorithms specialized in DNA sequences (MeShClust v3.0 and DeLUCS), using both intrinsic cluster evaluation metrics and external evaluation metrics. In terms of unsupervised clustering accuracy, iDeLUCS outperforms the two classical algorithms by an average of ∼20%, and the two specialized algorithms by an average of ∼12%, on the datasets of real DNA sequences analyzed. Overall, our results indicate that iDeLUCS is a robust clustering method suitable for the clustering of large and diverse datasets of unlabeled DNA sequences. AVAILABILITY AND IMPLEMENTATION iDeLUCS is available at https://github.com/Kari-Genomics-Lab/iDeLUCS under the terms of the MIT licence.
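Alignment-free methods of the kind summarized above typically start from a numerical "genomic signature" of each sequence. A common choice, shown here only as a generic illustration and not as iDeLUCS's own code, is the normalized k-mer frequency vector. The function name and the decision to skip k-mers containing ambiguous bases are assumptions of this sketch.

```python
from collections import Counter
from itertools import product


def kmer_signature(seq: str, k: int = 3) -> list[float]:
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over "ACGT".
    k-mers containing any other character (e.g. 'N') are ignored.
    """
    seq = seq.upper()
    alphabet_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in alphabet_kmers)
    if total == 0:
        return [0.0] * len(alphabet_kmers)
    return [counts[kmer] / total for kmer in alphabet_kmers]


signature = kmer_signature("ACGTACGTTTGACA", k=2)
print(len(signature), round(sum(signature), 6))
```

Vectors like this have a fixed length regardless of sequence length, so they can be fed directly to a clustering algorithm or a neural network without aligning the sequences first.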
Affiliation(s)
- Pablo Millan Arias
- Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada
| | - Kathleen A Hill
- Department of Biology, University of Western Ontario, London, ON N6A 5B7, Canada
| | - Lila Kari
- Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada
|
15
|
Miao Y, Bian J, Dong G, Dai T. DETIRE: a hybrid deep learning model for identifying viral sequences from metagenomes. Front Microbiol 2023; 14:1169791. [PMID: 37396369 PMCID: PMC10313334 DOI: 10.3389/fmicb.2023.1169791] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 02/20/2023] [Accepted: 05/18/2023] [Indexed: 07/04/2023] Open
Abstract
A metagenome contains all DNA sequences from an environmental sample, including those of viruses, bacteria, archaea, and eukaryotes. Since viruses are highly abundant and, as major pathogens, have historically caused vast mortality and morbidity in human society, detecting viruses in metagenomes plays a crucial role in analyzing the viral component of samples and is the very first step in clinical diagnosis. However, detecting viral fragments directly from metagenomes remains difficult because of the huge number of short sequences. In this study, a hybrid Deep lEarning model for idenTifying vIral sequences fRom mEtagenomes (DETIRE) is proposed to solve the problem. First, a graph-based nucleotide sequence embedding strategy is used to enrich the representation of DNA sequences by training an embedding matrix. Then, spatial and sequential features are extracted by trained CNN and BiLSTM networks, respectively, to enrich the features of short sequences. Finally, the two sets of features are combined by weighting for the final decision. Trained on 220,000 sequences of 500 bp subsampled from the Virus and Host RefSeq genomes, DETIRE identifies more short viral sequences (<1,000 bp) than three recent methods: DeepVirFinder, PPR-Meta, and CHEER. DETIRE is freely available on GitHub (https://github.com/crazyinter/DETIRE).
|
16
|
Wang X, Li F, Teng Y, Ji C, Wu H. Characterization of oxidative damage induced by nanoparticles via mechanism-driven machine learning approaches. The Science of the Total Environment 2023; 871:162103. [PMID: 36764549 DOI: 10.1016/j.scitotenv.2023.162103] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/28/2022] [Revised: 01/19/2023] [Accepted: 02/04/2023] [Indexed: 06/18/2023]
Abstract
The wide application of TiO2-based engineered nanoparticles (nTiO2) has inevitably led to their release into aquatic ecosystems. Importantly, a growing number of studies have emphasized the high risks of nTiO2 to coastal environments. Bivalves, the representative benthic filter feeders in coastal zones, play important roles in assessing and monitoring the toxic effects of nanoparticles. Oxidative damage is one of the main toxic mechanisms of nTiO2 in bivalves, but the experimental variables and nanomaterial characteristics are diverse and the toxicity mechanism is complex. Therefore, machine learning models are needed to characterize and predict the potential toxicity. In this study, thirty-six machine learning models were built from nanodescriptors combined with six machine learning algorithms. Among them, the random forest (RF) - catalase (CAT), k-neighbors classifier (KNN) - glutathione peroxidase (GPx), neural network multilayer perceptron (ANN) - glutathione S-transferase (GST), random forest (RF) - malondialdehyde (MDA), random forest (RF) - reactive oxygen species (ROS), and extreme gradient boosting decision tree (XGB) - superoxide dismutase (SOD) models performed well, with high accuracy and balanced accuracy on both training sets and external validation sets. Furthermore, the best model revealed the predominant factors (exposure concentration, exposure period, and exposure matrix) influencing the oxidative stress induced by nTiO2. These results showed that high exposure concentrations and short exposure intervals tended to cause oxidative damage to bivalves. In addition, gills and digestive glands could be vulnerable to nTiO2-induced oxidative damage, as tissue/organ differences were important factors controlling MDA activity. This study provided insights into important nano-features responsible for the different indicators of oxidative stress and thereby extended the application of machine learning approaches in toxicological assessment for nanoparticles.
Affiliation(s)
- Xiaoqing Wang
- CAS Key Laboratory of Coastal Environmental Processes and Ecological Remediation, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS), Shandong Key Laboratory of Coastal Environmental Processes, YICCAS, Yantai 264003, PR China; University of Chinese Academy of Sciences, Beijing 100049, PR China
| | - Fei Li
- CAS Key Laboratory of Coastal Environmental Processes and Ecological Remediation, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS), Shandong Key Laboratory of Coastal Environmental Processes, YICCAS, Yantai 264003, PR China; Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao 266071, PR China
| | - Yuefa Teng
- CAS Key Laboratory of Coastal Environmental Processes and Ecological Remediation, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS), Shandong Key Laboratory of Coastal Environmental Processes, YICCAS, Yantai 264003, PR China; University of Chinese Academy of Sciences, Beijing 100049, PR China
| | - Chenglong Ji
- CAS Key Laboratory of Coastal Environmental Processes and Ecological Remediation, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS), Shandong Key Laboratory of Coastal Environmental Processes, YICCAS, Yantai 264003, PR China; Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao 266071, PR China
| | - Huifeng Wu
- CAS Key Laboratory of Coastal Environmental Processes and Ecological Remediation, Yantai Institute of Coastal Zone Research (YIC), Chinese Academy of Sciences (CAS), Shandong Key Laboratory of Coastal Environmental Processes, YICCAS, Yantai 264003, PR China; Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao 266071, PR China
|
17
|
Ho SFS, Wheeler NE, Millard AD, van Schaik W. Gauge your phage: benchmarking of bacteriophage identification tools in metagenomic sequencing data. Microbiome 2023; 11:84. [PMID: 37085924 PMCID: PMC10120246 DOI: 10.1186/s40168-023-01533-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 06/11/2022] [Accepted: 03/22/2023] [Indexed: 05/03/2023]
Abstract
BACKGROUND The prediction of bacteriophage sequences in metagenomic datasets has become a topic of considerable interest, leading to the development of many novel bioinformatic tools. A comparative analysis of ten state-of-the-art phage identification tools was performed to inform their usage in microbiome research. METHODS Artificial contigs generated from complete RefSeq genomes representing phages, plasmids, and chromosomes, and a previously sequenced mock community containing four phage species, were used to evaluate the precision, recall, and F1 scores of the tools. We also generated a dataset of randomly shuffled sequences to quantify false-positive calls. In addition, a set of previously simulated viromes was used to assess diversity bias in each tool's output. RESULTS VIBRANT and VirSorter2 achieved the highest F1 scores (0.93) in the RefSeq artificial contigs dataset, with several other tools also performing well. Kraken2 had the highest F1 score (0.86) in the mock community benchmark by a large margin (0.3 higher than DeepVirFinder in second place), mainly due to its high precision (0.96). Generally, k-mer-based tools performed better than reference similarity tools and gene-based methods. Several tools, most notably PPR-Meta, called a high number of false positives in the randomly shuffled sequences. When analysing the diversity of the genomes that each tool predicted from a virome set, most tools produced a viral genome set that had similar alpha- and beta-diversity patterns to the original population, with Seeker being a notable exception. CONCLUSIONS This study provides key metrics used to assess performance of phage detection tools, offers a framework for further comparison of additional viral discovery tools, and discusses optimal strategies for using these tools. 
We highlight that the choice of tool for identification of phages in metagenomic datasets, as well as their parameters, can bias the results and provide pointers for different use case scenarios. We have also made our benchmarking dataset available for download in order to facilitate future comparisons of phage identification tools. Video Abstract.
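The benchmark above ranks tools by precision, recall, and F1 score. For readers comparing the reported figures, the sketch below shows how the three metrics relate to raw counts of true positives, false positives, and false negatives. The function name and the example counts are hypothetical and do not reproduce any result from the study.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from prediction counts.

    precision = TP / (TP + FP), recall = TP / (TP + FN),
    F1 = harmonic mean of precision and recall.
    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical tool output: 80 phage contigs correctly called,
# 20 non-phage contigs wrongly called phage, 20 phage contigs missed.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, a tool with very high precision can still score modestly if its recall is low, so a single F1 value is best read alongside its two components.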
Affiliation(s)
- Siu Fung Stanley Ho
- Institute of Microbiology and Infection, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
| | - Nicole E. Wheeler
- Institute of Microbiology and Infection, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
| | - Andrew D. Millard
- Department of Genetics and Genome Biology, University of Leicester, Leicester, UK
| | - Willem van Schaik
- Institute of Microbiology and Infection, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
|
18
|
de Souza LC, Azevedo KS, de Souza JG, Barbosa RDM, Fernandes MAC. New proposal of viral genome representation applied in the classification of SARS-CoV-2 with deep learning. BMC Bioinformatics 2023; 24:92. [PMID: 36906520 PMCID: PMC10007673 DOI: 10.1186/s12859-023-05188-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2022] [Accepted: 02/15/2023] [Indexed: 03/13/2023] Open
Abstract
BACKGROUND In December 2019, the first case of COVID-19 was described in Wuhan, China, and by July 2022, there were already 540 million confirmed cases. Due to the rapid spread of the virus, the scientific community has made efforts to develop techniques for the viral classification of SARS-CoV-2. RESULTS In this context, we developed a new proposal for gene sequence representation using Genomic Signal Processing techniques. First, we applied the mapping approach to samples of six viral species of the Coronaviridae family, to which SARS-CoV-2 belongs. We then used the downsized sequences obtained by the proposed method in a deep learning architecture for viral classification, achieving accuracies of 98.35%, 99.08%, and 99.69% for viral signatures of size 64, 128, and 256, respectively, and 99.95% precision for the vectors of size 256. CONCLUSIONS The classification results obtained, in comparison to the results produced using other state-of-the-art representation techniques, demonstrate that the proposed mapping can provide satisfactory performance with low computational memory and processing time costs.
Affiliation(s)
- Luísa C. de Souza
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
| | - Karolayne S. Azevedo
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
| | - Jackson G. de Souza
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
| | - Raquel de M. Barbosa
- Department of Pharmacy and Pharmaceutical Technology, University of Granada, Granada, Spain
| | - Marcelo A. C. Fernandes
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
- Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
- Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal, RN 59078-970 Brazil
|
19
|
Elbasir A, Ye Y, Schäffer DE, Hao X, Wickramasinghe J, Tsingas K, Lieberman PM, Long Q, Morris Q, Zhang R, Schäffer AA, Auslander N. A deep learning approach reveals unexplored landscape of viral expression in cancer. Nat Commun 2023; 14:785. [PMID: 36774364 PMCID: PMC9922274 DOI: 10.1038/s41467-023-36336-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 08/20/2022] [Accepted: 01/25/2023] [Indexed: 02/13/2023] Open
Abstract
About 15% of human cancer cases are attributed to viral infections. To date, virus expression in tumor tissues has been mostly studied by aligning tumor RNA sequencing reads to databases of known viruses. To allow identification of divergent viruses and rapid characterization of the tumor virome, we develop viRNAtrap, an alignment-free pipeline to identify viral reads and assemble viral contigs. We utilize viRNAtrap, which is based on a deep learning model trained to discriminate viral RNAseq reads, to explore viral expression in cancers and apply it to 14 cancer types from The Cancer Genome Atlas (TCGA). Using viRNAtrap, we uncover expression of unexpected and divergent viruses that have not previously been implicated in cancer and disclose human endogenous viruses whose expression is associated with poor overall survival. The viRNAtrap pipeline provides a way forward to study viral infections associated with different clinical conditions.
Affiliation(s)
| | - Ying Ye
- The Wistar Institute, Philadelphia, PA, 19104, USA
| | - Daniel E Schäffer
- The Wistar Institute, Philadelphia, PA, 19104, USA; Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
| | - Xue Hao
- The Wistar Institute, Philadelphia, PA, 19104, USA
| | | | - Konstantinos Tsingas
- The Wistar Institute, Philadelphia, PA, 19104, USA; University of Pennsylvania, Philadelphia, PA, USA
| | | | - Qi Long
- University of Pennsylvania, Philadelphia, PA, USA
| | - Quaid Morris
- Computational and Systems Biology, Sloan Kettering Institute, New York City, NY, 10065, USA
| | - Rugang Zhang
- The Wistar Institute, Philadelphia, PA, 19104, USA
| | - Alejandro A Schäffer
- Cancer Data Science Laboratory (CDSL), National Cancer Institute, National Institutes of Health, Bethesda, MD, 20892, USA
|
20
|
Schackart KE, Graham JB, Ponsero AJ, Hurwitz BL. Evaluation of computational phage detection tools for metagenomic datasets. Front Microbiol 2023; 14:1078760. [PMID: 36760501 PMCID: PMC9902911 DOI: 10.3389/fmicb.2023.1078760] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 10/24/2022] [Accepted: 01/09/2023] [Indexed: 01/25/2023] Open
Abstract
Introduction: As new computational tools for detecting phage in metagenomes are being rapidly developed, a critical need has emerged to develop systematic benchmarks. Methods: In this study, we surveyed 19 metagenomic phage detection tools, 9 of which could be installed and run at scale. Those 9 tools were assessed on several benchmark challenges. Fragmented reference genomes are used to assess the effects of fragment length, low viral content, phage taxonomy, robustness to eukaryotic contamination, and computational resource usage. Simulated metagenomes are used to assess the effects of sequencing and assembly quality on the tool performances. Finally, real human gut metagenomes and viromes are used to assess the differences and similarities in the phage communities predicted by the tools. Results: We find that the various tools yield strikingly different results. Generally, tools that use a homology approach (VirSorter, MARVEL, viralVerify, VIBRANT, and VirSorter2) demonstrate low false positive rates and robustness to eukaryotic contamination. Conversely, tools that use a sequence composition approach (VirFinder, DeepVirFinder, Seeker), and MetaPhinder, have higher sensitivity, including to phages with less representation in reference databases. These differences led to widely differing predicted phage communities in human gut metagenomes, with nearly 80% of contigs being marked as phage by at least one tool and a maximum overlap of 38.8% between any two tools. While the results were more consistent among the tools on viromes, the differences in results were still significant, with a maximum overlap of 60.65%. Discussion: Importantly, the benchmark datasets developed in this study are publicly available and reusable to enable the future comparability of new tools developed.
Affiliation(s)
- Kenneth E. Schackart
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States
| | - Jessica B. Graham
- BIO5 Institute, The University of Arizona, Tucson, AZ, United States
| | - Alise J. Ponsero
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States
- BIO5 Institute, The University of Arizona, Tucson, AZ, United States
- Human Microbiome Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
| | - Bonnie L. Hurwitz
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States
- BIO5 Institute, The University of Arizona, Tucson, AZ, United States
|
21
|
Bajiya N, Dhall A, Aggarwal S, Raghava GPS. Advances in the field of phage-based therapy with special emphasis on computational resources. Brief Bioinform 2023; 24:6961791. [PMID: 36575815 DOI: 10.1093/bib/bbac574] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/28/2022] [Revised: 11/07/2022] [Accepted: 11/25/2022] [Indexed: 12/29/2022] Open
Abstract
In the current era, one of the major challenges is to manage the treatment of drug/antibiotic-resistant strains of bacteria. Phage therapy, a century-old technique, may serve as an alternative to antibiotics in treating bacterial infections caused by drug-resistant strains of bacteria. In this review, a systematic attempt has been made to summarize phage-based therapy in depth. This review has been divided into the following two sections: general information and computer-aided phage therapy (CAPT). In the case of general information, we cover the history of phage therapy, the mechanism of action, the status of phage-based products (approved and clinical trials) and the challenges. This review emphasizes CAPT, where we have covered primary phage-associated resources, phage prediction methods and pipelines. This review covers a wide range of databases and resources, including viral genomes and proteins, phage receptors, host genomes of phages, phage-host interactions and lytic proteins. In the post-genomic era, identifying the most suitable phage for lysing a drug-resistant strain of bacterium is crucial for developing alternate treatments for drug-resistant bacteria and this remains a challenging problem. Thus, we compile all phage-associated prediction methods that include the prediction of phages for a bacterial strain, the host for a phage and the identification of interacting phage-host pairs. Most of these methods have been developed using machine learning and deep learning techniques. This review also discusses recent advances in the field of CAPT, where we briefly describe computational tools available for predicting phage virions, the life cycle of phages and prophage identification. Finally, we describe phage-based therapy's advantages, challenges and opportunities.
Affiliation(s)
- Nisha Bajiya
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
| | - Anjali Dhall
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
| | - Suchet Aggarwal
- Department of Computer Science and Engineering, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
| | - Gajendra P S Raghava
- Department of Computational Biology, Indraprastha Institute of Information Technology, Okhla Phase 3, New Delhi, 110020, India
|
22
|
He S, Gao B, Sabnis R, Sun Q. RNAdegformer: accurate prediction of mRNA degradation at nucleotide resolution with deep learning. Brief Bioinform 2023; 24:bbac581. [PMID: 36633966 PMCID: PMC9851316 DOI: 10.1093/bib/bbac581] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 09/14/2022] [Revised: 11/14/2022] [Accepted: 11/28/2022] [Indexed: 01/13/2023] Open
Abstract
Messenger RNA-based therapeutics have shown tremendous potential, as demonstrated by the rapid development of messenger RNA-based vaccines for COVID-19. Nevertheless, distribution of mRNA vaccines worldwide has been hampered by mRNA's inherent thermal instability due to in-line hydrolysis, a chemical degradation reaction. Therefore, predicting and understanding RNA degradation is a crucial and urgent task. Here we present RNAdegformer, an effective and interpretable model architecture that excels in predicting RNA degradation. RNAdegformer processes RNA sequences with self-attention and convolutions, two deep learning techniques that have proved dominant in the fields of computer vision and natural language processing, while utilizing biophysical features of RNA. We demonstrate that RNAdegformer outperforms previous best methods at predicting degradation properties at nucleotide resolution for COVID-19 mRNA vaccines. RNAdegformer predictions also exhibit improved correlation with RNA in vitro half-life compared with previous best methods. Additionally, we showcase how direct visualization of self-attention maps assists informed decision-making. Further, our model reveals important features in determining mRNA degradation rates via leave-one-feature-out analysis.
Affiliation(s)
- Shujun He
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| | - Baizhen Gao
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| | - Rushant Sabnis
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
| | - Qing Sun
- Department of Chemical Engineering, Texas A&M University, 100 Spence St, 77843, Texas, United States
|
23
|
Coutinho MG, Câmara GB, Barbosa RDM, Fernandes MA. SARS-CoV-2 virus classification based on stacked sparse autoencoder. Comput Struct Biotechnol J 2022; 21:284-298. [PMID: 36530948 PMCID: PMC9742810 DOI: 10.1016/j.csbj.2022.12.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2022] [Revised: 12/04/2022] [Accepted: 12/05/2022] [Indexed: 12/13/2022] Open
Abstract
Since December 2019, the world has been intensely affected by the COVID-19 pandemic, caused by SARS-CoV-2. When a novel virus is identified, early elucidation of the taxonomic classification and origin of the viral genomic sequence is essential for strategic planning, containment, and treatment. Deep learning techniques have been successfully used in many viral classification problems associated with viral infection diagnosis, metagenomics, phylogenetics, and analysis. With that motivation, the authors propose an efficient viral genome classifier for SARS-CoV-2 using a deep neural network based on the stacked sparse autoencoder (SSAE). For the best performance of the model, we explored the use of image representations of the complete genome sequences as the SSAE input to classify SARS-CoV-2. For that, a dataset based on k-mers image representation was applied. We performed four experiments to provide different levels of taxonomic classification of SARS-CoV-2. The SSAE technique provided great performance results in all experiments, achieving classification accuracy between 92% and 100% for the validation set and between 98.9% and 100% when the SARS-CoV-2 samples were applied for the test set. In this work, samples of SARS-CoV-2 were not used during the training process, only during subsequent tests, in which the model was able to infer the correct classification of the samples in the vast majority of cases. This indicates that our model can be adapted to classify other emerging viruses. Finally, the results indicated the applicability of this deep learning technique in genome classification problems.
Affiliation(s)
- Maria G.F. Coutinho
- Laboratory of Machine Learning and Intelligent Instrumentation, IMD/nPITI, Federal University of Rio Grande do Norte, Natal, Brazil
| | - Gabriel B.M. Câmara
- Laboratory of Machine Learning and Intelligent Instrumentation, IMD/nPITI, Federal University of Rio Grande do Norte, Natal, Brazil
| | - Raquel de M. Barbosa
- Department of Pharmacy and Pharmaceutical Technology, University of Granada, 18071 Granada, Spain
| | - Marcelo A.C. Fernandes
- Laboratory of Machine Learning and Intelligent Instrumentation, IMD/nPITI, Federal University of Rio Grande do Norte, Natal, Brazil
- Department of Computer and Automation Engineering, Federal University of Rio Grande do Norte, Natal, Brazil
|
24
|
Uddin M, Islam MK, Hassan MR, Jahan F, Baek JH. A fast and efficient algorithm for DNA sequence similarity identification. Complex Intell Syst 2022; 9:1265-1280. [PMID: 36035628 PMCID: PMC9395857 DOI: 10.1007/s40747-022-00846-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/02/2021] [Accepted: 08/05/2022] [Indexed: 11/22/2022]
Abstract
DNA sequence similarity analysis is necessary for many purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for small homologous sequences but becomes an NP-hard problem for long sequences. AF algorithms can overcome the major limitations of AB, but most existing AF methods show high time complexity and memory consumption, lower precision, and lower performance on benchmark datasets. To minimize these limitations, we develop an AF algorithm using a 2D k-mer count matrix inspired by the CGR approach. We then shrink the matrix by analyzing the neighbors and measure similarities using the best combinations of pairwise distance (PD) and phylogenetic tree methods. We also dynamically choose the value of k for the k-mers. We develop an efficient system for finding the positions of k-mers in the count matrix. We apply our system to six different datasets. We achieve the top rank for two benchmark datasets from AFproject, 100% accuracy for two datasets (16S Ribosomal, 18 Eutherian), and achieve a milestone for time complexity and memory consumption in comparison to the existing study datasets (HEV, HIV-1). Therefore, the comparative results of the benchmark datasets and existing studies demonstrate that our method is highly effective, efficient, and accurate. Thus, our method can be used with the top level of authenticity for DNA sequence similarity measurement.
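To make the idea of a 2D k-mer count matrix laid out in a CGR-like fashion more concrete, the following is a minimal sketch, not the authors' algorithm. It assumes CGR refers to chaos game representation, and the corner assignment in `CORNERS`, the bit ordering, and the function names `kmer_cell` and `kmer_count_matrix` are all hypothetical choices made for illustration. The matrix shrinking and dynamic choice of k described in the abstract are not attempted here.

```python
# Hypothetical corner assignment (x_bit, y_bit) for each nucleotide;
# other CGR conventions place the letters on different corners.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_cell(kmer):
    """Map a k-mer to a (row, col) cell of a 2^k x 2^k grid."""
    row = col = 0
    # Process the k-mer from its last base to its first, so the last base
    # supplies the most significant bit and selects the coarsest quadrant.
    for base in reversed(kmer):
        x_bit, y_bit = CORNERS[base]
        col = (col << 1) | x_bit
        row = (row << 1) | y_bit
    return row, col


def kmer_count_matrix(seq, k):
    """Count every overlapping k-mer of seq into a 2^k x 2^k matrix."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip k-mers containing ambiguous symbols such as N.
        if any(base not in CORNERS for base in kmer):
            continue
        row, col = kmer_cell(kmer)
        matrix[row][col] += 1
    return matrix
```

Because each k-mer maps to its cell with k bit operations, the position of a k-mer in the matrix can be found without searching, which is one plausible reading of the "efficient system for finding the positions" mentioned above.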
|
25
|
Câmara GBM, Coutinho MGF, da Silva LMD, Gadelha WVDN, Torquato MF, Barbosa RDM, Fernandes MAC. Convolutional Neural Network Applied to SARS-CoV-2 Sequence Classification. Sensors (Basel, Switzerland) 2022; 22:5730. [PMID: 35957287 PMCID: PMC9371030 DOI: 10.3390/s22155730] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2022] [Revised: 07/28/2022] [Accepted: 07/28/2022] [Indexed: 06/15/2023]
Abstract
COVID-19, the illness caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a virus with a single-stranded positive-sense RNA genome belonging to the Coronaviridae family, has been spreading around the world and has been declared a pandemic by the World Health Organization. On 17 January 2022, there were more than 329 million cases, with more than 5.5 million deaths. Although COVID-19 has a low mortality rate, its high capacities for contamination, spread, and mutation worry the authorities, especially after the emergence of the Omicron variant, which has a high transmission capacity and can more easily contaminate even vaccinated people. Such outbreaks require elucidation of the taxonomic classification and origin of the virus (SARS-CoV-2) from the genomic sequence for strategic planning, containment, and treatment of the disease. Thus, this work proposes a high-accuracy technique to classify viruses and other organisms from a genome sequence using a deep learning convolutional neural network (CNN). Unlike other approaches in the literature, the proposed approach does not limit the length of the genome sequence. The results show that the novel proposal accurately distinguishes SARS-CoV-2 from the sequences of other viruses. The results were obtained from 1557 instances of SARS-CoV-2 from the National Center for Biotechnology Information (NCBI) and 14,684 different viruses from the Virus-Host DB. As a CNN has several changeable parameters, the tests were performed with forty-eight different architectures; the best of these had an accuracy of 91.94 ± 2.62% in classifying viruses into their realms correctly, in addition to 100% accuracy in classifying SARS-CoV-2 into its respective realm, Riboviria. For the subsequent classifications (family, genus, and subgenus), this accuracy increased, which shows that the proposed architecture may be viable in the classification of the virus that causes COVID-19.
Affiliation(s)
- Gabriel B. M. Câmara
- Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
| | - Maria G. F. Coutinho
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
| | - Lucileide M. D. da Silva
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Federal Institute of Education, Science and Technology of Rio Grande do Norte, Paraiso, Santa Cruz 59200-000, RN, Brazil
| | - Walter V. do N. Gadelha
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
| | - Matheus F. Torquato
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
| | - Raquel de M. Barbosa
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Department of Pharmacy and Pharmaceutical Technology, University of Granada, 18071 Granada, Spain
| | - Marcelo A. C. Fernandes
- Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
|
26
|
Abstract
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.
|
27
|
Krishnamoorthy M, Ranjan P, Erb-Downward JR, Dickson RP, Wiens J. AMAISE: a machine learning approach to index-free sequence enrichment. Commun Biol 2022; 5:568. [PMID: 35681015 PMCID: PMC9184628 DOI: 10.1038/s42003-022-03498-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2021] [Accepted: 05/18/2022] [Indexed: 11/21/2022] Open
Abstract
Metagenomics holds potential to improve clinical diagnostics of infectious diseases, but DNA from clinical specimens is often dominated by host-derived sequences. To address this, researchers employ host-depletion methods. Laboratory-based host-depletion methods, however, are costly in terms of time and effort, while computational host-depletion methods rely on memory-intensive reference index databases and struggle to accurately classify noisy sequence data. To solve these challenges, we propose an index-free tool, AMAISE (A Machine Learning Approach to Index-Free Sequence Enrichment). Applied to the task of separating host from microbial reads, AMAISE achieves over 98% accuracy. Applied prior to metagenomic classification, AMAISE results in a 14-18% decrease in memory usage compared to using metagenomic classification alone. Our results show that a reference-independent machine learning approach to host depletion allows for accurate and efficient sequence detection.
Affiliation(s)
- Meera Krishnamoorthy
- Division of Computer Science and Engineering, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA
| | - Piyush Ranjan
- Division of Pulmonary & Critical Care Medicine, Department of Medicine, University of Michigan, Ann Arbor, MI, USA
| | - John R Erb-Downward
- Division of Pulmonary & Critical Care Medicine, Department of Medicine, University of Michigan, Ann Arbor, MI, USA
- Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
| | - Robert P Dickson
- Division of Pulmonary & Critical Care Medicine, Department of Medicine, University of Michigan, Ann Arbor, MI, USA
- Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
- Max Harry Weil Institute for Critical Care Research and Innovation, University of Michigan, Ann Arbor, MI, USA
| | - Jenna Wiens
- Division of Computer Science and Engineering, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA.
|
28
|
Rickert CA, Lieleg O. Machine learning approaches for biomolecular, biophysical, and biomaterials research. Biophysics Reviews 2022; 3:021306. [PMID: 38505413 PMCID: PMC10914139 DOI: 10.1063/5.0082179] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/13/2021] [Accepted: 05/12/2022] [Indexed: 03/21/2024]
Abstract
A fluent conversation with a virtual assistant, person-tailored news feeds, and deep-fake images created within seconds: all these things, unthinkable for a long time, are now part of our everyday lives. What these examples have in common is that they are realized by different means of machine learning (ML), a technology that has fundamentally changed many aspects of the modern world. The possibility to process enormous amounts of data in multi-hierarchical, digital constructs has paved the way not only for creating intelligent systems but also for obtaining surprising new insight into many scientific problems. However, in the different areas of biosciences, which typically rely heavily on the collection of time-consuming experimental data, applying ML methods is a bit more challenging: here, difficulties can arise from small datasets and the inherent, broad variability and complexity associated with studying biological objects and phenomena. In this Review, we give an overview of commonly used ML algorithms (which are often referred to as "machines") and learning strategies as well as their applications in different bio-disciplines such as molecular biology, drug development, biophysics, and biomaterials science. We highlight how selected research questions from those fields were successfully translated into machine-readable formats, discuss typical problems that can arise in this context, and provide an overview of how to resolve those encountered difficulties.
|
29
|
Sukhorukov G, Khalili M, Gascuel O, Candresse T, Marais-Colombel A, Nikolski M. VirHunter: A Deep Learning-Based Method for Detection of Novel RNA Viruses in Plant Sequencing Data. Frontiers in Bioinformatics 2022; 2:867111. [PMID: 36304258 PMCID: PMC9580956 DOI: 10.3389/fbinf.2022.867111] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 01/31/2022] [Accepted: 03/24/2022] [Indexed: 10/15/2023] Open
Abstract
High-throughput sequencing has provided the capacity for broad virus detection for both known and unknown viruses in a variety of hosts and habitats. It has been successfully applied for novel virus discovery in many agricultural crops, leading to the current drive to apply this technology routinely for plant health diagnostics. For this, efficient and precise methods for sequencing-based virus detection and discovery are essential. However, both existing alignment-based methods relying on reference databases and even more recent machine learning approaches are not efficient enough in detecting unknown viruses in RNAseq datasets of plant viromes. We present VirHunter, a deep learning convolutional neural network approach, to detect novel and known viruses in assemblies of sequencing datasets. While our method is generally applicable to a variety of viruses, here, we trained and evaluated it specifically for RNA viruses by reinforcing the coding sequences' content in the training dataset. Trained on the NCBI plant viruses data for three different host species (peach, grapevine, and sugar beet), VirHunter outperformed the state-of-the-art method, DeepVirFinder, for the detection of novel viruses, both in the synthetic leave-out setting and on the 12 newly acquired RNAseq datasets. Compared with the traditional tBLASTx approach, VirHunter has consistently exhibited better results in the majority of leave-out experiments. In conclusion, we have shown that VirHunter can be used to streamline the analyses of plant HTS-acquired viromes and is particularly well suited for the detection of novel viral contigs in RNAseq datasets.
Affiliation(s)
- Grigorii Sukhorukov
- CNRS, IBGC, UMR 5095, Université de Bordeaux, Bordeaux, France
- Bordeaux Bioinformatics Center, Université de Bordeaux, Bordeaux, France
| | - Maryam Khalili
- Université de Bordeaux, INRAE, UMR BFP, CS20032, CEDEX, Villenave d’Ornon, France
| | - Olivier Gascuel
- Institut de Systématique, Biodiversité, Evolution (ISYEB - UMR7205, Muséum National d’Histoire Naturelle, CNRS, SU, EPHE, UA), Paris, France
| | - Thierry Candresse
- Université de Bordeaux, INRAE, UMR BFP, CS20032, CEDEX, Villenave d’Ornon, France
| | | | - Macha Nikolski
- CNRS, IBGC, UMR 5095, Université de Bordeaux, Bordeaux, France
- Bordeaux Bioinformatics Center, Université de Bordeaux, Bordeaux, France
|
30
|
Liu F, Miao Y, Liu Y, Hou T. RNN-VirSeeker: A Deep Learning Method for Identification of Short Viral Sequences From Metagenomes. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1840-1849. [PMID: 33315571 DOI: 10.1109/tcbb.2020.3044575] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 06/12/2023]
Abstract
Viruses are the most abundant biological entities on earth, and play vital roles in many aspects of microbial communities. As major human pathogens, viruses have caused huge mortality and morbidity in human society throughout history. Metagenomic sequencing methods can capture all microorganisms from microbiota, with sequences of viruses mixed with those of other species. Therefore, it is necessary to identify viral sequences from metagenomes. However, existing methods perform poorly on identifying short viral sequences. To solve this problem, a deep learning-based method, RNN-VirSeeker, is proposed in this paper. RNN-VirSeeker was trained on 500 bp sequences sampled from known Virus and Host RefSeq genomes. Experimental results on the testing set have shown that RNN-VirSeeker exhibited AUROC of 0.9175, recall of 0.8640 and precision of 0.9211 for sequences of 500 bp, and outperformed three widely used methods, VirSorter, VirFinder, and DeepVirFinder, on identifying short viral sequences. RNN-VirSeeker was also used to identify viral sequences from a CAMI dataset and a human gut metagenome. Compared with DeepVirFinder, RNN-VirSeeker identified more viral sequences from these metagenomes and achieved greater values of AUPRC and AUROC. RNN-VirSeeker is freely available at https://github.com/crazyinter/RNN-VirSeeker.
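The training-data step mentioned above, sampling fixed-length 500 bp fragments from reference genomes, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how fragments were drawn, so uniform random start positions, the function name `sample_fragments`, and its parameters are hypothetical and are not taken from the RNN-VirSeeker code.

```python
import random


def sample_fragments(genome, length=500, n=10, seed=0):
    """Draw n fragments of a fixed length from a genome string.

    Start positions are chosen uniformly at random (an assumption made
    for this sketch); fragments may overlap each other.
    """
    if len(genome) < length:
        return []  # genome too short to yield a full-length fragment
    rng = random.Random(seed)  # seeded for reproducibility
    fragments = []
    for _ in range(n):
        # randint is inclusive at both ends, so the last valid start is
        # len(genome) - length.
        start = rng.randint(0, len(genome) - length)
        fragments.append(genome[start:start + length])
    return fragments
```

Applying the same function to viral and host genomes would yield the two labeled classes; how the classes were balanced in the original study is not stated in the abstract.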
|
31
|
Miao Y, Liu F, Hou T, Liu Y. Virtifier: a deep learning-based identifier for viral sequences from metagenomes. Bioinformatics 2022; 38:1216-1222. [PMID: 34908121 DOI: 10.1093/bioinformatics/btab845] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 12/02/2020] [Revised: 11/13/2021] [Accepted: 12/13/2021] [Indexed: 01/05/2023] Open
Abstract
MOTIVATION: Viruses, the most abundant biological entities on earth, are important components of microbial communities, and as major human pathogens, they are responsible for human mortality and morbidity. The identification of viral sequences from metagenomes is critical for viral analysis. As massive quantities of short sequences are generated by next-generation sequencing, most methods utilize discrete and sparse one-hot vectors to encode nucleotide sequences, which are usually ineffective in viral identification. RESULTS: In this article, Virtifier, a deep learning-based viral identifier for sequences from metagenomic data, is proposed. It includes a meaningful nucleotide sequence encoding method named Seq2Vec and a variant viral sequence predictor with an attention-based long short-term memory (LSTM) network. By utilizing a fully trained embedding matrix to encode codons, Seq2Vec can efficiently extract the relationships among those codons in a nucleotide sequence. Combined with an attention layer, the LSTM neural network can further analyze the codon relationships and sift the parts that contribute to the final features. Experimental results on three datasets have shown that Virtifier can accurately identify short viral sequences (<500 bp) from metagenomes, surpassing three widely used methods, VirFinder, DeepVirFinder and PPR-Meta. Meanwhile, a comparable performance was achieved by Virtifier at longer lengths (>5000 bp). AVAILABILITY AND IMPLEMENTATION: A Python implementation of Virtifier and the Python code developed for this study have been provided on GitHub at https://github.com/crazyinter/Seq2Vec. The RefSeq genomes in this article are available in VirFinder at https://dx.doi.org/10.1186/s40168-017-0283-5. The CAMI Challenge Dataset 3 CAMI_high dataset in this article is available in CAMI at https://data.cami-challenge.org/participate. The real human gut metagenomes in this article are available at https://dx.doi.org/10.1101/gr.142315.112.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
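The codon-level encoding described in this abstract can be illustrated with a minimal Python sketch. It only shows the tokenization step that would precede an embedding lookup; it is not the Seq2Vec implementation. The non-overlapping triplet split, the reserved id 0 for padding and ambiguous codons, and the names `CODON_TO_ID` and `codon_ids` are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical vocabulary: the 64 codons get ids 1..64; id 0 is reserved
# for padding and for triplets containing ambiguous bases such as N.
CODON_TO_ID = {"".join(c): i + 1 for i, c in enumerate(product("ACGT", repeat=3))}


def codon_ids(seq, max_len=None):
    """Split a nucleotide sequence into non-overlapping codons and map them to ids.

    A trailing partial codon (1 or 2 bases) is dropped. If max_len is given,
    the result is truncated or right-padded with 0 to exactly max_len ids.
    """
    seq = seq.upper()
    ids = [CODON_TO_ID.get(seq[i:i + 3], 0) for i in range(0, len(seq) - 2, 3)]
    if max_len is not None:
        ids = ids[:max_len] + [0] * (max_len - len(ids))
    return ids


print(codon_ids("ATGAAATAG"))
```

Each id would index one row of a trainable embedding matrix, so a codon is represented by a dense learned vector rather than a sparse one-hot vector.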
Affiliation(s)
- Yan Miao
- College of Communication Engineering, Jilin University, Changchun 130022, China
- Fu Liu
- College of Communication Engineering, Jilin University, Changchun 130022, China
- Tao Hou
- College of Communication Engineering, Jilin University, Changchun 130022, China
- Yun Liu
- College of Communication Engineering, Jilin University, Changchun 130022, China
|
32
|
Millán Arias P, Alipour F, Hill KA, Kari L. DeLUCS: Deep learning for unsupervised clustering of DNA sequences. PLoS One 2022; 17:e0261531. [PMID: 35061715 PMCID: PMC8782307 DOI: 10.1371/journal.pone.0261531] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 05/03/2021] [Accepted: 12/06/2021] [Indexed: 11/25/2022] Open
Abstract
We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary DNA sequences, and generates "mimic" sequence FCGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster assignment for each sequence. The clusters learned by DeLUCS match true taxonomic groups for large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into clusters corresponding to bacterial families; three viral genome and gene datasets, averaging 1,300 sequences each, into clusters corresponding to virus subtypes. DeLUCS significantly outperforms two classic clustering methods (K-means++ and Gaussian Mixture Models) for unlabelled data, by as much as 47%. DeLUCS is highly effective: it can cluster datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence clustering for previously intractable datasets.
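As a rough illustration of the input representation named in this abstract, the following Python sketch builds a Frequency Chaos Game Representation as a 2^k x 2^k grid of k-mer counts. It is a minimal sketch, not the DeLUCS code: the corner assignment in `CORNERS`, the choice to skip k-mers containing ambiguous bases, and the function name `fcgr` are assumptions, and other corner conventions are in use.

```python
# Assumed corner convention (x, y) for the four nucleotides; conventions vary.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k):
    """Return a 2^k x 2^k grid (list of rows) of k-mer counts for seq.

    Each k-mer is mapped to one cell. As in the chaos game, the most recent
    nucleotide selects the coarsest quadrant, so the k-mer is read backwards
    when building the cell index. K-mers with non-ACGT symbols are skipped.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if any(base not in CORNERS for base in kmer):
            continue
        x = y = 0
        for base in reversed(kmer):
            cx, cy = CORNERS[base]
            x = 2 * x + cx
            y = 2 * y + cy
        grid[y][x] += 1
    return grid


for row in fcgr("ACGTACGTTTGA", 2):
    print(row)
```

The resulting grid can be treated like a small grayscale image, which is why image-style neural network inputs are a natural fit for this representation.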
Affiliation(s)
- Pablo Millán Arias
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Fatemeh Alipour
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Kathleen A. Hill
- Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
|
33
|
Rani G, Oza MG, Dhaka VS, Pradhan N, Verma S, Rodrigues JJPC. Applying deep learning-based multi-modal for detection of coronavirus. Multimedia Systems 2022; 28:1251-1262. [PMID: 34305327 PMCID: PMC8294320 DOI: 10.1007/s00530-021-00824-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 09/21/2020] [Accepted: 06/20/2021] [Indexed: 05/11/2023]
Abstract
Amidst the global pandemic and catastrophe created by 'COVID-19', research institutions and scientists are making every effort to find a vaccine or medicine for the disease. The objective of this research is to design and develop a deep learning-based multi-modal model for the screening of COVID-19 using chest radiographs and genomic sequences. The model is also effective in finding the degree of genomic similarity between Severe Acute Respiratory Syndrome-Coronavirus 2 and other prevalent viruses such as Severe Acute Respiratory Syndrome-Coronavirus, Middle East Respiratory Syndrome-Coronavirus, Human Immunodeficiency Virus, and Human T-cell Leukaemia Virus. The experimental results on the datasets available at the National Centre for Biotechnology Information, GitHub, and Kaggle repositories show that it is successful in detecting the genome of 'SARS-CoV-2' in the host genome with an accuracy of 99.27% and in screening chest radiographs into COVID-19, non-COVID pneumonia and healthy with a sensitivity of 95.47%. Thus, it may prove a useful tool for doctors to quickly classify infected and non-infected genomes. It can also be useful in finding the most effective drug from the available drugs for the treatment of 'COVID-19'.
Affiliation(s)
- Geeta Rani
- Department of Computer and Communication Engineering, Manipal University Jaipur, Jaipur, Rajasthan, India
- Meet Ganpatlal Oza
- Department of Computer and Communication Engineering, Manipal University Jaipur, Jaipur, Rajasthan, India
- Vijaypal Singh Dhaka
- Department of Computer and Communication Engineering, Manipal University Jaipur, Jaipur, Rajasthan, India
- Nitesh Pradhan
- Department of Computer Science and Engineering, Manipal University Jaipur, Jaipur, Rajasthan, India
- Sahil Verma
- Department of Computer Science and Engineering, Chandigarh University, Mohali, 140413, India
- Joel J. P. C. Rodrigues
- Federal University of Piauí (UFPI), Teresina, PI, Brazil
- Instituto de Telecomunicações, Aveiro, Portugal
|
34
|
Deif MA, Solyman AAA, Kamarposhti MA, Band SS, Hammam RE. A deep bidirectional recurrent neural network for identification of SARS-CoV-2 from viral genome sequences. Mathematical Biosciences and Engineering: MBE 2021; 18:8933-8950. [PMID: 34814329 DOI: 10.3934/mbe.2021440] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 06/13/2023]
Abstract
In this work, Deep Bidirectional Recurrent Neural Network (BRNN) models were implemented based on both Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells in order to distinguish between genome sequences of SARS-CoV-2 and other coronavirus strains such as SARS-CoV and MERS-CoV, Common Cold and other Acute Respiratory Infection (ARI) viruses. An investigation of the hyper-parameters, including the optimizer type and the number of unit cells, was also performed to attain the best performance of the BRNN models. Results showed that the GRU BRNN model was able to discriminate between SARS-CoV-2 and other classes of viruses with a higher overall classification accuracy (96.8%) than the LSTM BRNN model (95.8%). The best hyper-parameters for both models were obtained with the SGD optimizer and 80 unit cells. This study showed that the proposed GRU BRNN model has a better classification ability for SARS-CoV-2, thus providing an efficient tool to help in containing the disease and achieving better clinical decisions with high precision.
Affiliation(s)
- Mohanad A Deif
- Department of Bioelectronics, Modern University of Technology and Information (MTI) University, Cairo 11571, Egypt
- Ahmed A A Solyman
- Department of Electrical and Electronics Engineering, Istanbul Gelisim University, Avcılar 34310, Turkey
- Shahab S Band
- Future Technology Research Center, College of Future, National Yunlin University of Science and Technology, 123 University Road, Yunlin 64002, Taiwan
- Rania E Hammam
- Department of Bioelectronics, Modern University of Technology and Information (MTI) University, Cairo 11571, Egypt
|
35
|
Utilizing the VirIdAl Pipeline to Search for Viruses in the Metagenomic Data of Bat Samples. Viruses 2021; 13:v13102006. [PMID: 34696436 PMCID: PMC8541124 DOI: 10.3390/v13102006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/30/2021] [Revised: 09/30/2021] [Accepted: 10/02/2021] [Indexed: 12/27/2022] Open
Abstract
According to various estimates, only a small percentage of existing viruses have been discovered, and even fewer are represented in genomic databases. High-throughput sequencing technologies are developing rapidly, empowering large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms are yet to be attributed specific loci for identification. This problem particularly impedes viral screening, due to vast heterogeneity in viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, alignment-based methods were unable to identify the taxon for a large proportion of reads; we therefore applied other approaches and showed that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in studies of viral diversity, and therefore necessitates the use of combined approaches, including those based on machine learning methods.
|
36
|
Garneau JR, Legrand V, Marbouty M, Press MO, Vik DR, Fortier LC, Sullivan MB, Bikard D, Monot M. High-throughput identification of viral termini and packaging mechanisms in virome datasets using PhageTermVirome. Sci Rep 2021; 11:18319. [PMID: 34526611 PMCID: PMC8443750 DOI: 10.1038/s41598-021-97867-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/31/2021] [Accepted: 08/27/2021] [Indexed: 11/13/2022] Open
Abstract
Viruses that infect bacteria (phages) are increasingly recognized for their importance in diverse ecosystems, but identifying and annotating them in large-scale sequence datasets is still challenging. Although efficient scalable virus identification tools are emerging, defining the exact ends (termini) of phage genomes is still particularly difficult. The proper identification of termini is crucial, as it helps in characterizing the packaging mechanism of bacteriophages and provides information on various aspects of phage biology. Here, we introduce PhageTermVirome (PTV) as a tool for the easy and rapid high-throughput determination of phage termini and packaging mechanisms using modern large-scale metagenomics datasets. We successfully tested the PTV algorithm on a mock virome dataset and then used it on two real virome datasets to achieve the rapid identification of more than 100 phage termini and packaging mechanisms, with just a few hours of computing time. Because PTV allows the identification of free fully formed viral particles (by recognition of termini present only in encapsidated DNA), it can also complement other virus identification software to predict the true viral origin of contigs in viral metagenomics datasets. PTV is a novel and unique tool for high-throughput characterization of phage genomes, including phage termini identification and characterization of genome packaging mechanisms. This software should help researchers better visualize, map and study the virosphere. PTV is freely available for downloading and installation at https://gitlab.pasteur.fr/vlegrand/ptv.
Affiliation(s)
- Véronique Legrand
- Infrastructure et Ingénierie Scientifique, Institut Pasteur, 75015, Paris, France
- Martial Marbouty
- Institut Pasteur, Unité Régulation Spatiale des Génomes, UMR 3525, CNRS, 75015, Paris, France
- Dean R Vik
- Department of Microbiology, Ohio State University, Columbus, OH, 43210, USA
- Louis-Charles Fortier
- Faculty of Medicine and Health Sciences, Department of Microbiology and Infectious Diseases, Université de Sherbrooke, Sherbrooke, QC, J1E 4K8, Canada
- Matthew B Sullivan
- Department of Microbiology, Ohio State University, Columbus, OH, 43210, USA
- David Bikard
- Département de Microbiologie, Institut Pasteur, Groupe Biologie de Synthèse, 75015, Paris, France
- Marc Monot
- Biomics Platform, C2RT, Institut Pasteur, 75015, Paris, France
|
37
|
Yakimovich A. Machine Learning and Artificial Intelligence for the Prediction of Host-Pathogen Interactions: A Viral Case. Infect Drug Resist 2021; 14:3319-3326. [PMID: 34456575 PMCID: PMC8385421 DOI: 10.2147/idr.s292743] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/29/2021] [Accepted: 08/03/2021] [Indexed: 01/27/2023] Open
Abstract
Research on the interactions between pathogens and their hosts is key to understanding the biology of infection. Beginning at the level of individual molecules, these interactions define the behavior of infectious agents and the outcomes they elicit. Discovery of host-pathogen interactions (HPIs) conventionally involves a stepwise, laborious research process. Yet, amid the global pandemic, the need to accelerate discovery through novel computational methodologies has become ever more pressing. This review explores the challenges of HPI discovery and investigates the efforts currently undertaken to apply the latest machine learning (ML) and artificial intelligence (AI) methodologies to this field. This includes applications to molecular and genetic data, as well as image and language data. Furthermore, a number of breakthroughs and obstacles, along with the prospects of AI for HPIs, are discussed.
|
38
|
Dasari CM, Bhukya R. Explainable deep neural networks for novel viral genome prediction. Appl Intell 2021; 52:3002-3017. [PMID: 34764607 PMCID: PMC8232563 DOI: 10.1007/s10489-021-02572-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Accepted: 05/26/2021] [Indexed: 11/27/2022]
Abstract
Viral infection causes a wide variety of human diseases including cancer and COVID-19. Viruses invade host cells and associate with host molecules, potentially disrupting the normal function of hosts, which can lead to fatal diseases. Novel viral genome prediction is crucial for understanding complex viral diseases such as AIDS and Ebola. While most existing computational techniques classify viral genomes, the efficiency of the classification depends solely on the structural features extracted. State-of-the-art DNN models have achieved excellent performance by automatic extraction of classification features, but the degree of model explainability is relatively poor. During model training for viral prediction, the proposed CNN- and CNN-LSTM-based methods (EdeepVPP and EdeepVPP-hybrid) automatically extract features. EdeepVPP also supports model interpretation in order to extract, through learned filters, the most important patterns that characterize viral genomes. It is an interpretable CNN model that extracts vital biologically relevant patterns (features) from feature maps of viral sequences. The EdeepVPP-hybrid predictor outperforms all the existing methods by achieving 0.992 mean AUC-ROC and 0.990 AUC-PR on 19 human metagenomic contig experiment datasets using 10-fold cross-validation. We evaluate the ability of CNN filters to detect patterns across high average activation values. To further assess the robustness of the EdeepVPP model, we perform leave-one-experiment-out cross-validation. It can work as a recommendation system to further analyze the raw sequences labeled as ‘unknown’ by alignment-based methods. We show that our interpretable model can extract patterns that are considered to be the most important features for predicting virus sequences through learned filters.
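The leave-one-experiment-out evaluation mentioned in this abstract differs from ordinary k-fold cross-validation in that all samples from one experiment are held out together. A minimal Python sketch of such a split is given below; it assumes each sample carries an experiment label, and the function name `leave_one_group_out` and the example labels are hypothetical, not taken from the paper.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each group.

    groups[i] is the experiment label of sample i. Every sample from the
    held-out experiment goes to the test set, so no experiment contributes
    to both training and testing within a split.
    """
    for held_out in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == held_out]
        train = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train, test


# Hypothetical experiment labels for five contigs.
experiment_of_sample = ["exp1", "exp1", "exp2", "exp3", "exp2"]
for held_out, train, test in leave_one_group_out(experiment_of_sample):
    print(held_out, train, test)
```

With 19 experiments this scheme would produce 19 splits, each testing on data from an experiment the model never saw during training.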
Affiliation(s)
- Raju Bhukya
- National Institute of Technology, Warangal, Telangana 506004, India
|
39
|
Ma H, Tan TW, Ban KHK. A multi-task CNN learning model for taxonomic assignment of human viruses. BMC Bioinformatics 2021; 22:194. [PMID: 34078269 PMCID: PMC8170063 DOI: 10.1186/s12859-021-04084-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/14/2021] [Accepted: 03/16/2021] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND: Taxonomic assignment is a key step in the identification of human viral pathogens. Current tools for taxonomic assignment from sequencing reads based on alignment or alignment-free k-mer approaches may not perform optimally in cases where the sequences diverge significantly from the reference sequences. Furthermore, many tools may not incorporate the genomic coverage of assigned reads as part of the overall likelihood of a correct taxonomic assignment for a sample. RESULTS: In this paper, we describe the development of a pipeline that incorporates a multi-task learning model based on a convolutional neural network (MT-CNN) and a Bayesian ranking approach to identify and rank the most likely human virus from sequence reads. For taxonomic assignment of reads, the MT-CNN model outperformed Kraken 2, Centrifuge, and Bowtie 2 on reads generated from simulated divergent HIV-1 genomes and was more sensitive in identifying SARS as the closest relation in four RNA sequencing datasets for SARS-CoV-2 virus. For genomic region assignment of assigned reads, the MT-CNN model performed competitively compared with Bowtie 2, and the region assignments were used to estimate genomic coverage, which was incorporated into a naïve Bayesian network together with the proportion of taxonomic assignments to rank the likelihood of candidate human viruses from sequence data. CONCLUSIONS: We have developed a pipeline that combines a novel MT-CNN model, which is able to identify viruses with divergent sequences and assign the genomic region, with a Bayesian approach to ranking taxonomic assignments that takes into account both the number of assigned reads and genomic coverage. The pipeline is available on GitHub at https://github.com/MaHaoran627/CNN_Virus.
Affiliation(s)
- Haoran Ma
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 117592 Singapore, Singapore
- Tin Wee Tan
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 117592 Singapore, Singapore
- National Supercomputing Centre (NSCC), 138632 Singapore, Singapore
- Kenneth Hon Kim Ban
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 117592 Singapore, Singapore
- National Supercomputing Centre (NSCC), 138632 Singapore, Singapore
|
40
|
Stearrett N, Dawson T, Rahnavard A, Bachali P, Bendall ML, Zeng C, Caricchio R, Pérez-Losada M, Grammer AC, Lipsky PE, Crandall KA. Expression of Human Endogenous Retroviruses in Systemic Lupus Erythematosus: Multiomic Integration With Gene Expression. Front Immunol 2021; 12:661437. [PMID: 33986751 PMCID: PMC8112243 DOI: 10.3389/fimmu.2021.661437] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/30/2021] [Accepted: 04/12/2021] [Indexed: 11/20/2022] Open
Abstract
Systemic lupus erythematosus (SLE) is a chronic autoimmune disease characterized by the production of autoantibodies predominantly to nuclear material. Many aspects of disease pathology are mediated by the deposition of nucleic acid-containing immune complexes, which also induce the type 1 interferon response, a characteristic feature of SLE. Notably, SLE is remarkably heterogeneous, with a variety of organs involved in different individuals, who also show variation in disease severity related to their ancestries. Here, we probed one potential contribution to disease heterogeneity as well as a possible source of immunoreactive nucleic acids by exploring the expression of human endogenous retroviruses (HERVs). We investigated the expression of HERVs in SLE and their potential relationship to SLE features and the expression of biochemical pathways, including the interferon gene signature (IGS). Towards this goal, we analyzed available and new RNA-Seq data from two independent whole blood studies using Telescope. We identified 481 locus-specific HERV-encoding regions that are differentially expressed between case and control individuals, with only 14% overlap of differentially expressed HERVs between these two datasets. We identified significant differences between differentially expressed HERVs and non-differentially expressed HERVs between the two datasets. We also characterized the host differentially expressed genes and tested their association with the differentially expressed HERVs. We found that differentially expressed HERVs were significantly more physically proximal to host differentially expressed genes than non-differentially expressed HERVs. Finally, we capitalized on locus-specific resolution of HERV mapping to identify key molecular pathways impacted by differential HERV expression in people with SLE.
Affiliation(s)
- Nathaniel Stearrett
- Computational Biology Institute, George Washington University, Washington, DC, United States
- Tyson Dawson
- Computational Biology Institute, George Washington University, Washington, DC, United States
- Ali Rahnavard
- Computational Biology Institute, George Washington University, Washington, DC, United States
- Department of Biostatistics & Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, United States
- Prathyusha Bachali
- RILITE Research Institute and AMPEL BioSolutions, Charlottesville, VA, United States
- Matthew L. Bendall
- Division of Infectious Diseases, Department of Medicine, Weill Cornell Medicine, New York, NY, United States
- Chen Zeng
- Department of Physics, The George Washington University, Washington, DC, United States
- Roberto Caricchio
- Lewis Katz School of Medicine, Temple University, Philadelphia, PA, United States
- Marcos Pérez-Losada
- Computational Biology Institute, George Washington University, Washington, DC, United States
- Department of Biostatistics & Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, United States
- CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Vairão, Portugal
- Amrie C. Grammer
- RILITE Research Institute and AMPEL BioSolutions, Charlottesville, VA, United States
- Peter E. Lipsky
- RILITE Research Institute and AMPEL BioSolutions, Charlottesville, VA, United States
- Keith A. Crandall
- Computational Biology Institute, George Washington University, Washington, DC, United States
- Department of Biostatistics & Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, United States
|
41
|
Kaden M, Bohnsack KS, Weber M, Kudła M, Gutowska K, Blazewicz J, Villmann T. Learning vector quantization as an interpretable classifier for the detection of SARS-CoV-2 types based on their RNA sequences. Neural Comput Appl 2021; 34:67-78. [PMID: 33935376 PMCID: PMC8076884 DOI: 10.1007/s00521-021-06018-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/13/2020] [Accepted: 04/07/2021] [Indexed: 02/06/2023]
Abstract
We present an approach to discriminate SARS-CoV-2 virus types based on their RNA sequence descriptions, avoiding sequence alignment. For that purpose, sequences are preprocessed by feature extraction and the resulting feature vectors are analyzed by prototype-based classification to remain interpretable. In particular, we propose to use variants of learning vector quantization (LVQ) based on dissimilarity measures for RNA sequence data. The respective matrix LVQ provides additional knowledge about the classification decisions, such as discriminant feature correlations, and can additionally be equipped with easy-to-realize reject options for uncertain data. Those options provide self-controlled evidence, i.e., the model refuses to make a classification decision if the model evidence for the presented data is not sufficient. This model is first trained using a GISAID dataset with given virus types detected according to the molecular differences in coronavirus populations by phylogenetic tree clustering. In a second step, we apply the trained model to another, unlabeled SARS-CoV-2 virus dataset. For these data, we can either assign a virus type to the sequences or reject atypical samples. Those rejected sequences allow speculation about new virus types with respect to nucleotide base mutations in the viral sequences. Moreover, this rejection analysis improves model robustness. Last but not least, the presented approach has lower computational complexity compared to methods based on (multiple) sequence alignment. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00521-021-06018-2.
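The idea of a prototype-based classifier with a reject option can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the matrix LVQ variant from the paper: it uses plain squared Euclidean distance on already extracted feature vectors, fixed prototypes instead of trained ones, and a relative-distance certainty score with a hypothetical `threshold` parameter. It also assumes prototypes for at least two classes.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_with_reject(x, prototypes, threshold=0.2):
    """Assign x to the class of its nearest prototype, or reject it.

    prototypes is a list of (vector, label) pairs covering at least two
    labels. The certainty is (d_second - d_best) / (d_second + d_best),
    where d_best is the distance to the nearest prototype and d_second is
    the distance to the nearest prototype of any other class. It lies in
    [0, 1]; values near 0 mean x sits between classes. Returns
    (label or None, certainty), with None meaning "rejected".
    """
    nearest = {}
    for vector, label in prototypes:
        d = squared_euclidean(x, vector)
        if label not in nearest or d < nearest[label]:
            nearest[label] = d
    ranked = sorted(nearest.items(), key=lambda item: item[1])
    (best_label, d_best), (_, d_second) = ranked[0], ranked[1]
    total = d_best + d_second
    certainty = (d_second - d_best) / total if total > 0 else 0.0
    return (best_label if certainty >= threshold else None), certainty


# Hypothetical two-class prototypes in a 2-D feature space.
protos = [([0.0, 0.0], "type A"), ([10.0, 0.0], "type B")]
print(classify_with_reject([1.0, 0.0], protos))   # close to the "type A" prototype
print(classify_with_reject([5.0, 0.0], protos))   # equidistant, so rejected
```

Samples returned as `None` are the ones a practitioner would inspect separately, which mirrors the abstract's use of rejected sequences as candidates for atypical or new virus types.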
Affiliation(s)
- Marika Kaden
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
- Katrin Sophie Bohnsack
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
- Mirko Weber
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
- Mateusz Kudła
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Kaja Gutowska
- Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland
- European Centre for Bioinformatics and Genomics, Piotrowo 2, 60-965 Poznan, Poland
- Jacek Blazewicz
- Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland
- European Centre for Bioinformatics and Genomics, Piotrowo 2, 60-965 Poznan, Poland
- Thomas Villmann
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
|
42
|
Kutnjak D, Tamisier L, Adams I, Boonham N, Candresse T, Chiumenti M, De Jonghe K, Kreuze JF, Lefebvre M, Silva G, Malapi-Wight M, Margaria P, Mavrič Pleško I, McGreig S, Miozzi L, Remenant B, Reynard JS, Rollin J, Rott M, Schumpp O, Massart S, Haegeman A. A Primer on the Analysis of High-Throughput Sequencing Data for Detection of Plant Viruses. Microorganisms 2021; 9:841. [PMID: 33920047 PMCID: PMC8071028 DOI: 10.3390/microorganisms9040841] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 03/14/2021] [Revised: 04/09/2021] [Accepted: 04/10/2021] [Indexed: 12/12/2022] Open
Abstract
High-throughput sequencing (HTS) technologies have become indispensable tools assisting plant virus diagnostics and research thanks to their ability to detect any plant virus in a sample without prior knowledge. As HTS technologies rely heavily on bioinformatics analysis of the huge amount of generated sequences, it is of utmost importance that researchers can rely on efficient and reliable bioinformatic tools and can understand the principles, advantages, and disadvantages of the tools used. Here, we present a critical overview of the steps involved in HTS as employed for plant virus detection and virome characterization. We start from sample preparation and nucleic acid extraction as appropriate to the chosen HTS strategy, followed by basic data analysis requirements, an extensive overview of the in-depth data processing options, and taxonomic classification of the viral sequences detected. By presenting the bioinformatic tools and a detailed overview of the consecutive steps that can be used to implement a well-structured HTS data analysis in an easy and accessible way, this paper is targeted at both beginners and expert scientists engaging in HTS plant virome projects.
Affiliation(s)
- Denis Kutnjak
- Department of Biotechnology and Systems Biology, National Institute of Biology, Večna pot 111, 1000 Ljubljana, Slovenia
| | - Lucie Tamisier
- Plant Pathology Laboratory, Université de Liège, Gembloux Agro-Bio Tech, TERRA, Passage des Déportés, 2, 5030 Gembloux, Belgium; (L.T.); (J.R.); (S.M.)
| | - Ian Adams
- Fera Science Limited, York YO41 1LZ, UK; (I.A.); (S.M.)
| | - Neil Boonham
- Institute for Agri-Food Research and Innovation, Newcastle University, King’s Rd, Newcastle Upon Tyne NE1 7RU, UK;
| | - Thierry Candresse
- UMR 1332 Biologie du Fruit et Pathologie, INRA, University of Bordeaux, 33140 Villenave d’Ornon, France; (T.C.); (M.L.)
| | - Michela Chiumenti
- Institute for Sustainable Plant Protection, National Research Council, Via Amendola, 122/D, 70126 Bari, Italy;
| | - Kris De Jonghe
- Plant Sciences Unit, Flanders Research Institute for Agriculture, Fisheries and Food, Burg. Van Gansberghelaan 96, 9820 Merelbeke, Belgium; (K.D.J.); (A.H.)
| | - Jan F. Kreuze
- International Potato Center (CIP), Avenida la Molina 1895, La Molina, Lima 15023, Peru;
| | - Marie Lefebvre
- UMR 1332 Biologie du Fruit et Pathologie, INRA, University of Bordeaux, 33140 Villenave d’Ornon, France; (T.C.); (M.L.)
| | - Gonçalo Silva
- Natural Resources Institute, University of Greenwich, Central Avenue, Chatham Maritime, Kent ME4 4TB, UK;
| | - Martha Malapi-Wight
- Biotechnology Risk Analysis Programs, Biotechnology Regulatory Services, Animal and Plant Health Inspection Service, U.S. Department of Agriculture, Riverdale, MD 20737, USA;
| | - Paolo Margaria
- Leibniz Institute-DSMZ, Inhoffenstrasse 7b, 38124 Braunschweig, Germany;
| | - Irena Mavrič Pleško
- Agricultural Institute of Slovenia, Hacquetova Ulica 17, 1000 Ljubljana, Slovenia;
| | - Sam McGreig
- Fera Science Limited, York YO41 1LZ, UK; (I.A.); (S.M.)
| | - Laura Miozzi
- Institute for Sustainable Plant Protection, National Research Council of Italy (IPSP-CNR), Strada delle Cacce 73, 10135 Torino, Italy;
| | - Benoit Remenant
- ANSES Plant Health Laboratory, 7 Rue Jean Dixméras, CEDEX 01, 49044 Angers, France;
| | | | - Johan Rollin
- Plant Pathology Laboratory, Université de Liège, Gembloux Agro-Bio Tech, TERRA, Passage des Déportés, 2, 5030 Gembloux, Belgium; (L.T.); (J.R.); (S.M.)
- DNAVision, 6041 Charleroi, Belgium
| | - Mike Rott
- Sidney Laboratory, Canadian Food Inspection Agency, 8801 East Saanich Rd, North Saanich, BC V8L 1H3, Canada;
| | - Olivier Schumpp
- Agroscope, Route de Duillier 50, 1260 Nyon, Switzerland; (J.-S.R.); (O.S.)
| | - Sébastien Massart
- Plant Pathology Laboratory, Université de Liège, Gembloux Agro-Bio Tech, TERRA, Passage des Déportés, 2, 5030 Gembloux, Belgium; (L.T.); (J.R.); (S.M.)
| | - Annelies Haegeman
- Plant Sciences Unit, Flanders Research Institute for Agriculture, Fisheries and Food, Burg. Van Gansberghelaan 96, 9820 Merelbeke, Belgium; (K.D.J.); (A.H.)
|
43
|
Bartoszewicz JM, Seidel A, Renard BY. Interpretable detection of novel human viruses from genome sequencing data. NAR Genom Bioinform 2021; 3:lqab004. [PMID: 33554119 PMCID: PMC7849996 DOI: 10.1093/nargab/lqab004] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 10/12/2020] [Revised: 01/04/2021] [Accepted: 01/15/2021] [Indexed: 01/21/2023] Open
Abstract
Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can also be applied to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused the COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages, not only enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.
Affiliation(s)
- Jakub M Bartoszewicz
- Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
- Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
- Data Analytics and Computational Statistics, Hasso Plattner Institute for Digital Engineering, 14482 Potsdam, Brandenburg, Germany
- Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Brandenburg, Germany
| | - Anja Seidel
- Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
- Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
| | - Bernhard Y Renard
- Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
- Data Analytics and Computational Statistics, Hasso Plattner Institute for Digital Engineering, 14482 Potsdam, Brandenburg, Germany
- Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Brandenburg, Germany
|
44
|
Acera Mateos P, Balboa RF, Easteal S, Eyras E, Patel HR. PACIFIC: a lightweight deep-learning classifier of SARS-CoV-2 and co-infecting RNA viruses. Sci Rep 2021; 11:3209. [PMID: 33547380 PMCID: PMC7864945 DOI: 10.1038/s41598-021-82043-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/26/2020] [Accepted: 01/12/2021] [Indexed: 01/30/2023] Open
Abstract
Viral co-infections occur in COVID-19 patients, potentially impacting disease progression and severity. However, there is currently no dedicated method to identify viral co-infections in patient RNA-seq data. We developed PACIFIC, a deep-learning algorithm that accurately detects SARS-CoV-2 and other common RNA respiratory viruses from RNA-seq data. Using in silico data, PACIFIC recovers the presence and relative concentrations of viruses with > 99% precision and recall. PACIFIC accurately detects SARS-CoV-2 and other viral infections in 63 independent in vitro cell culture and patient datasets. PACIFIC is an end-to-end tool that enables the systematic monitoring of viral infections in the current global pandemic.
Affiliation(s)
- Pablo Acera Mateos
- John Curtin School of Medical Research, Australian National University, Canberra, ACT 2600 Australia
- EMBL Australia Partner Laboratory Network at the Australian National University, Canberra, ACT 2600 Australia
| | - Renzo F. Balboa
- John Curtin School of Medical Research, Australian National University, Canberra, ACT 2600 Australia
- National Centre for Indigenous Genomics, Australian National University, Canberra, ACT 2600 Australia
| | - Simon Easteal
- John Curtin School of Medical Research, Australian National University, Canberra, ACT 2600 Australia
- National Centre for Indigenous Genomics, Australian National University, Canberra, ACT 2600 Australia
| | - Eduardo Eyras
- John Curtin School of Medical Research, Australian National University, Canberra, ACT 2600 Australia
- EMBL Australia Partner Laboratory Network at the Australian National University, Canberra, ACT 2600 Australia
- IMIM - Hospital del Mar Medical Research Institute, 08003 Barcelona, Spain
- Catalan Institution for Research and Advanced Studies, 08010 Barcelona, Spain
| | - Hardip R. Patel
- John Curtin School of Medical Research, Australian National University, Canberra, ACT 2600 Australia
- National Centre for Indigenous Genomics, Australian National University, Canberra, ACT 2600 Australia
|
45
|
Sharma H, Drukker L, Chatelain P, Droste R, Papageorghiou AT, Noble JA. Knowledge representation and learning of operator clinical workflow from full-length routine fetal ultrasound scan videos. Med Image Anal 2021; 69:101973. [PMID: 33550004 DOI: 10.1016/j.media.2021.101973] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 01/12/2020] [Revised: 11/18/2020] [Accepted: 01/11/2021] [Indexed: 12/25/2022]
Abstract
Ultrasound is a widely used imaging modality, yet it is well-known that scanning can be highly operator-dependent and difficult to perform, which limits its wider use in clinical practice. The literature on understanding what makes clinical sonography hard to learn and how sonography varies in the field is sparse, restricted to small-scale studies on the effectiveness of ultrasound training schemes, the role of ultrasound simulation in training, and the effect of introducing scanning guidelines and standards on diagnostic image quality. The Big Data era, and the recent and rapid emergence of machine learning as a more mainstream large-scale data analysis technique, presents a fresh opportunity to study sonography in the field at scale for the first time. Large-scale analysis of video recordings of full-length routine fetal ultrasound scans offers the potential to characterise differences between the scanning proficiency of experts and trainees that would be tedious and time-consuming to do manually due to the vast amounts of data. Such research would be informative to better understand operator clinical workflow when conducting ultrasound scans to support skills training, optimise scan times, and inform building better user-machine interfaces. This paper is to our knowledge the first to address sonography data science, which we consider in the context of second-trimester fetal sonography screening. Specifically, we present a fully-automatic framework to analyse operator clinical workflow solely from full-length routine second-trimester fetal ultrasound scan videos. An ultrasound video dataset containing more than 200 hours of scan recordings was generated for this study. We developed an original deep learning method to temporally segment the ultrasound video into semantically meaningful segments (the video description). The resulting semantic annotation was then used to depict operator clinical workflow (the knowledge representation). 
Machine learning was applied to the knowledge representation to characterise operator skills and assess operator variability. For video description, our best-performing deep spatio-temporal network shows favourable results in cross-validation (accuracy: 91.7%), statistical analysis (correlation: 0.98, p < 0.05) and retrospective manual validation (accuracy: 76.4%). For knowledge representation of operator clinical workflow, a three-level abstraction scheme consisting of a Subject-specific Timeline Model (STM), Summary of Timeline Features (STF), and an Operator Graph Model (OGM), was introduced that led to a significant decrease in dimensionality and computational complexity compared to raw video data. The workflow representations were learnt to discriminate between operator skills, where a proposed convolutional neural network-based model showed most promising performance (cross-validation accuracy: 98.5%, accuracy on unseen operators: 76.9%). These were further used to derive operator-specific scanning signatures and operator variability in terms of type, order and time distribution of constituent tasks.
Affiliation(s)
- Harshita Sharma
- Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, United Kingdom.
| | - Lior Drukker
- Nuffield Department of Women's and Reproductive Health, University of Oxford, Oxford, United Kingdom
| | - Pierre Chatelain
- Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, United Kingdom
| | - Richard Droste
- Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, United Kingdom
| | - Aris T Papageorghiou
- Nuffield Department of Women's and Reproductive Health, University of Oxford, Oxford, United Kingdom
| | - J Alison Noble
- Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford, United Kingdom
|
46
|
Boeckaerts D, Stock M, Criel B, Gerstmans H, De Baets B, Briers Y. Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins. Sci Rep 2021; 11:1467. [PMID: 33446856 PMCID: PMC7809048 DOI: 10.1038/s41598-021-81063-4] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 05/04/2020] [Accepted: 12/30/2020] [Indexed: 12/04/2022] Open
Abstract
Nowadays, bacteriophages are increasingly considered as an alternative treatment for a variety of bacterial infections in cases where classical antibiotics have become ineffective. However, characterizing the host specificity of phages remains a labor- and time-intensive process. In order to alleviate this burden, we have developed a new machine-learning-based pipeline to predict bacteriophage hosts based on annotated receptor-binding protein (RBP) sequence data. We focus on predicting bacterial hosts from the ESKAPE group, Escherichia coli, Salmonella enterica and Clostridium difficile. We compare the performance of our predictive model with that of the widely used Basic Local Alignment Search Tool (BLAST). Our best-performing predictive model reaches Precision-Recall Area Under the Curve (PR-AUC) scores between 73.6 and 93.8% for different levels of sequence similarity in the collected data. Our model reaches a performance comparable to that of BLASTp when sequence similarity in the data is high and starts outperforming BLASTp when sequence similarity drops below 75%. Therefore, our machine learning methods can be especially useful in settings in which sequence similarity to other known sequences is low. Predicting the hosts of novel metagenomic RBP sequences could extend our toolbox to tune the host spectrum of phages or phage tail-like bacteriocins by swapping RBPs.
Affiliation(s)
- Dimitri Boeckaerts
- KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Laboratory of Applied Biotechnology, Department of Biotechnology, Ghent University, Ghent, Belgium
| | - Michiel Stock
- KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| | - Bjorn Criel
- Laboratory of Applied Biotechnology, Department of Biotechnology, Ghent University, Ghent, Belgium
| | - Hans Gerstmans
- Laboratory of Applied Biotechnology, Department of Biotechnology, Ghent University, Ghent, Belgium
- Laboratory of Gene Technology, Department of Biosystems, KU Leuven, Leuven, Belgium
- MeBioS-Biosensors group, Department of BioSystems, KU Leuven, Leuven, Belgium
| | - Bernard De Baets
- KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| | - Yves Briers
- Laboratory of Applied Biotechnology, Department of Biotechnology, Ghent University, Ghent, Belgium.
|
47
|
Lopez-Rincon A, Tonda A, Mendoza-Maldonado L, Mulders DGJC, Molenkamp R, Perez-Romero CA, Claassen E, Garssen J, Kraneveld AD. Classification and specific primer design for accurate detection of SARS-CoV-2 using deep learning. Sci Rep 2021; 11:947. [PMID: 33441822 PMCID: PMC7806918 DOI: 10.1038/s41598-020-80363-5] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 10/05/2020] [Accepted: 12/21/2020] [Indexed: 02/07/2023] Open
Abstract
In this paper, deep learning is coupled with explainable artificial intelligence techniques for the discovery of representative genomic sequences in SARS-CoV-2. A convolutional neural network classifier is first trained on 553 sequences from the National Genomics Data Center repository, separating the genome of different virus strains from the Coronavirus family with 98.73% accuracy. The network's behavior is then analyzed, to discover sequences used by the model to identify SARS-CoV-2, ultimately uncovering sequences exclusive to it. The discovered sequences are validated on samples from the National Center for Biotechnology Information and Global Initiative on Sharing All Influenza Data repositories, and are proven to be able to separate SARS-CoV-2 from different virus strains with near-perfect accuracy. Next, one of the sequences is selected to generate a primer set, and tested against other state-of-the-art primer sets, obtaining competitive results. Finally, the primer is synthesized and tested on patient samples (n = 6 previously tested positive), delivering a sensitivity similar to routine diagnostic methods, and 100% specificity. The proposed methodology has a substantial added value over existing methods, as it is able to both automatically identify promising primer sets for a virus from a limited amount of data, and deliver effective results in a minimal amount of time. Considering the possibility of future pandemics, these characteristics are invaluable to promptly create specific detection methods for diagnostics.
Affiliation(s)
- Alejandro Lopez-Rincon
- Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, Utrecht University, Universiteitsweg 99, 3584 CG, Utrecht, The Netherlands.
| | - Alberto Tonda
- UMR 518 MIA-Paris, INRAE, c/o 113 rue Nationale, 75103, Paris, France
| | - Lucero Mendoza-Maldonado
- Hospital Civil de Guadalajara "Dr. Juan I. Menchaca", Salvador Quevedo y Zubieta 750, Independencia Oriente, C.P. 44340, Guadalajara, Jalisco, México
| | | | - Richard Molenkamp
- Department of Viroscience, Erasmus Medical Center, Rotterdam, The Netherlands
| | - Carmina A Perez-Romero
- Departamento de Investigación, Universidad Central de Queretaro (UNICEQ), Av. 5 de Febrero 1602, San Pablo, 76130, Santiago de Querétaro, QRO, Mexico
| | - Eric Claassen
- Athena Institute, Vrije Universiteit, De Boelelaan 1085, 1081 HV, Amsterdam, The Netherlands
| | - Johan Garssen
- Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, Utrecht University, Universiteitsweg 99, 3584 CG, Utrecht, The Netherlands
- Department Immunology, Danone Nutricia research, Uppsalalaan 12, 3584 CT, Utrecht, The Netherlands
| | - Aletta D Kraneveld
- Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, Utrecht University, Universiteitsweg 99, 3584 CG, Utrecht, The Netherlands
|
48
|
Chen X, Li D. Sequencing facility and DNA source associated patterns of virus-mappable reads in whole-genome sequencing data. Genomics 2021; 113:1189-1198. [PMID: 33301893 PMCID: PMC7856238 DOI: 10.1016/j.ygeno.2020.12.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/25/2020] [Revised: 11/25/2020] [Accepted: 12/04/2020] [Indexed: 12/12/2022]
Abstract
Numerous viral sequences have been reported in the whole-genome sequencing (WGS) data of human blood. However, it is not clear to what degree the virus-mappable reads represent true viral sequences rather than random-mapping or noise originating from sample preparation, sequencing processes, or other sources. Identification of patterns of virus-mappable reads may generate novel indicators for evaluating the origins of these viral sequences. We characterized paired-end unmapped reads and reads aligned to viral references in human WGS datasets, then compared patterns of the virus-mappable reads among DNA sources and sequencing facilities which produced these datasets. We then examined potential origins of the source- and facility-associated viral reads. The proportions of clean unmapped reads among the seven sequencing facilities were significantly different (P < 2 × 10⁻¹⁶). We identified 260,339 reads that were mappable to a total of 99 viral references in 2535 samples. The majority (86.7%) of these virus-mappable reads (corresponding to 47 viral references), which can be classified into four groups based on their distinct patterns, were strongly associated with sequencing facility or DNA source (adjusted P value < 0.01). Possible origins of these reads include artificial sequences in library preparation, recombinant vectors in cell culture, and phages co-contaminated with their host bacteria. The sequencing facility-associated virus-mappable reads and patterns were repeatedly observed in other datasets produced in the same facilities. We have constructed an analytic framework and profiled the unmapped reads mappable to viral references. The results provide a new understanding of sequencing facility- and DNA source-associated batch effects in deep sequencing data and may facilitate improved bioinformatics filtering of reads.
Affiliation(s)
- Xun Chen
- Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA
| | - Dawei Li
- Department of Microbiology and Molecular Genetics, University of Vermont, Burlington, VT 05405, USA; Department of Computer Science, University of Vermont, Burlington, VT 05405, USA; Neuroscience, Behavior, Health Initiative, University of Vermont, Burlington, VT 05405, USA.
|
49
|
Soft Computing in Bioinformatics. Adv Bioinformatics 2021. [DOI: 10.1007/978-981-33-6191-1_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022] Open
|
50
|
MirLocPredictor: A ConvNet-Based Multi-Label MicroRNA Subcellular Localization Predictor by Incorporating k-Mer Positional Information. Genes (Basel) 2020; 11:genes11121475. [PMID: 33316943 PMCID: PMC7763197 DOI: 10.3390/genes11121475] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 11/03/2020] [Revised: 11/23/2020] [Accepted: 11/25/2020] [Indexed: 02/06/2023] Open
Abstract
MicroRNAs (miRNAs) are small noncoding RNA sequences consisting of about 22 nucleotides that are involved in the regulation of almost 60% of mammalian genes. Presently, there are very limited approaches for the visualization of miRNA locations present inside cells to support the elucidation of pathways and mechanisms behind miRNA function, transport, and biogenesis. MIRLocator, a state-of-the-art tool for the prediction of subcellular localization of miRNAs, makes use of a sequence-to-sequence model along with pretrained k-mer embeddings. Existing pretrained k-mer embedding generation methodologies focus on the extraction of semantics of k-mers. However, in RNA sequences, positional information of nucleotides is more important because distinct positions of the four nucleotides define the function of an RNA molecule. Considering the importance of the nucleotide position, we propose a novel approach (kmerPR2vec) which is a fusion of positional information of k-mers with randomly initialized neural k-mer embeddings. In contrast to existing k-mer-based representations, the proposed kmerPR2vec representation is much richer in terms of semantic information and has more discriminative power. Using the novel kmerPR2vec representation, we further present an end-to-end system (MirLocPredictor) which couples the discriminative power of kmerPR2vec with Convolutional Neural Networks (CNNs) for miRNA subcellular location prediction. The effectiveness of the proposed kmerPR2vec approach is evaluated with deep learning-based topologies (i.e., Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)) and by using 9 different evaluation measures. Analysis of the results reveals that MirLocPredictor outperforms state-of-the-art methods by a significant margin of 18% and 19% in terms of precision and recall, respectively.
|