1
|
Xing J, Yang S, Liang Y, Hu P, Dai B, Li H, Xing Y, Zuo Y. Deciphering Sequence Determinants of Zygotic Genome Activation Genes: Insights From Machine Learning and the ZGAExplorer Platform. Cell Prolif 2025:e70039. [PMID: 40251810 DOI: 10.1111/cpr.70039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2025] [Revised: 03/11/2025] [Accepted: 03/28/2025] [Indexed: 04/21/2025] Open
Abstract
The mammalian life cycle initiates with the transition of genetic control from the maternal to the embryonic genome during zygotic genome activation (ZGA), which becomes pivotal for development. Nevertheless, understanding the conservation of genes and transcription factors (TFs) that underlie mammalian ZGA remains limited. Here, we compiled a comprehensive set of ZGA genes from mice, humans, pigs, bovines and goats, including Nr5a2 and TPRX1/2. The identification of 111 homologous genes through comparative analyses was followed by the discovery of a conserved genetic coding region, suggesting potential sequence preferences for ZGA genes. Notably, an interpretable machine learning model based on k-mer core features showed excellent performance in predicting ZGA genes (area under the ROC curve [AUC] > 0.81), revealing abundant and intricate 6-base sequence specific patterns and potential binding TFs, including motifs from NR5A2 and TPRX1/2. Further analysis demonstrated that gene sequence features and epigenetic modification features play equally important roles in regulating ZGA genes. Ultimately, we developed the ZGAExplorer platform to provide an invaluable resource for screening ZGA genes. Our study unravels the sequence determinants of ZGA genes across species through multi-omics data integration and machine learning, yielding insights into ZGA regulatory mechanisms and embryonic developmental arrest.
Collapse
Affiliation(s)
- Jixiang Xing
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot, China
| | - Siqi Yang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot, China
| | - Yuchao Liang
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot, China
| | - Pengwei Hu
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot, China
| | - Bingjie Dai
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot, China
| | - Hanshuang Li
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot, China
| | - Yongqiang Xing
- School of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, China
| | - Yongchun Zuo
- State Key Laboratory of Reproductive Regulation and Breeding of Grassland Livestock, Institutes of Biomedical Sciences, School of Life Sciences, Inner Mongolia University, Hohhot, China
| |
Collapse
|
2
|
Tian Q, Zhang P, Zhai Y, Wang Y, Zou Q. Application and Comparison of Machine Learning and Database-Based Methods in Taxonomic Classification of High-Throughput Sequencing Data. Genome Biol Evol 2024; 16:evae102. [PMID: 38748485 PMCID: PMC11135637 DOI: 10.1093/gbe/evae102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/12/2024] [Indexed: 05/30/2024] Open
Abstract
The advent of high-throughput sequencing technologies has not only revolutionized the field of bioinformatics but has also heightened the demand for efficient taxonomic classification. Despite technological advancements, efficiently processing and analyzing the deluge of sequencing data for precise taxonomic classification remains a formidable challenge. Existing classification approaches primarily fall into two categories, database-based methods and machine learning methods, each presenting its own set of challenges and advantages. On this basis, the aim of our study was to conduct a comparative analysis between these two methods while also investigating the merits of integrating multiple database-based methods. Through an in-depth comparative study, we evaluated the performance of both methodological categories in taxonomic classification by utilizing simulated data sets. Our analysis revealed that database-based methods excel in classification accuracy when backed by a rich and comprehensive reference database. Conversely, while machine learning methods show superior performance in scenarios where reference sequences are sparse or lacking, they generally show inferior performance compared with database methods under most conditions. Moreover, our study confirms that integrating multiple database-based methods does, in fact, enhance classification accuracy. These findings shed new light on the taxonomic classification of high-throughput sequencing data and bear substantial implications for the future development of computational biology. For those interested in further exploring our methods, the source code of this study is publicly available on https://github.com/LoadStar822/Genome-Classifier-Performance-Evaluator. Additionally, a dedicated webpage showcasing our collected database, data sets, and various classification software can be found at http://lab.malab.cn/~tqz/project/taxonomic/.
Collapse
Affiliation(s)
- Qinzhong Tian
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
| | - Pinglu Zhang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
| | - Yixiao Zhai
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
| | - Yansu Wang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003 China
| |
Collapse
|
3
|
Dotan E, Jaschek G, Pupko T, Belinkov Y. Effect of tokenization on transformers for biological sequences. Bioinformatics 2024; 40:btae196. [PMID: 38608190 PMCID: PMC11055402 DOI: 10.1093/bioinformatics/btae196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2023] [Revised: 02/20/2024] [Accepted: 04/11/2024] [Indexed: 04/14/2024] Open
Abstract
MOTIVATION Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families. RESULTS We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. AVAILABILITY AND IMPLEMENTATION Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.
Collapse
Affiliation(s)
- Edo Dotan
- The Henry and Marilyn Taub Faculty of Computer Science, Technion – Israel Institute of Technology, Haifa 3200003, Israel
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
| | - Gal Jaschek
- Department of Genetics, Yale University School of Medicine, New Haven, CT 06510, United States
| | - Tal Pupko
- The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
| | - Yonatan Belinkov
- The Henry and Marilyn Taub Faculty of Computer Science, Technion – Israel Institute of Technology, Haifa 3200003, Israel
| |
Collapse
|
4
|
de Andrade AAS, Grivet M, Brustolini O, Vasconcelos ATR. ( m, n)-mer-a simple statistical feature for sequence classification. BIOINFORMATICS ADVANCES 2023; 3:vbad088. [PMID: 37448814 PMCID: PMC10338135 DOI: 10.1093/bioadv/vbad088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 02/11/2023] [Revised: 06/22/2023] [Accepted: 07/10/2023] [Indexed: 07/15/2023]
Abstract
Summary The (m, n)-mer is a simple alternative classification feature based on conditional probability distributions. In this application note, we compared k-mer and (m, n)-mer frequency features in 11 distinct datasets used for binary, multiclass and clustering classifications. Our findings show that the (m, n)-mer frequency features are related to the highest performance metrics and often statistically outperformed the k-mers. Here, the (m, n)-mer frequencies improved performance for classifying smaller sequence lengths (as short as 300 bp) and yielded higher metrics when using short values of k (ranging from 2 to 4). Therefore, we present the (m, n)-mers frequencies to the scientific community as a feature that seems to be quite effective in identifying complex discriminatory patterns and classifying polyphyletic sequence groups. Availability and implementation The (m, n)-mer algorithm is released as an R package within the CRAN project (https://cran.r-project.org/web/packages/mnmer) and is also available at https://github.com/labinfo-lncc/mnmer. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
- Amanda Araújo Serrão de Andrade
- Bioinformatics Laboratory (LABINFO), National Laboratory for Scientific Computing, Av. Getulio Vargas, 333—Quitandinha, 25651-076, Rio de Janeiro, Brazil
| | - Marco Grivet
- Pontifícia Universidade Católica do Rio de Janeiro, Rua Marquês de São Vicente 225, Gávea, 22451-900, Rio de Janeiro, Brazil
| | - Otávio Brustolini
- Bioinformatics Laboratory (LABINFO), National Laboratory for Scientific Computing, Av. Getulio Vargas, 333—Quitandinha, 25651-076, Rio de Janeiro, Brazil
| | | |
Collapse
|
5
|
Abstract
In order to survey noroviruses in our environment, it is essential that both wet-lab and computational methods are fit for purpose. Using a simulated sequencing data set, denoising-based (DADA2, Deblur and USEARCH-UNOISE3) and clustering-based pipelines (VSEARCH and FROGS) were compared with respect to their ability to represent composition and sequence information. Open source classifiers (Ribosomal Database Project [RDP], BLASTn, IDTAXA, QIIME2 naive Bayes, and SINTAX) were trained using three different databases: a custom database, the NoroNet database, and the Human calicivirus database. Each classifier and database combination was compared from the perspective of their classification accuracy. VSEARCH provides a robust option for analyzing viral amplicons based on composition analysis; however, all pipelines could return OTUs with high similarity to the expected sequences. Importantly, pipeline choice could lead to more false positives (DADA2) or underclassification (FROGS), a key aspect when considering pipeline application for source attribution. Classification was more strongly impacted by the classifier than the database, although disagreement increased with norovirus GII.4 capsid variant designation. We recommend the use of the RDP classifier in conjunction with VSEARCH; however, maintenance of the underlying database is essential for optimal use. IMPORTANCE In benchmarking bioinformatic pipelines for analyzing high-throughput sequencing (HTS) data sets, we provide method standardization for bioinformatics broadly and specifically for norovirus in situations for which no officially endorsed methods exist at present. This study provides recommendations for the appropriate analysis and classification of norovirus amplicon HTS data and will be widely applicable during outbreak investigations.
Collapse
|
6
|
Cho KT, Sen TZ, Andorf CM. Predicting Tissue-Specific mRNA and Protein Abundance in Maize: A Machine Learning Approach. Front Artif Intell 2022; 5:830170. [PMID: 35719692 PMCID: PMC9204276 DOI: 10.3389/frai.2022.830170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2021] [Accepted: 04/26/2022] [Indexed: 11/13/2022] Open
Abstract
Machine learning and modeling approaches have been used to classify protein sequences for a broad set of tasks including predicting protein function, structure, expression, and localization. Some recent studies have successfully predicted whether a given gene is expressed as mRNA or even translated to proteins potentially, but given that not all genes are expressed in every condition and tissue, the challenge remains to predict condition-specific expression. To address this gap, we developed a machine learning approach to predict tissue-specific gene expression across 23 different tissues in maize, solely based on DNA promoter and protein sequences. For class labels, we defined high and low expression levels for mRNA and protein abundance and optimized classifiers by systematically exploring various methods and combinations of k-mer sequences in a two-phase approach. In the first phase, we developed Markov model classifiers for each tissue and built a feature vector based on the predictions. In the second phase, the feature vector was used as an input to a Bayesian network for final classification. Our results show that these methods can achieve high classification accuracy of up to 95% for predicting gene expression for individual tissues. By relying on sequence alone, our method works in settings where costly experimental data are unavailable and reveals useful insights into the functional, evolutionary, and regulatory characteristics of genes.
Collapse
Affiliation(s)
- Kyoung Tak Cho
- Department of Computer Science, Iowa State University, Ames, IA, United States
| | - Taner Z. Sen
- USDA-ARS, Crop Improvement and Genetics Research Unit, Albany, CA, United States
| | - Carson M. Andorf
- USDA-ARS, Corn Insects and Crop Genetics Research Unit, Ames, IA, United States
- *Correspondence: Carson M. Andorf
| |
Collapse
|
7
|
Sokhansanj BA, Rosen GL. Mapping Data to Deep Understanding: Making the Most of the Deluge of SARS-CoV-2 Genome Sequences. mSystems 2022; 7:e0003522. [PMID: 35311562 PMCID: PMC9040592 DOI: 10.1128/msystems.00035-22] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 02/27/2022] [Indexed: 12/22/2022] Open
Abstract
Next-generation sequencing has been essential to the global response to the COVID-19 pandemic. As of January 2022, nearly 7 million severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences are available to researchers in public databases. Sequence databases are an abundant resource from which to extract biologically relevant and clinically actionable information. As the pandemic has gone on, SARS-CoV-2 has rapidly evolved, involving complex genomic changes that challenge current approaches to classifying SARS-CoV-2 variants. Deep sequence learning could be a potentially powerful way to build complex sequence-to-phenotype models. Unfortunately, while they can be predictive, deep learning typically produces "black box" models that cannot directly provide biological and clinical insight. Researchers should therefore consider implementing emerging methods for visualizing and interpreting deep sequence models. Finally, researchers should address important data limitations, including (i) global sequencing disparities, (ii) insufficient sequence metadata, and (iii) screening artifacts due to poor sequence quality control.
Collapse
Affiliation(s)
- Bahrad A. Sokhansanj
- Drexel University, Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical & Computer Engineering, College of Engineering, Philadelphia, Pennsylvania, USA
| | - Gail L. Rosen
- Drexel University, Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical & Computer Engineering, College of Engineering, Philadelphia, Pennsylvania, USA
| |
Collapse
|
8
|
Peng Z, Maciel-Guerra A, Baker M, Zhang X, Hu Y, Wang W, Rong J, Zhang J, Xue N, Barrow P, Renney D, Stekel D, Williams P, Liu L, Chen J, Li F, Dottorini T. Whole-genome sequencing and gene sharing network analysis powered by machine learning identifies antibiotic resistance sharing between animals, humans and environment in livestock farming. PLoS Comput Biol 2022; 18:e1010018. [PMID: 35333870 PMCID: PMC8986120 DOI: 10.1371/journal.pcbi.1010018] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2021] [Revised: 04/06/2022] [Accepted: 03/14/2022] [Indexed: 01/26/2023] Open
Abstract
Anthropogenic environments such as those created by intensive farming of livestock, have been proposed to provide ideal selection pressure for the emergence of antimicrobial-resistant Escherichia coli bacteria and antimicrobial resistance genes (ARGs) and spread to humans. Here, we performed a longitudinal study in a large-scale commercial poultry farm in China, collecting E. coli isolates from both farm and slaughterhouse; targeting animals, carcasses, workers and their households and environment. By using whole-genome phylogenetic analysis and network analysis based on single nucleotide polymorphisms (SNPs), we found highly interrelated non-pathogenic and pathogenic E. coli strains with phylogenetic intermixing, and a high prevalence of shared multidrug resistance profiles amongst livestock, human and environment. Through an original data processing pipeline which combines omics, machine learning, gene sharing network and mobile genetic elements analysis, we investigated the resistance to 26 different antimicrobials and identified 361 genes associated to antimicrobial resistance (AMR) phenotypes; 58 of these were known AMR-associated genes and 35 were associated to multidrug resistance. We uncovered an extensive network of genes, correlated to AMR phenotypes, shared among livestock, humans, farm and slaughterhouse environments. We also found several human, livestock and environmental isolates sharing closely related mobile genetic elements carrying ARGs across host species and environments. In a scenario where no consensus exists on how antibiotic use in the livestock may affect antibiotic resistance in the human population, our findings provide novel insights into the broader epidemiology of antimicrobial resistance in livestock farming. Moreover, our original data analysis method has the potential to uncover AMR transmission pathways when applied to the study of other pathogens active in other anthropogenic environments characterised by complex interconnections between host species. Livestock have been suggested as an important source of antimicrobial-resistant (AMR) Escherichia coli, capable of infecting humans and carrying resistance to drugs used in human medicine. China has a large intensive livestock farming industry, poultry being the second most important source of meat in the country, and is the largest user of antibiotics for food production in the world. Here we studied antimicrobial resistance gene overlap between E. coli isolates collected from humans, livestock and their shared environments in a large-scale Chinese poultry farm and associated slaughterhouse. By using a computational approach that integrates machine learning, whole-genome sequencing, gene sharing network and mobile genetic elements analysis we characterized the E. coli community structure, antimicrobial resistance phenotypes and the genetic relatedness of non-pathogenic and pathogenic E. coli strains. We uncovered the network of genes, associated with AMR, shared across host species (animals and workers) and environments (farm and slaughterhouse). Our approach opens up new avenues for the development of a fast, affordable and effective computational solutions that provide novel insights into the broader epidemiology of antimicrobial resistance in livestock farming.
Collapse
Affiliation(s)
- Zixin Peng
- NHC Key Laboratory of Food Safety Risk Assessment, Chinese Academy of Medical Science Research Unit (2019RU014), China National Center for Food Safety Risk Assessment, Beijing, People’s Republic of China
| | - Alexandre Maciel-Guerra
- School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington, United Kingdom
| | - Michelle Baker
- School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington, United Kingdom
| | - Xibin Zhang
- Qingdao Tian run Food Co., Ltd, New Hope, Beijing, People’s Republic of China
| | - Yue Hu
- School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington, United Kingdom
| | - Wei Wang
- NHC Key Laboratory of Food Safety Risk Assessment, Chinese Academy of Medical Science Research Unit (2019RU014), China National Center for Food Safety Risk Assessment, Beijing, People’s Republic of China
| | - Jia Rong
- Qingdao Tian run Food Co., Ltd, New Hope, Beijing, People’s Republic of China
| | - Jing Zhang
- NHC Key Laboratory of Food Safety Risk Assessment, Chinese Academy of Medical Science Research Unit (2019RU014), China National Center for Food Safety Risk Assessment, Beijing, People’s Republic of China
| | - Ning Xue
- School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington, United Kingdom
| | - Paul Barrow
- School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington, United Kingdom
- School of Veterinary Medicine, University of Surrey, Guildford, Surrey, United Kingdom
| | - David Renney
- Nimrod Veterinary Products Limited, Moreton-in-Marsh, United Kingdom
| | - Dov Stekel
- School of Biosciences, University of Nottingham, Sutton Bonington, United Kingdom
| | - Paul Williams
- Biodiscovery Institute and School of Life Sciences, University of Nottingham, Nottingham, United Kingdom
| | - Longhai Liu
- Qingdao Tian run Food Co., Ltd, New Hope, Beijing, People’s Republic of China
| | - Junshi Chen
- NHC Key Laboratory of Food Safety Risk Assessment, Chinese Academy of Medical Science Research Unit (2019RU014), China National Center for Food Safety Risk Assessment, Beijing, People’s Republic of China
| | - Fengqin Li
- NHC Key Laboratory of Food Safety Risk Assessment, Chinese Academy of Medical Science Research Unit (2019RU014), China National Center for Food Safety Risk Assessment, Beijing, People’s Republic of China
- * E-mail: (FL); (TD)
| | - Tania Dottorini
- School of Veterinary Medicine and Science, University of Nottingham, Sutton Bonington, United Kingdom
- * E-mail: (FL); (TD)
| |
Collapse
|
9
|
Tarasova O, Poroikov V. Machine Learning in Discovery of New Antivirals and Optimization of Viral Infections Therapy. Curr Med Chem 2021; 28:7840-7861. [PMID: 33949929 DOI: 10.2174/0929867328666210504114351] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2020] [Revised: 02/13/2021] [Accepted: 02/24/2021] [Indexed: 11/22/2022]
Abstract
Nowadays, computational approaches play an important role in the design of new drug-like compounds and optimization of pharmacotherapeutic treatment of diseases. The emerging growth of viral infections, including those caused by the Human Immunodeficiency Virus (HIV), Ebola virus, recently detected coronavirus, and some others, leads to many newly infected people with a high risk of death or severe complications. A huge amount of chemical, biological, clinical data is at the disposal of the researchers. Therefore, there are many opportunities to find the relationships between the particular features of chemical data and the antiviral activity of biologically active compounds based on machine learning approaches. Biological and clinical data can also be used for building models to predict relationships between viral genotype and drug resistance, which might help determine the clinical outcome of treatment. In the current study, we consider machine-learning approaches in the antiviral research carried out during the past decade. We overview in detail the application of machine-learning methods for the design of new potential antiviral agents and vaccines, drug resistance prediction, and analysis of virus-host interactions. Our review also covers the perspectives of using the machine-learning approaches for antiviral research, including Dengue, Ebola viruses, Influenza A, Human Immunodeficiency Virus, coronaviruses, and some others.
Collapse
Affiliation(s)
- Olga Tarasova
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow. Russian Federation
| | - Vladimir Poroikov
- Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow. Russian Federation
| |
Collapse
|
10
|
Fitzpatrick AH, Rupnik A, O'Shea H, Crispie F, Keaveney S, Cotter P. High Throughput Sequencing for the Detection and Characterization of RNA Viruses. Front Microbiol 2021; 12:621719. [PMID: 33692767 PMCID: PMC7938315 DOI: 10.3389/fmicb.2021.621719] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2020] [Accepted: 01/20/2021] [Indexed: 12/12/2022] Open
Abstract
This review aims to assess and recommend approaches for targeted and agnostic High Throughput Sequencing of RNA viruses in a variety of sample matrices. HTS also referred to as deep sequencing, next generation sequencing and third generation sequencing; has much to offer to the field of environmental virology as its increased sequencing depth circumvents issues with cloning environmental isolates for Sanger sequencing. That said however, it is important to consider the challenges and biases that method choice can impart to sequencing results. Here, methodology choices from RNA extraction, reverse transcription to library preparation are compared based on their impact on the detection or characterization of RNA viruses.
Collapse
Affiliation(s)
- Amy H. Fitzpatrick
- Food Biosciences, Teagasc Food Research Centre, Fermoy, Ireland
- Shellfish Microbiology, Marine Institute, Oranmore, Ireland
- Biological Sciences, Munster Technological University, Cork, Ireland
| | | | - Helen O'Shea
- Biological Sciences, Munster Technological University, Cork, Ireland
| | - Fiona Crispie
- Food Biosciences, Teagasc Food Research Centre, Fermoy, Ireland
| | | | - Paul Cotter
- Food Biosciences, Teagasc Food Research Centre, Fermoy, Ireland
| |
Collapse
|