1
Magateshvaren Saras MA, Mitra MK, Tyagi S. Navigating the Multiverse: a Hitchhiker's guide to selecting harmonization methods for multimodal biomedical data. Biol Methods Protoc 2025; 10:bpaf028. [PMID: 40308831 PMCID: PMC12043205 DOI: 10.1093/biomethods/bpaf028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2025] [Revised: 03/20/2025] [Accepted: 04/15/2025] [Indexed: 05/02/2025] Open
Abstract
The application of machine learning (ML) techniques in predictive modelling has greatly advanced our comprehension of biological systems. There is a notable shift towards integration methods that specifically target the simultaneous analysis of multiple modes or types of data, showcasing superior results compared to individual analyses. Despite the availability of diverse ML architectures for researchers interested in embracing a multimodal approach, the current literature lacks a comprehensive taxonomy that includes the pros and cons of these methods to guide the entire process. Closing this gap is imperative, necessitating the creation of a robust framework. This framework should not only categorize the diverse ML architectures suitable for multimodal analysis but also offer insights into their respective advantages and limitations. Additionally, such a framework can serve as a valuable guide for selecting an appropriate workflow for multimodal analysis. This comprehensive taxonomy would provide clear guidance and support informed decision-making within the progressively intricate landscape of biomedical and clinical data analysis. This is an essential step towards advancing personalized medicine. The aims of the work are to comprehensively study and describe the harmonization processes that are performed and reported in the literature and present a working guide that would enable planning and selecting an appropriate integrative model. We present harmonization as a dual process of representation and integration, each with multiple methods and categories. The various representation and integration methods are classified into six broad categories and detailed with their advantages, disadvantages and examples. A guide flowchart describing the step-by-step processes that are needed to adopt a multimodal approach is also presented along with examples and references.
This review provides a thorough taxonomy of methods for harmonizing multimodal data and introduces a foundational 10-step guide for newcomers to implement a multimodal workflow.
Affiliation(s)
- Murali Aadhitya Magateshvaren Saras
  - IITB-Monash Research Academy, Mumbai, Maharashtra 400076, India
  - Department of Physics, Indian Institute of Technology Bombay, Mumbai, Maharashtra 400076, India
  - School of Translational Medicine, Monash University, Melbourne, Victoria 3181, Australia
- Mithun K Mitra
  - Department of Physics, Indian Institute of Technology Bombay, Mumbai, Maharashtra 400076, India
- Sonika Tyagi
  - School of Translational Medicine, Monash University, Melbourne, Victoria 3181, Australia
  - School of Computing Technologies, RMIT University, Melbourne, Victoria 3001, Australia
2
Tyagi N, Vahab N, Tyagi S. Genome language modeling (GLM): a beginner's cheat sheet. Biol Methods Protoc 2025; 10:bpaf022. [PMID: 40370585 PMCID: PMC12077296 DOI: 10.1093/biomethods/bpaf022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2025] [Revised: 02/17/2025] [Accepted: 03/23/2025] [Indexed: 05/16/2025] Open
Abstract
Integrating genomics with diverse data modalities has the potential to revolutionize personalized medicine. However, this integration poses significant challenges due to the fundamental differences in data types and structures. The vast size of the genome necessitates transformation into a condensed representation containing key biomarkers and relevant features to ensure interoperability with other modalities. This commentary explores both conventional and state-of-the-art approaches to genome language modeling (GLM), with a focus on representing and extracting meaningful features from genomic sequences. We focus on the latest trends in applying language modeling techniques to genomic sequence data, treating it as a text modality. Effective feature extraction is essential for enabling machine learning models to analyze large genomic datasets, particularly within multimodal frameworks. We first provide a step-by-step guide to various genomic sequence preprocessing and tokenization techniques. Then we explore feature extraction methods for the transformation of tokens using frequency, embedding, and neural network-based approaches. Finally, we discuss machine learning (ML) applications in genomics, focusing on classification, regression, language processing algorithms, and multimodal integration. Additionally, we explore the role of GLM in functional annotation, emphasizing how advanced ML models, such as Bidirectional Encoder Representations from Transformers (BERT), enhance the interpretation of genomic data. To the best of our knowledge, we compile the first end-to-end analytic guide to convert complex genomic data into biologically interpretable information using GLM, thereby facilitating the development of novel data-driven hypotheses.
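As a rough illustration of the tokenization and frequency-based feature extraction steps this abstract mentions, the sketch below splits a DNA sequence into overlapping k-mers and turns them into relative frequencies. It is a minimal example under stated assumptions, not the procedure from the cited guide: the function names `kmer_tokens` and `kmer_frequencies`, the default `k = 3`, and the choice to skip k-mers containing ambiguous bases are all hypothetical.

```python
from collections import Counter


def kmer_tokens(seq: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers (overlapping when stride < k).

    Assumption: k-mers containing anything other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    tokens = []
    for i in range(0, len(seq) - k + 1, stride):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            tokens.append(kmer)
    return tokens


def kmer_frequencies(seq: str, k: int = 3) -> dict[str, float]:
    """Map each observed k-mer to its relative frequency in the sequence."""
    tokens = kmer_tokens(seq, k)
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    print(kmer_tokens("ACGTAC", k=3))
    print(kmer_frequencies("ACGTAC", k=2))
```

A frequency dictionary like this could feed a conventional classifier, whereas embedding and neural network-based approaches would typically consume the token list itself.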
Affiliation(s)
- Navya Tyagi
  - AI and Data Science, Indian Institute of Technology, Madras, Chennai 600036, Tamil Nadu, India
  - Amity Institute of Integrative Health Sciences, Amity University, Gurugram 122412, Haryana, India
- Naima Vahab
  - School of Computing Technologies, Royal Melbourne Institute of Technology (RMIT) University, 3001 Melbourne, Australia
- Sonika Tyagi
  - School of Computing Technologies, Royal Melbourne Institute of Technology (RMIT) University, 3001 Melbourne, Australia
3
Pike AMC, Amal S, Maginnis MS, Wilczek MP. Evaluating Neural Network Performance in Predicting Disease Status and Tissue Source of JC Polyomavirus from Patient Isolates Based on the Hypervariable Region of the Viral Genome. Viruses 2024; 17:12. [PMID: 39861801 PMCID: PMC11769028 DOI: 10.3390/v17010012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2024] [Revised: 12/23/2024] [Accepted: 12/24/2024] [Indexed: 01/27/2025] Open
Abstract
JC polyomavirus (JCPyV) establishes a persistent, asymptomatic kidney infection in most of the population. However, JCPyV can reactivate in immunocompromised individuals and cause progressive multifocal leukoencephalopathy (PML), a fatal demyelinating disease with no approved treatment. Mutations in the hypervariable non-coding control region (NCCR) of the JCPyV genome have been linked to disease outcomes and neuropathogenesis, yet few meta-analyses document these associations. Many online sequence entries, including those on NCBI databases, lack sufficient sample information, limiting large-scale analyses of NCCR sequences. Machine learning techniques, however, can augment available data for analysis. This study employs a previously compiled dataset of 989 JCPyV NCCR sequences from GenBank with associated patient PML status and viral tissue source to train multilayer perceptrons for predicting missing information within the dataset. The PML status and tissue source models were 100% and 87.8% accurate, respectively. Within the dataset, 348 samples had an unconfirmed PML status, where 259 were predicted as No PML and 89 as PML sequences. Of the 63 sequences with unconfirmed tissue sources, eight samples were predicted as urine, 13 as blood, and 42 as cerebrospinal fluid. These models can improve viral sequence identification and provide insights into viral mutations and pathogenesis.
Affiliation(s)
- Aiden M. C. Pike
  - Maine Space Grant Consortium, Augusta, ME 04330, USA
  - Life Sciences, Health, and Engineering Department, The Roux Institute, Northeastern University, Portland, ME 04101, USA
  - Department of Molecular and Biomedical Sciences, University of Maine, Orono, ME 04469, USA
- Saeed Amal
  - The Roux Institute, Northeastern University, Portland, ME 04101, USA
  - Department of Bioengineering, College of Engineering, Northeastern University, Boston, MA 02115, USA
- Melissa S. Maginnis
  - Department of Molecular and Biomedical Sciences, University of Maine, Orono, ME 04469, USA
  - Graduate School in Biomedical Science and Engineering, University of Maine, Orono, ME 04469, USA
- Michael P. Wilczek
  - Life Sciences, Health, and Engineering Department, The Roux Institute, Northeastern University, Portland, ME 04101, USA
  - Observational Health Data Sciences and Informatics Center, The Roux Institute, Northeastern University, Portland, ME 04101, USA
  - Department of Chemistry and Chemical Biology, College of Science, Northeastern University, Boston, MA 02115, USA
4
Fernandez‐Moreno J, Yaschenko AE, Neubauer M, Marchi AJ, Zhao C, Ascencio‐Ibanez JT, Alonso JM, Stepanova AN. A rapid and scalable approach to build synthetic repetitive hormone-responsive promoters. Plant Biotechnol J 2024; 22:1942-1956. [PMID: 38379432 PMCID: PMC11182585 DOI: 10.1111/pbi.14313] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2023] [Revised: 02/04/2024] [Accepted: 02/06/2024] [Indexed: 02/22/2024]
Abstract
Advancement of DNA-synthesis technologies has greatly facilitated the development of synthetic biology tools. However, high-complexity DNA sequences containing tandems of short repeats are still notoriously difficult to produce synthetically, with commercial DNA synthesis companies usually rejecting orders that exceed specific sequence complexity thresholds. To overcome this limitation, we developed a simple, single-tube reaction method that enables the generation of DNA sequences containing multiple repetitive elements. Our strategy involves commercial synthesis and PCR amplification of padded sequences that contain the repeats of interest, along with random intervening sequence stuffers that include type IIS restriction enzyme sites. GoldenBraid molecular cloning technology is then employed to remove the stuffers, rejoin the repeats together in a predefined order, and subclone the tandem(s) in a vector using a single-tube digestion-ligation reaction. In our hands, this new approach is much simpler, more versatile and efficient than previously developed solutions to this problem. As a proof of concept, two different phytohormone-responsive, synthetic, repetitive proximal promoters were generated and tested in planta in the context of transcriptional reporters. Analysis of transgenic lines carrying the synthetic ethylene-responsive promoter 10x2EBS-S10 fused to the GUS reporter gene uncovered several developmentally regulated ethylene response maxima, indicating the utility of this reporter for monitoring the involvement of ethylene in a variety of physiologically relevant processes. These encouraging results suggest that this reporter system can be leveraged to investigate the ethylene response to biotic and abiotic factors with high spatial and temporal resolution.
Affiliation(s)
- Anna E. Yaschenko
  - Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, USA
- Matthew Neubauer
  - Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, USA
- Alex J. Marchi
  - Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, USA
- Chengsong Zhao
  - Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, USA
- José T. Ascencio‐Ibanez
  - Department of Molecular and Structural Biochemistry, North Carolina State University, Raleigh, NC, USA
- Jose M. Alonso
  - Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, USA
- Anna N. Stepanova
  - Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, USA
5
Motoche-Monar C, Ordoñez JE, Chang O, Gonzales-Zubiate FA. gRNA Design: How Its Evolution Impacted on CRISPR/Cas9 Systems Refinement. Biomolecules 2023; 13:1698. [PMID: 38136570 PMCID: PMC10741458 DOI: 10.3390/biom13121698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2023] [Revised: 06/05/2023] [Accepted: 06/12/2023] [Indexed: 12/24/2023] Open
Abstract
Over the past decade, genetic engineering has witnessed a revolution with the emergence of a relatively new genetic editing tool based on RNA-guided nucleases: the CRISPR/Cas9 system. Since the first report in 1987 and characterization in 2007 as a bacterial defense mechanism, this system has garnered immense interest and research attention. CRISPR systems provide immunity to bacteria against invading genetic material; however, with specific modifications in sequence and structure, they become precise editing systems capable of modifying the genomes of a wide range of organisms. The refinement of these modifications encompasses diverse approaches, including the development of more accurate nucleases, understanding of the cellular context and epigenetic conditions, and the redesign of guide RNAs (gRNAs). Considering the critical importance of the correct performance of CRISPR/Cas9 systems, our scope will emphasize the latter approach. Hence, we present an overview of the past and the most recent guide RNA web-based design tools, highlighting the evolution of their computational architecture and gRNA characteristics over the years. Our study explains computational approaches that use machine learning techniques, neural networks, and gRNA/target interaction data to enable predictions and classifications. This review could open the door to a dynamic community that uses up-to-date algorithms to optimize and create promising gRNAs, suitable for modern CRISPR/Cas9 engineering.
Affiliation(s)
- Cristofer Motoche-Monar
  - School of Biological Sciences and Engineering, Yachay Tech University, Urcuquí 100119, Ecuador
- Julián E. Ordoñez
  - School of Biological Sciences and Engineering, Yachay Tech University, Urcuquí 100119, Ecuador
- Oscar Chang
  - Departamento de Electrónica, Universidad Simon Bolivar, Caracas 1080, Venezuela
  - MIND Research Group, Model Intelligent Networks Development, Urcuquí 100119, Ecuador
- Fernando A. Gonzales-Zubiate
  - School of Biological Sciences and Engineering, Yachay Tech University, Urcuquí 100119, Ecuador
  - MIND Research Group, Model Intelligent Networks Development, Urcuquí 100119, Ecuador
6
Liu X, Teng L, Luo Y, Xu Y. Prediction of prokaryotic and eukaryotic promoters based on information-theoretic features. Biosystems 2023; 231:104979. [PMID: 37423595 DOI: 10.1016/j.biosystems.2023.104979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2022] [Revised: 07/06/2023] [Accepted: 07/07/2023] [Indexed: 07/11/2023]
Abstract
Promoters are DNA regulatory elements located near the transcription start site and are responsible for regulating the transcription of genes. DNA fragments arranged in a certain order form specific functional regions with different information contents. Information theory is the science that studies the extraction, measurement and transmission of information. The genetic information contained in DNA follows the general laws of information storage. Therefore, methods from information theory can be used for the analysis of promoters carrying genetic information. In this study, we introduced the concepts of information theory to the study of promoter prediction. We used 107 features extracted based on information theory methods and a backpropagation neural network to build a classifier. Then, the trained classifier was applied to predict the promoters of 6 organisms. The average AUCs of the 6 organisms obtained by using hold-out validation and ten-fold cross-validation were 0.885 and 0.886, respectively. The results verified the effectiveness of information-theoretic features in promoter prediction. Considering the possible redundancy in the feature set, we performed feature selection and obtained key feature subsets related to promoter characteristics. The results indicate the potential utility of information-theoretic features in promoter prediction.
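To make the idea of an information-theoretic sequence feature concrete, the sketch below computes the Shannon entropy (in bits) of the k-mer distribution of a DNA sequence. This is only one plausible feature of this kind and an assumption on our part: the abstract does not list the 107 features actually used, and the function name `shannon_entropy` and its parameters are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq: str, k: int = 1) -> float:
    """Shannon entropy, in bits, of the overlapping k-mer distribution of seq.

    Returns 0.0 when the sequence is shorter than k.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    # H = -sum(p * log2(p)) over the observed k-mer probabilities
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    for s in ("AAAA", "AACC", "ACGT"):
        print(s, shannon_entropy(s))
```

A low value indicates a repetitive, low-complexity region, while a value near `2 * k` bits indicates that the k-mers are close to uniformly distributed over the four bases.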
Affiliation(s)
- Xiao Liu
  - School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
- Li Teng
  - School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
- Yachuan Luo
  - School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
- Yuqiao Xu
  - School of Microelectronics and Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China
7
Barbero-Aparicio JA, Olivares-Gil A, Díez-Pastor JF, García-Osorio C. Deep learning and support vector machines for transcription start site identification. PeerJ Comput Sci 2023; 9:e1340. [PMID: 37346545 PMCID: PMC10280436 DOI: 10.7717/peerj-cs.1340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2022] [Accepted: 03/21/2023] [Indexed: 06/23/2023]
Abstract
Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have been proven to be exceptionally effective for this task, but their use in transcription start site identification has not yet been explored in depth. Also, the very few existing works neither compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The small number of published papers on this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied in related problems with remarkable results, we compared their performance in transcription start site prediction, concluding that SVMs are computationally much slower, and that deep learning methods, especially long short-term memory neural networks (LSTMs), are better suited to work with sequences than SVMs. For this purpose, we used the reference human genome GRCh38. Additionally, we studied two different aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models studied was also tested using the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices in transcription start site identification, as well as a method to generate transcription start site datasets including negative instances on any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solve this problem, being more efficient and better adapted to long sequences and large amounts of data.
We also create a transcription start site (TSS) dataset large enough to be used in deep learning experiments.
Affiliation(s)
- Alicia Olivares-Gil
  - Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
- José F. Díez-Pastor
  - Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
- César García-Osorio
  - Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
8
Yasmeen E, Wang J, Riaz M, Zhang L, Zuo K. Designing artificial synthetic promoters for accurate, smart, and versatile gene expression in plants. Plant Commun 2023:100558. [PMID: 36760129 PMCID: PMC10363483 DOI: 10.1016/j.xplc.2023.100558] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 09/20/2022] [Revised: 01/30/2023] [Accepted: 02/06/2023] [Indexed: 06/18/2023]
Abstract
With the development of high-throughput biology techniques and artificial intelligence, it has become increasingly feasible to design and construct artificial biological parts, modules, circuits, and even whole systems. To overcome the limitations of native promoters in controlling gene expression, artificial promoter design aims to synthesize short, inducible, and conditionally controlled promoters to coordinate the expression of multiple genes in diverse plant metabolic and signaling pathways. Synthetic promoters are versatile and can drive gene expression accurately with smart responses; they show potential for enhancing desirable traits in crops, thereby improving crop yield, nutritional quality, and food security. This review first illustrates the importance of synthetic promoters, then introduces promoter architecture and thoroughly summarizes advances in synthetic promoter construction. Restrictions to the development of synthetic promoters and future applications of such promoters in synthetic plant biology and crop improvement are also discussed.
Affiliation(s)
- Erum Yasmeen
  - Single Cell Research Center, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China
- Jin Wang
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, China
- Muhammad Riaz
  - Single Cell Research Center, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China
- Lida Zhang
  - Single Cell Research Center, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China
- Kaijing Zuo
  - Single Cell Research Center, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China
9
Barbero-Aparicio JA, Cuesta-Lopez S, García-Osorio CI, Pérez-Rodríguez J, García-Pedrajas N. Nonlinear physics opens a new paradigm for accurate transcription start site prediction. BMC Bioinformatics 2022; 23:565. [PMID: 36585618 PMCID: PMC9801560 DOI: 10.1186/s12859-022-05129-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2022] [Accepted: 12/27/2022] [Indexed: 12/31/2022] Open
Abstract
There is evidence that DNA breathing (spontaneous opening of the DNA strands) plays a relevant role in the interactions of DNA with other molecules, and in particular in the transcription process. Therefore, having physical models that can predict these openings is of interest. However, this source of information has not been used before in either transcription start site (TSS) or promoter prediction. In this article, one such model is used as an additional information source that, when used by a machine learning (ML) model, improves the results of current methods for the prediction of TSSs. In addition, we provide evidence on the validity of the physical model, as it is able by itself to predict TSSs with high accuracy. This opens an exciting avenue of research at the intersection of statistical mechanics and ML, where ML models in bioinformatics can be improved using physical models of DNA as feature extractors.
Affiliation(s)
- José Antonio Barbero-Aparicio
  - Departamento de Informática, Universidad de Burgos, Avda. de Cantabria s/n, 09006 Burgos, Spain (grid.23520.36, 0000 0000 8569 1592)
- Santiago Cuesta-Lopez
  - Universidad de Burgos, Hospital del Rey, s/n, 09001 Burgos, Spain (grid.23520.36, 0000 0000 8569 1592)
  - ICAMCyL Foundation, Internacional Center for Advanced Materials and Raw Materials of Castilla y León, León Technology Park, main building, first floor, offices 106-108, C/Julia Morros s/n, Armunia, 24009 León, Spain
- César Ignacio García-Osorio
  - Departamento de Informática, Universidad de Burgos, Avda. de Cantabria s/n, 09006 Burgos, Spain (grid.23520.36, 0000 0000 8569 1592)
- Javier Pérez-Rodríguez
  - Departamento de Métodos Cuantitativos, Universidad de Loyola Andalucía, Escritor Castilla Aguayo, 4, 14004 Córdoba, Spain (grid.449008.1, 0000 0004 1795 4150)
- Nicolás García-Pedrajas
  - Department of Computing and Numerical Analysis, University of Córdoba, Edificio Albert Einstein, Campus de Rabanales, 14071 Córdoba, Spain (grid.411901.c, 0000 0001 2183 9102)
10
Mai DHA, Nguyen LT, Lee EY. TSSNote-CyaPromBERT: Development of an integrated platform for highly accurate promoter prediction and visualization of Synechococcus sp. and Synechocystis sp. through a state-of-the-art natural language processing model BERT. Front Genet 2022; 13:1067562. [PMID: 36523764 PMCID: PMC9745317 DOI: 10.3389/fgene.2022.1067562] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/12/2022] [Accepted: 11/17/2022] [Indexed: 07/30/2023] Open
Abstract
Since the introduction of the first transformer model with a unique self-attention mechanism, natural language processing (NLP) models have attained state-of-the-art (SOTA) performance on various tasks. As DNA is the blueprint of life, it can be viewed as an unusual language, with its characteristic lexicon and grammar. Therefore, NLP models may provide insights into the meaning of the sequential structure of DNA. In the current study, we employed and compared the performance of popular SOTA NLP models (i.e., XLNET, BERT, and a variant DNABERT trained on the human genome) to predict and analyze the promoters in the freshwater cyanobacterium Synechocystis sp. PCC 6803 and the fastest growing cyanobacterium Synechococcus elongatus sp. UTEX 2973. These freshwater cyanobacteria are promising hosts for phototrophically producing value-added compounds from CO2. Through a custom pipeline, promoters and non-promoters from Synechococcus elongatus sp. UTEX 2973 were used to train the model. The trained model achieved an AUROC score of 0.97 and an F1 score of 0.92. During cross-validation with promoters from Synechocystis sp. PCC 6803, the model achieved an AUROC score of 0.96 and an F1 score of 0.91. To increase accessibility, we developed an integrated platform (TSSNote-CyaPromBERT) to facilitate large dataset extraction, model training, and promoter prediction from public dRNA-seq datasets. Furthermore, various visualization tools have been incorporated to address the "black box" issue of deep learning and feature analysis. The transfer learning ability of large language models may help identify and analyze promoter regions for newly isolated strains with similar lineages.
11
Bhandari N, Walambe R, Kotecha K, Khare SP. A comprehensive survey on computational learning methods for analysis of gene expression data. Front Mol Biosci 2022; 9:907150. [PMID: 36458095 PMCID: PMC9706412 DOI: 10.3389/fmolb.2022.907150] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/29/2022] [Accepted: 09/28/2022] [Indexed: 09/19/2023] Open
Abstract
Computational analysis methods including machine learning have a significant impact in the fields of genomics and medicine. High-throughput gene expression analysis methods such as microarray technology and RNA sequencing produce enormous amounts of data. Traditionally, statistical methods are used for comparative analysis of gene expression data. However, more complex analysis for classification of sample observations, or discovery of feature genes requires sophisticated computational approaches. In this review, we compile various statistical and computational tools used in analysis of expression microarray data. Even though the methods are discussed in the context of expression microarrays, they can also be applied for the analysis of RNA sequencing and quantitative proteomics datasets. We discuss the types of missing values, and the methods and approaches usually employed in their imputation. We also discuss methods of data normalization, feature selection, and feature extraction. Lastly, methods of classification and class discovery along with their evaluation parameters are described in detail. We believe that this detailed review will help the users to select appropriate methods for preprocessing and analysis of their data based on the expected outcome.
Affiliation(s)
- Nikita Bhandari
  - Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
- Rahee Walambe
  - Electronics and Telecommunication Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  - Symbiosis Center for Applied AI (SCAAI), Symbiosis International (Deemed University), Pune, India
- Ketan Kotecha
  - Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  - Symbiosis Center for Applied AI (SCAAI), Symbiosis International (Deemed University), Pune, India
- Satyajeet P. Khare
  - Symbiosis School of Biological Sciences, Symbiosis International (Deemed University), Pune, India
12
CapsProm: a capsule network for promoter prediction. Comput Biol Med 2022; 147:105627. [DOI: 10.1016/j.compbiomed.2022.105627] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/10/2021] [Revised: 04/05/2022] [Accepted: 04/11/2022] [Indexed: 11/21/2022]
13
Wei PJ, Pang ZZ, Jiang LJ, Tan D, Su Y, Zheng CH. Promoter Prediction in Nannochloropsis Based on Densely Connected Convolutional Neural Networks. Methods 2022; 204:38-46. [DOI: 10.1016/j.ymeth.2022.03.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2022] [Revised: 03/03/2022] [Accepted: 03/28/2022] [Indexed: 10/18/2022] Open
14
Cazier AP, Blazeck J. Advances in promoter engineering: novel applications and predefined transcriptional control. Biotechnol J 2021; 16:e2100239. [PMID: 34351706 DOI: 10.1002/biot.202100239] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 05/06/2021] [Revised: 07/30/2021] [Accepted: 08/03/2021] [Indexed: 11/08/2022]
Abstract
Synthetic biology continues to progress by relying on more robust tools for transcriptional control, of which promoters are the most fundamental component. Numerous studies have sought to characterize promoter function, determine principles to guide their engineering, and create promoters with stronger expression or tailored inducible control. In this review, we will summarize promoter architecture and highlight recent advances in the field, focusing on the novel applications of inducible promoter design and engineering towards metabolic engineering and cellular therapeutic development. Additionally, we will highlight how the expansion of new machine learning techniques for modeling and engineering promoter sequences is enabling more accurate prediction of promoter characteristics.
Affiliation(s)
- Andrew P Cazier
  - School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, 311 Ferst St. NW, Atlanta, Georgia, 30332, USA
- John Blazeck
  - School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, 311 Ferst St. NW, Atlanta, Georgia, 30332, USA