1
Khrennikov A, Iryama S, Basieva I, Sato K. Quantum-like environment adaptive model for creation of phenotype. Biosystems 2024; 242:105261. [PMID: 38964651 DOI: 10.1016/j.biosystems.2024.105261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2024] [Revised: 06/26/2024] [Accepted: 06/26/2024] [Indexed: 07/06/2024]
Abstract
The textbook conceptualization of phenotype creation, "genotype (G) + environment (E) + genotype & environment interactions (GE) ↦ phenotype (Ph)", is modeled with open quantum systems theory (OQST) or, more generally, with adaptive dynamics theory (ADT). The model is quantum-like, i.e., it is not about quantum physical processes in biosystems. Generally, such modeling applies the quantum formalism and methodology outside of physics. Macroscopic biosystems, in our case genotypes and phenotypes, are treated as information processors whose functioning matches the laws of quantum information theory. Phenotypes are the outputs of the E-adaptation processes described by the quantum master equation, the Gorini-Kossakowski-Sudarshan-Lindblad (GKSL) equation. Its stationary states correspond to phenotypes. We highlight the class of GKSL dynamics characterized by camel-like graphs of (von Neumann) entropy: in the process of E-adaptation, the entropy (disorder) of the phenotype's state first increases and then falls, and a stable and well-ordered phenotype is created. Traits, an organism's phenotypic characteristics, are modeled within quantum measurement theory as generally unsharp observables given by positive operator valued measures (POVMs). This paper is also a review of the methods and mathematical apparatus of quantum information biology.
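For reference, the GKSL master equation and the von Neumann entropy named in this abstract are usually written in the standard form below. The notation (Hamiltonian H, Lindblad operators L_k, non-negative rates γ_k, units with ħ = 1) follows common textbook usage and is an assumption here, not taken from the paper itself.

```latex
\frac{d\rho}{dt} = -i\,[H, \rho]
  + \sum_k \gamma_k \left( L_k \rho L_k^{\dagger}
  - \tfrac{1}{2} \left\{ L_k^{\dagger} L_k, \rho \right\} \right),
\qquad
S(\rho) = -\operatorname{Tr}\left( \rho \ln \rho \right)
```

A stationary state is one for which the right-hand side of the first equation vanishes; in the model described above, such states correspond to phenotypes.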
Affiliation(s)
- Andrei Khrennikov
- Linnaeus University, International Center for Mathematical Modeling in Physics and Cognitive Sciences, Växjö, SE-351 95, Sweden
- Satoshi Iryama
- Tokyo University of Science, Faculty of Science and Technology, Department of Information Sciences, Noda City, Chiba 278-8510, Japan
- Irina Basieva
- Linnaeus University, International Center for Mathematical Modeling in Physics and Cognitive Sciences, Växjö, SE-351 95, Sweden
- Keiko Sato
- Tokyo University of Science, Faculty of Science and Technology, Department of Information Sciences, Noda City, Chiba 278-8510, Japan
2
Ataş PK. A novel hybrid model to predict concomitant diseases for Hashimoto's thyroiditis. BMC Bioinformatics 2023; 24:319. [PMID: 37620755 PMCID: PMC10464155 DOI: 10.1186/s12859-023-05443-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/25/2023] [Accepted: 08/10/2023] [Indexed: 08/26/2023] Open
Abstract
Hashimoto's thyroiditis is an autoimmune disorder characterized by the destruction of thyroid cells through immune-mediated mechanisms involving cells and antibodies. The condition can trigger disturbances in metabolism, leading to the development of other autoimmune diseases, known as concomitant diseases. Multiple concomitant diseases may coexist in a single individual, making it challenging to diagnose and manage them effectively. This study aims to propose a novel hybrid algorithm that classifies concomitant diseases associated with Hashimoto's thyroiditis based on sequences. The approach involves building distinct prediction models for each class and using the output of one model as input for the subsequent one, resulting in a dynamic decision-making process. Genes associated with concomitant diseases were collected alongside those related to Hashimoto's thyroiditis, and their sequences were obtained from the NCBI site in fasta format. The hybrid algorithm was evaluated against common machine learning algorithms and their various combinations. The experimental results demonstrate that the proposed hybrid model outperforms existing classification methods in terms of performance metrics. The significance of this study lies in its two distinctive aspects. Firstly, it presents a new benchmarking dataset that has not been previously developed in this field, using diverse methods. Secondly, it proposes a more effective and efficient solution that accounts for the dynamic nature of the dataset. The hybrid approach holds promise in investigating the genetic heterogeneity of complex diseases such as Hashimoto's thyroiditis and identifying new autoimmune disease genes. Additionally, the results of this study may aid in the development of genetic screening tools and laboratory experiments targeting Hashimoto's thyroiditis genetic risk factors. 
New software, models, and techniques for computing, including systems biology, machine learning, and artificial intelligence, are used in our study.
Affiliation(s)
- Pınar Karadayı Ataş
- Department of Computer Engineering, Istanbul Arel University, 34537, Buyukcekmece, Istanbul, Turkey.
3
Mathur G, Pandey A, Goyal S. A review on blockchain for DNA sequence: security issues, application in DNA classification, challenges and future trends. Multimedia Tools and Applications 2023:1-23. [PMID: 37362738 PMCID: PMC10209554 DOI: 10.1007/s11042-023-15857-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 03/09/2023] [Accepted: 05/15/2023] [Indexed: 06/28/2023]
Abstract
In biological science, the study of DNA sequences is considered an important factor because it carries the genomic details that can be used by researchers and doctors for the early prediction of disease using DNA classification. The NCBI has the world's largest database of genetic sequences, but the security of this massive amount of data is currently the greatest issue. One of the options is to encrypt these genetic sequences using blockchain technology. As a result, this paper presents a survey on healthcare data breaches, the necessity for blockchain in healthcare, and the number of research studies done in this area. In addition, the report suggests DNA sequence classification for earlier disease identification and evaluates previous work in the field.
Affiliation(s)
- Garima Mathur
- Department of Computer Science and Engineering, UIT, RGPV, Bhopal, India
- Anjana Pandey
- Department of Information Technology, UIT, RGPV, Bhopal, India
- Sachin Goyal
- Department of Information Technology, UIT, RGPV, Bhopal, India
4
Mathur G, Pandey A, Goyal S. A comprehensive tool for rapid and accurate prediction of disease using DNA sequence classifier. Journal of Ambient Intelligence and Humanized Computing 2022; 14:1-17. [PMID: 35789598 PMCID: PMC9243743 DOI: 10.1007/s12652-022-04099-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2021] [Accepted: 06/06/2022] [Indexed: 06/15/2023]
Abstract
In the current pandemic, the coronavirus is spreading very fast and can jump from one human to another. In addition, there are millions of viruses, for example Ebola and SARS, that are equally deadly and can spread as fast as the coronavirus because of the mobilization and globalization of the population. Earlier identification of these viruses can prevent outbreaks like the one we are currently facing and can help in the earlier design of drugs. Identification of disease at an early stage can be achieved through DNA sequence classification, as DNA carries most of the genetic information about organisms. This is why the classification of DNA sequences plays an important role in computational biology. This paper presents a solution in which samples collected from NCBI are used for the classification of DNA sequences. DNA sequence classification in turn gives the patterns of various diseases; these patterns are then compared with samples from a newly infected person and can help in the earlier identification of disease. However, feature extraction remains a major issue. In this paper, a machine learning-based classifier and a new technique for extracting features from DNA sequences based on a hot vector matrix are proposed. In the hot vector representation of the DNA sequence, each pair of the word is represented using a binary matrix that represents the position of each nucleotide in the DNA sequence. The resultant matrix is then given as input to a traditional CNN for feature extraction. The proposed method has been compared with five well-known classifiers, namely convolutional neural network (CNN), support vector machine (SVM), K-nearest neighbor (KNN), decision tree, and recurrent neural network (RNN), on several measures including precision and accuracy; the results show that the proposed method achieves an accuracy of 93.9%, the highest among the compared classifiers.
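The binary-matrix representation mentioned in this abstract can be illustrated with a plain per-nucleotide one-hot encoding. This is only a minimal sketch under stated assumptions: the function name `one_hot_encode`, the column order A, C, G, T, and the all-zero row for ambiguous bases are choices made here for illustration, and the authors' exact "hot vector" scheme may differ.

```python
NUCLEOTIDES = "ACGT"  # assumed column order; not specified in the abstract


def one_hot_encode(sequence):
    """Return one row per sequence position, each a 4-element 0/1 list."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    matrix = []
    for base in sequence.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:  # ambiguous bases such as N stay all-zero (assumption)
            row[index[base]] = 1
        matrix.append(row)
    return matrix


encoded = one_hot_encode("acgn")
```

A matrix of this shape (sequence length by 4) is the kind of input a convolutional network could consume for feature extraction.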
Affiliation(s)
- Garima Mathur
- Department of Computer Science and Engineering, UIT, RGPV, Bhopal, India
- Anjana Pandey
- Department of Information Technology, UIT, RGPV, Bhopal, India
- Sachin Goyal
- Department of Information Technology, UIT, RGPV, Bhopal, India
5
Du Z, Xiao X, Uversky VN. Classification of Chromosomal DNA Sequences Using Hybrid Deep Learning Architectures. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200224095531] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/22/2022]
Abstract
Background: Chromosomal DNA contains most of the genetic information of eukaryotes and plays an important role in the growth, development and reproduction of living organisms. Most chromosomal DNA sequences are known to wrap around histones, and distinguishing these DNA sequences from ordinary DNA sequences is important for understanding the genetic code of life. The main difficulty behind this problem is the feature selection process. DNA sequences have no explicit features, and the common representation methods, such as one-hot coding, introduced the major drawback of high dimensionality. Recently, deep learning models have been proved to be able to automatically extract useful features from input patterns.
Objective: We aim to investigate which deep learning networks could achieve notable improvements in the field of DNA sequence classification using only sequence information.
Methods: In this paper, we present four different deep learning architectures using convolutional neural networks and long short-term memory networks for the purpose of chromosomal DNA sequence classification. Natural language model Word2vec was used to generate word embedding of sequence and learn features from it by deep learning.
Results: The comparison of these four architectures is carried out on 10 chromosomal DNA datasets. The results show that the architecture of convolutional neural networks combined with long short-term memory networks is superior to other methods with regards to the accuracy of chromosomal DNA prediction.
Conclusion: In this study, four deep learning models were compared for an automatic classification of chromosomal DNA sequences with no steps of sequence preprocessing. In particular, we have regarded DNA sequences as natural language and extracted word embedding with Word2Vec to represent DNA sequences. Results show a superiority of the CNN+LSTM model in the ten classification tasks. The reason for this success is that the CNN module captures the regulatory motifs, while the following LSTM layer captures the long-term dependencies between them.
Affiliation(s)
- Zhihua Du
- Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, China
- Xiangdong Xiao
- Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, China
- Vladimir N. Uversky
- Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, 12901 Bruce B. Downs Blvd. MDC07, Tampa, Florida, (V.N.U.), United States
6
Power spectrum and dynamic time warping for DNA sequences classification. Evolving Systems 2020. [DOI: 10.1007/s12530-019-09306-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
7
Yang A, Zhang W, Wang J, Yang K, Han Y, Zhang L. Review on the Application of Machine Learning Algorithms in the Sequence Data Mining of DNA. Front Bioeng Biotechnol 2020; 8:1032. [PMID: 33015010 PMCID: PMC7498545 DOI: 10.3389/fbioe.2020.01032] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Received: 05/22/2020] [Accepted: 08/10/2020] [Indexed: 11/13/2022] Open
Abstract
Deoxyribonucleic acid (DNA) is a biological macromolecule. Its main function is information storage. At present, the advancement of sequencing technology has caused DNA sequence data to grow at an explosive rate, which has also pushed the study of DNA sequences into the wave of big data. Moreover, machine learning is a powerful technique for analyzing large-scale data and learns spontaneously to gain knowledge. It has been widely used in DNA sequence data analysis and has produced many research achievements. Firstly, the review introduces the development of sequencing technology and expounds on the concepts of DNA sequence data structure and sequence similarity. Then we analyze the basic process of data mining, summarize several major machine learning algorithms, and put forward the challenges faced by machine learning algorithms in the mining of biological sequence data and possible solutions in the future. Then we review four typical applications of machine learning in DNA sequence data: DNA sequence alignment, DNA sequence classification, DNA sequence clustering, and DNA pattern mining. We analyze their corresponding biological application background and significance, and systematically summarize the development and potential problems in the field of DNA sequence data mining in recent years. Finally, we summarize the content of the review and look ahead to some research directions for the next step.
Affiliation(s)
- Aimin Yang
- College of Science, North China University of Science and Technology, Tangshan, China
- Wei Zhang
- College of Science, North China University of Science and Technology, Tangshan, China
- Jiahao Wang
- College of Science, North China University of Science and Technology, Tangshan, China
- Ke Yang
- College of Yi Sheng, North China University of Science and Technology, Tangshan, China
- Yang Han
- College of Science, North China University of Science and Technology, Tangshan, China
- Limin Zhang
- Mathematics and Computer Department, Hengshui University, Hengshui, China
8
Graphical classification of DNA sequences of HLA alleles by deep learning. Hum Cell 2018; 31:102-105. [PMID: 29327117 PMCID: PMC5852191 DOI: 10.1007/s13577-017-0194-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 08/26/2017] [Accepted: 11/22/2017] [Indexed: 12/03/2022]
Abstract
Alleles of human leukocyte antigen (HLA)-A DNAs are classified and expressed graphically by using artificial intelligence “Deep Learning (Stacked autoencoder)”. Nucleotide sequence data corresponding to the length of 822 bp, collected from the Immuno Polymorphism Database, were compressed to 2-dimensional representation and were plotted. Profiles of the two-dimensional plots indicate that the alleles can be classified as clusters are formed. The two-dimensional plot of HLA-A DNAs gives a clear outlook for characterizing the various alleles.
9
Yaragatti M, Sandler T, Ungar L. A predictive model for identifying mini-regulatory modules in the mouse genome. Bioinformatics 2008; 25:353-7. [DOI: 10.1093/bioinformatics/btn622] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 02/06/2023] Open
10
Rosen G, Garbarine E, Caseiro D, Polikar R, Sokhansanj B. Metagenome fragment classification using N-mer frequency profiles. Adv Bioinformatics 2008; 2008:205969. [PMID: 19956701 PMCID: PMC2777009 DOI: 10.1155/2008/205969] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.4] [Received: 06/05/2008] [Revised: 09/19/2008] [Accepted: 09/30/2008] [Indexed: 11/17/2022] Open
Abstract
A vast amount of microbial sequencing data is being generated through large-scale projects in ecology, agriculture, and human health. Efficient high-throughput methods are needed to analyze the mass amounts of metagenomic data, all DNA present in an environmental sample. A major obstacle in metagenomics is the inability to obtain accuracy using technology that yields short reads. We construct the unique N-mer frequency profiles of 635 microbial genomes publicly available as of February 2008. These profiles are used to train a naive Bayes classifier (NBC) that can be used to identify the genome of any fragment. We show that our method is comparable to BLAST for small 25 bp fragments but does not have the ambiguity of BLAST's tied top scores. We demonstrate that this approach is scalable to identify any fragment from hundreds of genomes. It also performs quite well at the strain, species, and genera levels and achieves strain resolution despite classifying ubiquitous genomic fragments (gene and nongene regions). Cross-validation analysis demonstrates that species-accuracy achieves 90% for highly-represented species containing an average of 8 strains. We demonstrate that such a tool can be used on the Sargasso Sea dataset, and our analysis shows that NBC can be further enhanced.
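The idea of training a naive Bayes classifier on N-mer frequency profiles can be sketched in a few lines. This is a toy illustration, not the authors' NBC tool: the function names, the add-one pseudocount, the uniform prior over genomes, and the two made-up "genomes" are all assumptions introduced here.

```python
import math
from collections import Counter


def nmer_counts(sequence, n):
    """Count overlapping N-mers in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def train_profiles(genomes, n):
    """Build one N-mer count profile per genome label."""
    return {label: nmer_counts(seq, n) for label, seq in genomes.items()}


def classify(fragment, profiles, n, pseudocount=1.0):
    """Return the label with the highest smoothed log-likelihood (uniform prior)."""
    vocabulary = 4 ** n  # number of possible N-mers over A, C, G, T
    best_label, best_score = None, -math.inf
    for label, counts in profiles.items():
        total = sum(counts.values())
        score = 0.0
        for i in range(len(fragment) - n + 1):
            nmer = fragment[i:i + n]
            # Laplace-smoothed probability of this N-mer under the genome's profile
            score += math.log((counts[nmer] + pseudocount)
                              / (total + pseudocount * vocabulary))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy genomes, for illustration only
toy_genomes = {"genome_A": "ATATATATATATATAT", "genome_B": "GCGCGCGCGCGCGCGC"}
profiles = train_profiles(toy_genomes, n=3)
predicted = classify("TATATA", profiles, n=3)
```

Because the score is a single log-likelihood per genome, exact ties between genomes are rare, which matches the abstract's point about avoiding tied top scores.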
Affiliation(s)
- Gail Rosen
- Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104, USA
- Elaine Garbarine
- Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104, USA
- Robi Polikar
- Department of Electrical and Computer Engineering, Rowan University, Glassboro, NJ 08028, USA
- Bahrad Sokhansanj
- School of Biomedical Engineering, Science & Health Systems, Drexel University, Philadelphia, PA 19130, USA
11
Liang G, Li Z. Scores of generalized base properties for quantitative sequence-activity modelings for E. coli promoters based on support vector machine. J Mol Graph Model 2007; 26:269-81. [PMID: 17291800 DOI: 10.1016/j.jmgm.2006.12.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 06/30/2006] [Revised: 11/18/2006] [Accepted: 12/10/2006] [Indexed: 10/23/2022]
Abstract
A novel base sequence representation technique, namely SGBP (scores of generalized base properties), was derived from principal component analysis of a matrix of 1209 property parameters including 0D, 1D, 2D and 3D information for the five bases A, C, G, T and U. It was then employed to represent sequence structures of E. coli promoters. Variables used as inputs of partial least squares (PLS) and support vector machine (SVM) were selected by genetic arithmetic-partial least square. According to D-optimal design, all samples were divided into a training set, which was used to develop quantitative sequence-activity modelings (QSAMs), and a test set, which was used to validate the predictive power of the resulting models. Investigation of QSAM by PLS showed that the properties of the bases at positions -42, -34, -31, -33, -41, -46 and -29 may have more influence on strengths, which points us further in the direction of strong promoters. Parameters of SVM were determined by response surface methodology. Satisfactory results indicated that the simulative and predictive abilities of QSAM by SVM for the internal and external samples were better than those of PLS. These results showed that SGBP is a useful structural representation methodology in QSAMs due to its many advantages, including plentiful structural information, easy manipulation, and high characterization competence. Moreover, the SGBP-GA-SVM route can be further applied to sequence design and activity prediction for DNA or RNA.
Affiliation(s)
- Guizhao Liang
- College of Bioengineering, Chongqing University, Chongqing 400030, PR China
12
Radomski JP, Slonimski PP. Primary sequences of proteins from complete genomes display a singular periodicity: Alignment-free N-gram analysis. C R Biol 2007; 330:33-48. [PMID: 17241946 DOI: 10.1016/j.crvi.2006.11.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 10/19/2006] [Accepted: 11/07/2006] [Indexed: 11/25/2022]
Abstract
A method is proposed to represent and to analyze complete genome sequences (52 species from procaryotes and eukaryotes), based upon n-gram sequence's frequencies of amino acid pairs (bigrams), separated by a given number of other residues. For each of the species analyzed, it allows us to construct over-abundant and over-deficient occurrence profiles, summarizing amino acid bigram frequencies over the entire genome. The method deals efficiently with a sparseness of statistical representations of individual sequences, and describes every gene sequence in the same way, independently of its length and of the genome sizes. The frequency of over-abundant and over-deficient occurrences of bigrams presents a singular periodicity around 3.5 peptide bonds, suggesting a relation with the alpha helical secondary structure.
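Counting amino acid pairs separated by a fixed number of residues, as this abstract describes, can be sketched as follows. Only the raw counting step is shown; the over-abundance and over-deficiency statistics of the paper are not reproduced, and the function name `gapped_bigram_counts` and its `gap` parameter are hypothetical names chosen for illustration.

```python
from collections import Counter


def gapped_bigram_counts(protein, gap):
    """Count residue pairs (a, b) where b follows a with `gap` residues in between."""
    step = gap + 1  # gap=0 gives adjacent pairs
    return Counter((protein[i], protein[i + step])
                   for i in range(len(protein) - step))


adjacent = gapped_bigram_counts("ACDA", gap=0)
separated = gapped_bigram_counts("ACDA", gap=2)
```

Repeating the count for a range of `gap` values gives one profile per separation, which is the kind of series in which a periodicity could be looked for.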
Affiliation(s)
- Jan P Radomski
- Interdisciplinary Centre for Mathematical and Computational Modelling, Warsaw University, Pawińskiego 5A, Bldg. D, 02106 Warsaw, Poland.
13
Cannon CH, Kua CS, Lobenhofer EK, Hurban P. Capturing genomic signatures of DNA sequence variation using a standard anonymous microarray platform. Nucleic Acids Res 2006; 34:e121. [PMID: 17000641 PMCID: PMC1636412 DOI: 10.1093/nar/gkl478] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022] Open
Abstract
Comparative genomics, using the model organism approach, has provided powerful insights into the structure and evolution of whole genomes. Unfortunately, only a small fraction of Earth's biodiversity will have its genome sequenced in the foreseeable future. Most wild organisms have radically different life histories and evolutionary genomics than current model systems. A novel technique is needed to expand comparative genomics to a wider range of organisms. Here, we describe a novel approach using an anonymous DNA microarray platform that gathers genomic samples of sequence variation from any organism. Oligonucleotide probe sequences placed on a custom 44 K array were 25 bp long and designed using a simple set of criteria to maximize their complexity and dispersion in sequence probability space. Using whole genomic samples from three known genomes (mouse, rat and human) and one unknown (Gonystylus bancanus), we demonstrate and validate its power, reliability, transitivity and sensitivity. Using two separate statistical analyses, a large number of genomic ‘indicator’ probes were discovered. The construction of a genomic signature database based upon this technique would allow virtual comparisons, and simple queries could generate optimal subsets of markers to be used in large-scale assays, using simple downstream techniques. Biologists from a wide range of fields, studying almost any organism, could efficiently perform genomic comparisons, at potentially any phylogenetic level, after performing a small number of standardized DNA microarray hybridizations. Possibilities for refining and expanding the approach are discussed.
Affiliation(s)
- C. H. Cannon
- To whom correspondence should be addressed. Tel: +1 806 742 3993; Fax: +1 806 742 2963;
- C. S. Kua
- 27 Jln. Dato Haji Harun, Taman Tayton View, Kuala Lumpur, Malaysia
- E. K. Lobenhofer
- Paradigm Array Labs, a Service Unit of Icoria Inc., Research Triangle Park, NC 27709, USA
- P. Hurban
- Paradigm Array Labs, a Service Unit of Icoria Inc., Research Triangle Park, NC 27709, USA