1
Ghosh S, Pal J, Cattani C, Maji B, Bhattacharya DK. Protein sequence comparison based on representation on a finite dimensional unit hypercube. J Biomol Struct Dyn 2024; 42:6425-6439. [PMID: 37837426 DOI: 10.1080/07391102.2023.2268719] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/16/2023] [Accepted: 07/01/2023] [Indexed: 10/16/2023]
Abstract
Numerous techniques are used to compare protein sequences based on the values of the physicochemical properties of amino acids. In this work, a single physical/chemical property value based non-binary representation of protein sequences is obtained on a 20 × 20-dimensional unit hypercube. The represented vector expressed in the matrix form is taken as the descriptor. The generalized NTV metric, which is an extension of the NTV metric used for polynucleotide space, is taken as a distance measure. Based on this distance measure, a distance matrix is obtained for protein sequence comparison. Using this distance matrix, phylogenetic trees are drawn by using Molecular Evolutionary Genetics Analysis 11 (MEGA11) software applying the neighbor-joining method. The data sets used in this work are 9-ND4, 9-ND5, 9-ND6, 24 TF-LF proteins, 27 different viruses and 127 proteins from the protein kinase C (PKC) family. Two sets of phylogenetic trees are obtained: one based on the property value of polarity and the other based on the property value of molecular weight. They are found to be exactly the same. Similar results also hold for other single property value based representations. The present trees are individually tested for efficiency based on the criterion of rationalized perception and computational time. The results of the present method are compared with those obtained earlier by other methods on the same protein sequences using the assessment criteria of Symmetric distance (SD), Correlation coefficient, and Rationalized perception. In all the cases, the present results are found to be better than the results of other methods under comparison. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Soumen Ghosh
- Electronics & Communication Engineering, National Institute of Technology, Durgapur, West Bengal, India
- Information Technology, Narula Institute of Technology, Kolkata, West Bengal, India
- Jayanta Pal
- Computer Science & Engineering, Narula Institute of Technology, Kolkata, West Bengal, India
- Carlo Cattani
- DEIM, University of Tuscia, Largo dell'Universita, Viterbo, Italy
- Bansibadan Maji
- Electronics & Communication Engineering, National Institute of Technology, Durgapur, West Bengal, India
2
Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024]
Abstract
Traditional alignment-based methods meet serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then for each DNA sequence or protein sequence in a dataset, our method converts the sequence into a feature vector that represents the sequence information based on CGR weighted by the DL model to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences of 8 datasets of viruses to construct the phylogenetic trees. We compared the Robinson-Foulds (RF) distance between the phylogenetic tree constructed by CGRWDL and the reference tree by other advanced methods for each dataset. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and the RF scores between the trees and the reference trees are smaller than that with other methods.
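As an illustration of the chaos game representation mentioned in this abstract, the following is a minimal sketch of the plain (unweighted) CGR of a DNA sequence. The corner assignment and the function name `cgr_points` are assumptions made for this example, and the dynamical language weighting used by CGRWDL is not reproduced here.

```python
# Plain chaos game representation (CGR) of a DNA sequence.
# Assumption: corners follow one common convention (A, C, G, T on the unit square);
# the CGRWDL paper may use a different layout and adds a DL-model weighting.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the list of CGR points, one per recognized nucleotide."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the corner of this base.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    print(cgr_points("AG"))
```

Each point encodes the whole prefix read so far, which is why nearby points in the square share a common suffix (context) of the sequence.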
Affiliation(s)
- Ting Wang
- National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Zu-Guo Yu
- National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Jinyan Li
- School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
3
Tang R, Yu Z, Li J. KINN: An alignment-free accurate phylogeny reconstruction method based on inner distance distributions of k-mer pairs in biological sequences. Mol Phylogenet Evol 2023; 179:107662. [PMID: 36375789 DOI: 10.1016/j.ympev.2022.107662] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/30/2022] [Revised: 10/10/2022] [Accepted: 11/02/2022] [Indexed: 11/13/2022]
Abstract
Alignment-based methods have faced disadvantages in sequence comparison and phylogeny reconstruction due to their high computational complexity. Alignment-free methods for sequence comparison and phylogeny inference have attracted a great deal of attention in recent years. Here, we explore an alignment-free approach that uses inner distance distributions of k-mer pairs in biological sequences for phylogeny inference. For every sequence in a dataset, our method transforms the sequence into a numeric feature vector consisting of features each representing a specific k-mer pair's contribution to the characterization of the sequentiality uniqueness of the sequence. This newly defined k-mer pair's contribution is an integration of the reverse Kullback-Leibler divergence, pseudo mode and the classic entropy of an inner distance distribution of the k-mer pair in the sequence. Our method has been tested on datasets of complete genome sequences, complete protein sequences, and gene sequences of rRNA of various lengths. Our method achieves the best performance in comparison with state-of-the-art alignment-free methods as measured by the Robinson-Foulds distance between the reference and the constructed phylogeny trees.
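To make the idea of a distance distribution for a k-mer pair more concrete, here is a rough sketch. It assumes, purely for illustration, that the "inner distance" of a pair (u, v) is the offset from each occurrence of u to the next later occurrence of v; the abstract does not give KINN's exact definition, and only the classic Shannon entropy term is shown (not the reverse Kullback-Leibler divergence or pseudo mode). All function names are hypothetical.

```python
from collections import Counter
from math import log2


def occurrences(seq, kmer):
    """Start positions of every (possibly overlapping) occurrence of kmer."""
    k = len(kmer)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == kmer]


def inner_distances(seq, u, v):
    """Assumed definition: offset from each occurrence of u to the next later v."""
    v_positions = occurrences(seq, v)
    distances = []
    for i in occurrences(seq, u):
        later = [j for j in v_positions if j > i]
        if later:
            distances.append(later[0] - i)
    return distances


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())


if __name__ == "__main__":
    d = inner_distances("ACGTACGGT", "AC", "GT")
    print(d, shannon_entropy(d))
```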
Affiliation(s)
- Runbin Tang
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China; School of Mathematical Sciences, Chongqing Normal University, Chongqing 401331, China
- Zuguo Yu
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China.
- Jinyan Li
- Data Science Institute, University of Technology Sydney, Ultimo, NSW 2007, Australia.
4
Bohnsack KS, Kaden M, Abel J, Villmann T. Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:119-135. [PMID: 34990369 DOI: 10.1109/tcbb.2022.3140873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The encounter of large amounts of biological sequence data generated during the last decades and the algorithmic and hardware improvements have offered the possibility to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the necessity to rigorously distinguish data transformation from data comparison and adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. With realization of the disadvantages of alignments for sequence comparison, some typical applications use more and more so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of: 1) the sequence data transformation comprising of adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be an information-loss free data transformation in order to get an appropriate representation, whereas feature generation is inevitably information-lossy with the intention to extract just the task-relevant information. This distinction sheds light on the plethora of methods available and assists in identifying suitable methods in machine learning and data analysis to compare the sequences under these premises.
5
Redrado S, Esteban P, Domingo MP, Lopez C, Rezusta A, Ramirez-Labrada A, Arias M, Pardo J, Galvez EM. Integration of In Silico and In Vitro Analysis of Gliotoxin Production Reveals a Narrow Range of Producing Fungal Species. J Fungi (Basel) 2022; 8:jof8040361. [PMID: 35448592 PMCID: PMC9030297 DOI: 10.3390/jof8040361] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/25/2022] [Revised: 03/28/2022] [Accepted: 03/29/2022] [Indexed: 02/06/2023]
Abstract
Gliotoxin is a fungal secondary metabolite with an impact on health and agriculture, since it might act as a virulence factor and contaminate human and animal food. Homologous gliotoxin (GT) gene clusters are spread across a number of fungal species, although whether they produce GT or other related epipolythiodioxopiperazines (ETPs) remains obscure. Using bioinformatic tools, we have identified homologous gli gene clusters similar to the A. fumigatus GT gene cluster in several fungal species. The in silico study led to in vitro confirmation of GT and bisdethiobis(methylthio)gliotoxin (bmGT) production in fungal strain cultures by HPLC detection. Although we selected the most similar homologous gli gene clusters in 20 different species, GT and bmGT were only detected in section Fumigati species and in a Trichoderma virens Q strain. Our results suggest that in silico gli homology analyses in different fungal strains to predict GT production might only be informative when accompanied by analysis of mycotoxin production in cell cultures.
Affiliation(s)
- Sergio Redrado
- Instituto de Carboquímica ICB-CSIC, 50018 Zaragoza, Spain; (S.R.); (M.P.D.)
- Patricia Esteban
- Biomedical Research Centre of Aragon (CIBA), Fundacion Instituto de Investigacion Sanitaria Aragon (IIS Aragon), 50009 Zaragoza, Spain; (P.E.); (A.R.-L.); (M.A.); (J.P.)
- Concepción Lopez
- Department of Microbiology, Hospital Universitario Miguel Servet, IIS Aragón, 50009 Zaragoza, Spain; (C.L.); (A.R.)
- Antonio Rezusta
- Department of Microbiology, Hospital Universitario Miguel Servet, IIS Aragón, 50009 Zaragoza, Spain; (C.L.); (A.R.)
- Ariel Ramirez-Labrada
- Biomedical Research Centre of Aragon (CIBA), Fundacion Instituto de Investigacion Sanitaria Aragon (IIS Aragon), 50009 Zaragoza, Spain; (P.E.); (A.R.-L.); (M.A.); (J.P.)
- Maykel Arias
- Biomedical Research Centre of Aragon (CIBA), Fundacion Instituto de Investigacion Sanitaria Aragon (IIS Aragon), 50009 Zaragoza, Spain; (P.E.); (A.R.-L.); (M.A.); (J.P.)
- Julián Pardo
- Biomedical Research Centre of Aragon (CIBA), Fundacion Instituto de Investigacion Sanitaria Aragon (IIS Aragon), 50009 Zaragoza, Spain; (P.E.); (A.R.-L.); (M.A.); (J.P.)
- Department of Microbiology, Pediatrics, Radiology and Public Health, University of Zaragoza, 50009 Zaragoza, Spain
- Aragon I+D Foundation (ARAID), 50018 Zaragoza, Spain
- Eva M. Galvez
- Instituto de Carboquímica ICB-CSIC, 50018 Zaragoza, Spain; (S.R.); (M.P.D.)
- Correspondence:
6
Identification of HIV Rapid Mutations Using Differences in Nucleotide Distribution over Time. Genes (Basel) 2022; 13:genes13020170. [PMID: 35205215 PMCID: PMC8872422 DOI: 10.3390/genes13020170] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/28/2021] [Revised: 01/08/2022] [Accepted: 01/12/2022] [Indexed: 02/07/2023]
Abstract
Mutation is the driving force of species evolution, which may change the genetic information of organisms and obtain selective competitive advantages to adapt to environmental changes. It may change the structure or function of translated proteins, and cause abnormal cell operation, a variety of diseases and even cancer. Therefore, it is particularly important to identify gene regions with high mutations. Mutations will cause changes in nucleotide distribution, which can be characterized by natural vectors globally. Based on natural vectors, we propose a mathematical formula for measuring the difference in nucleotide distribution over time to investigate the mutations of human immunodeficiency virus. The studied dataset is from public databases and includes gene sequences from twenty HIV-infected patients. The results show that the mutation rate of the nine major genes or gene segment regions in the genome exhibits discrepancy during the infected period, and the Env gene has the fastest mutation rate. We deduce that the peak of virus mutation has a close temporal relationship with viral divergence and diversity. The mutation study of HIV is of great significance to clinical diagnosis and drug design.
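The natural vector referred to in this abstract can be sketched as follows. This is a minimal illustration of the commonly used 12-dimensional form for DNA (counts, mean positions, and normalized second central moments of each nucleotide), assuming 1-based positions; the function name is hypothetical, and the paper's formula for the difference in nucleotide distribution over time is not given in the abstract and is not reproduced.

```python
def natural_vector(seq, alphabet="ACGT"):
    """Return [n_A..n_T, mu_A..mu_T, D2_A..D2_T] for a DNA sequence.

    n_k  : number of occurrences of nucleotide k
    mu_k : mean (1-based) position of nucleotide k
    D2_k : sum((pos - mu_k)^2) / (n_k * n), with n the sequence length
    Nucleotides that do not occur contribute zeros.
    """
    n = len(seq)
    counts, means, moments = [], [], []
    for letter in alphabet:
        positions = [i + 1 for i, ch in enumerate(seq) if ch == letter]
        count = len(positions)
        if count == 0:
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mean = sum(positions) / count
        d2 = sum((p - mean) ** 2 for p in positions) / (count * n)
        counts.append(count)
        means.append(mean)
        moments.append(d2)
    return counts + means + moments


if __name__ == "__main__":
    print(natural_vector("AAC"))
```

Two sequences can then be compared by a distance (for example Euclidean) between their vectors, which is the sense in which mutations shift the nucleotide distribution.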
7
Gorbalenya AE, Lauber C. Bioinformatics of virus taxonomy: foundations and tools for developing sequence-based hierarchical classification. Curr Opin Virol 2021; 52:48-56. [PMID: 34883443 DOI: 10.1016/j.coviro.2021.11.003] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/22/2021] [Revised: 10/22/2021] [Accepted: 11/04/2021] [Indexed: 11/03/2022]
Abstract
The genome sequence is the only characteristic readily obtainable for all known viruses, underlying the growing role of comparative genomics in organizing knowledge about viruses in a systematic evolution-aware way, known as virus taxonomy. Overseen by the International Committee on Taxonomy of Viruses (ICTV), development of virus taxonomy involves taxa demarcation at 15 ranks of a hierarchical classification, often in host-specific manner. Outside the ICTV remit, researchers assess fitting numerous unclassified viruses into the established taxa. They employ different metrics of virus clustering, basing on conserved domain(s), separation of viruses in rooted phylogenetic trees and pair-wise distance space. Computational approaches differ further in respect to methodology, number of ranks considered, sensitivity to uneven virus sampling, and visualization of results. Advancing and using computational tools will be critical for improving taxa demarcation across the virosphere and resolving rank origins in research that may also inform experimental virology.
Affiliation(s)
- Alexander E Gorbalenya
- Department of Medical Microbiology, Leiden University Medical Center, Leiden, The Netherlands; Faculty of Bioengineering and Bioinformatics and Belozersky, Institute of Physico-Chemical Biology, Lomonosov Moscow State University, 119899, Moscow, Russia.
- Chris Lauber
- Institute for Experimental Virology, TWINCORE Centre for Experimental and Clinical Infection Research, A Joint Venture between the Hannover Medical School (MHH) and the Helmholtz Centre for Infection Research (HZI), Hannover, Germany
8
Bohnsack KS, Kaden M, Abel J, Saralajew S, Villmann T. The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers. Entropy (Basel) 2021; 23:1357. [PMID: 34682081 PMCID: PMC8534762 DOI: 10.3390/e23101357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2021] [Revised: 10/11/2021] [Accepted: 10/14/2021] [Indexed: 11/16/2022]
Abstract
In the present article we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon-, Rényi-, and Tsallis-entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which allows substantial knowledge extraction in addition to the high classification ability due to the model-inherent robustness. Any potential (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by interpretable models. This knowledge may assist the user in the analysis and understanding of the used data and considered task. After theoretical justification of the concepts, we demonstrate the approach for various example data sets covering different areas in biomolecular sequence analysis.
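For readers unfamiliar with the mutual information function of a symbolic sequence, the sketch below computes the standard Shannon version at a given lag. It is only an illustration under the assumption that probabilities are estimated from symbol pairs at that lag; the resolved, Rényi, and Tsallis variants proposed in the article are not reproduced, and the function name is hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(seq, lag):
    """Shannon mutual information (bits) between symbols that are `lag` apart."""
    pairs = [(seq[i], seq[i + lag]) for i in range(len(seq) - lag)]
    if not pairs:
        return 0.0
    total = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)   # marginal of the leading symbol
    second = Counter(b for _, b in pairs)  # marginal of the trailing symbol
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / total
        # p_ab / (p_a * p_b) written with raw counts to limit rounding error
        mi += p_ab * log2(c * total / (first[a] * second[b]))
    return mi


if __name__ == "__main__":
    # A fingerprint is the vector of values over a range of lags.
    print([round(mutual_information("ACACACAC", lag), 3) for lag in (1, 2, 3)])
```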
Affiliation(s)
- Katrin Sophie Bohnsack
- Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany; (M.K.); (J.A.)
- Marika Kaden
- Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany; (M.K.); (J.A.)
- Julia Abel
- Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany; (M.K.); (J.A.)
- Sascha Saralajew
- Bosch Center for Artificial Intelligence, 71272 Renningen, Germany;
- Thomas Villmann
- Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany; (M.K.); (J.A.)
9
Kaden M, Bohnsack KS, Weber M, Kudła M, Gutowska K, Blazewicz J, Villmann T. Learning vector quantization as an interpretable classifier for the detection of SARS-CoV-2 types based on their RNA sequences. Neural Comput Appl 2021; 34:67-78. [PMID: 33935376 PMCID: PMC8076884 DOI: 10.1007/s00521-021-06018-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/13/2020] [Accepted: 04/07/2021] [Indexed: 02/06/2023]
Abstract
We present an approach to discriminate SARS-CoV-2 virus types based on their RNA sequence descriptions avoiding a sequence alignment. For that purpose, sequences are preprocessed by feature extraction and the resulting feature vectors are analyzed by prototype-based classification to remain interpretable. In particular, we propose to use variants of learning vector quantization (LVQ) based on dissimilarity measures for RNA sequence data. The respective matrix LVQ provides additional knowledge about the classification decisions like discriminant feature correlations and, additionally, can be equipped with easy to realize reject options for uncertain data. Those options provide self-controlled evidence, i.e., the model refuses to make a classification decision if the model evidence for the presented data is not sufficient. This model is first trained using a GISAID dataset with given virus types detected according to the molecular differences in coronavirus populations by phylogenetic tree clustering. In a second step, we apply the trained model to another but unlabeled SARS-CoV-2 virus dataset. For these data, we can either assign a virus type to the sequences or reject atypical samples. Those rejected sequences allow to speculate about new virus types with respect to nucleotide base mutations in the viral sequences. Moreover, this rejection analysis improves model robustness. Last but not least, the presented approach has lower computational complexity compared to methods based on (multiple) sequence alignment. SUPPLEMENTARY INFORMATION The online version contains supplementary material available at 10.1007/s00521-021-06018-2.
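The prototype-based classification with a reject option described in this abstract can be illustrated with a deliberately simplified sketch. It assumes plain LVQ1 with squared Euclidean distance and a fixed distance threshold for rejection; the article itself uses matrix LVQ variants with sequence dissimilarity measures, so the names, the update rule, and the threshold here are illustrative assumptions only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_with_reject(x, prototypes, max_dist):
    """Return the label of the nearest prototype, or None if it is too far away.

    prototypes: list of (vector, label) pairs.
    max_dist:   hypothetical rejection threshold on the Euclidean distance.
    """
    vec, label = min(prototypes, key=lambda p: sq_dist(x, p[0]))
    return label if sq_dist(x, vec) <= max_dist ** 2 else None


def lvq1_update(x, y, prototypes, lr=0.1):
    """One LVQ1 step: attract the winning prototype if labels match, else repel."""
    idx = min(range(len(prototypes)), key=lambda i: sq_dist(x, prototypes[i][0]))
    vec, label = prototypes[idx]
    sign = 1.0 if label == y else -1.0
    prototypes[idx] = ([w + sign * lr * (xi - w) for w, xi in zip(vec, x)], label)
    return prototypes


if __name__ == "__main__":
    protos = [([0.0, 0.0], "type_a"), ([10.0, 10.0], "type_b")]
    print(classify_with_reject([1.0, 1.0], protos, max_dist=3.0))  # near type_a
    print(classify_with_reject([5.0, 5.0], protos, max_dist=3.0))  # rejected
```

Rejected samples are the ones worth inspecting by hand, which mirrors the abstract's point that atypical sequences may hint at new virus types.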
Affiliation(s)
- Marika Kaden
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
- Katrin Sophie Bohnsack
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
- Mirko Weber
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
- Mateusz Kudła
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Kaja Gutowska
- Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland
- European Centre for Bioinformatics and Genomics, Piotrowo 2, 60-965 Poznan, Poland
- Jacek Blazewicz
- Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland
- European Centre for Bioinformatics and Genomics, Piotrowo 2, 60-965 Poznan, Poland
- Thomas Villmann
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
10
Wan X, Tan X. A Simple Protein Evolutionary Classification Method Based on the Mutual Relations Between Protein Sequences. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200305090055] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/22/2022]
Abstract
Background: Proteins are important organic molecules in life. They vary in their sequences, structures and functions. Protein evolutionary classification is one of the popular research topics in computational bioinformatics. Many studies have used protein sequence information to classify the evolutionary relationships of proteins. As the amount of protein sequence data increases, efficient computational tools are needed to make efficient protein evolutionary classifications with high accuracies in the big data paradigm.
Methods: In this study, we propose a new simple and efficient computational approach based on the normalized mutual information rates to compute the relationship between protein sequences. We then use the “distances” defined on the relationships to perform the evolutionary classifications of proteins. The new method is computationally efficient, model-free and unsupervised, and does not require training data when performing classifications.
Result: Simulation studies on various examples demonstrate the efficiency of the new method. We use precision-recall curves to compare the efficiency of our new method with traditional methods. The results show that the new method outperforms the traditional methods in most of the cases when performing evolutionary classifications.
Conclusion: The new method is simple and proved to be efficient in protein evolutionary classifications, which is useful in future evolutionary analysis, particularly in the big data paradigm.
Affiliation(s)
- Xiaogeng Wan
- Department of Mathematics, College of Mathematics and Physics, Beijing University of Chemical Technology, Beijing, 100029, China
- Xinying Tan
- The Fourth Center of PLA General Hospital, Beijing, 100037, China
11
iBLP: An XGBoost-Based Predictor for Identifying Bioluminescent Proteins. Comput Math Methods Med 2021; 2021:6664362. [PMID: 33505515 PMCID: PMC7808816 DOI: 10.1155/2021/6664362] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Received: 11/21/2020] [Revised: 12/13/2020] [Accepted: 12/28/2020] [Indexed: 02/07/2023]
Abstract
Bioluminescent proteins (BLPs) are a class of proteins that are widely distributed in many living organisms with various mechanisms of light emission, including bioluminescence and chemiluminescence from luminous organisms. Bioluminescence has been commonly used in various analytical research methods of cellular processes, such as gene expression analysis, drug discovery, cellular imaging, and toxicity determination. However, the identification of bioluminescent proteins is challenging as they share poor sequence similarities among them. In this paper, we briefly reviewed the development of the computational identification of BLPs and subsequently proposed a novel predicting framework for identifying BLPs based on the eXtreme gradient boosting algorithm (XGBoost) and using sequence-derived features. To train the models, we collected BLP data from bacteria, eukaryote, and archaea. Then, to obtain more effective prediction models, we examined the performances of different feature extraction methods and their combinations as well as classification algorithms. Finally, based on the optimal model, a novel predictor named iBLP was constructed to identify BLPs. The robustness of iBLP has been proved by experiments on training and independent datasets. Comparison with other published methods further demonstrated that the proposed method is powerful and could provide good performance for BLP identification. The webserver and software package for BLP identification are freely available at http://lin-group.cn/server/iBLP.
12
Yuan Z, Ye X, Zhu L, Zhang N, An Z, Zheng WJ. Virome assembly and annotation in brain tissue based on next-generation sequencing. Cancer Med 2020; 9:6776-6790. [PMID: 32738030 PMCID: PMC7520322 DOI: 10.1002/cam4.3325] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 12/20/2019] [Revised: 06/20/2020] [Accepted: 07/01/2020] [Indexed: 12/15/2022]
Abstract
Glioblastoma multiforme (GBM) is one of the deadliest tumors. It has been speculated that viruses play a role in GBM, but the evidence is controversial. Published research is mainly limited to studies on the presence of human cytomegalovirus (HCMV) in GBM. No comprehensive assessment of the brain virome, the collection of viral material in the brain, based on recently sequenced data has been performed. Here, we characterized the virome from 111 GBM samples and 57 normal brain samples from eight projects in the SRA database by a tested and comprehensive assembly approach. The annotation of the assembled contigs showed that most viral sequences in the brain belong to the viral family Retroviridae. In some GBM samples, we also detected the full genome sequence of a novel picornavirus recently discovered in invertebrates. Unlike previous reports, our study did not detect herpes virus such as HCMV in GBM from the data we used. However, some contigs that cannot be annotated with any known genes exhibited antibody epitopes in their sequences. These findings provide several avenues for potential cancer therapy: the newly discovered picornavirus could be a starting point to engineer a novel oncolytic virus; and the exhibited antibody epitopes could be a source to explore potential drug targets for immune cancer therapy. By characterizing the virosphere in GBM and normal brain at a global level, the results from this study strengthen the link between GBM and viral infection, which warrants further investigation.
Affiliation(s)
- Zihao Yuan
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Texas Therapeutics Institute, Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
- Xiaohua Ye
- Texas Therapeutics Institute, Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
- Lisha Zhu
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Ningyan Zhang
- Texas Therapeutics Institute, Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
- Zhiqiang An
- Texas Therapeutics Institute, Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX, USA
- W. Jim Zheng
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
13
Sun N, Dong R, Pei S, Yin C, Yau SST. A New Method Based on Coding Sequence Density to Cluster Bacteria. J Comput Biol 2020; 27:1688-1698. [PMID: 32392428 DOI: 10.1089/cmb.2019.0509] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 12/31/2022]
Abstract
Bacterial evolution is an important field of study, and biological sequences are often used to construct phylogenetic relationships. Multiple sequence alignment is very time-consuming and cannot deal with large scales of bacterial genome sequences in a reasonable time. Hence, a new mathematical method, the joining density vector method, is proposed to cluster bacteria, which characterizes the features of the coding sequence (CDS) in a DNA sequence. Coding sequences carry genetic information that can synthesize proteins. The correspondence between a genomic sequence and its joining density vector (JDV) is one-to-one. The JDV reflects the statistical characteristics of a genomic sequence, and large amounts of data can be analyzed using this new approach. We apply the novel method to do phylogenetic analysis on four bacterial data sets at hierarchies of genus and species. The phylogenetic trees prove that our new method accurately describes the evolutionary relationships of bacterial coding sequences, and is faster than ClustalW and the existing alignment-free methods.
Collapse
Affiliation(s)
- Nan Sun
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
| | - Rui Dong
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
| | - Shaojun Pei
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
| | - Changchuan Yin
- Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, Illinois, USA
| | - Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
| |
|
14
|
Farkaš T, Sitarčík J, Brejová B, Lucká M. SWSPM: A Novel Alignment-Free DNA Comparison Method Based on Signal Processing Approaches. Evol Bioinform Online 2019; 15:1176934319849071. [PMID: 31210725 PMCID: PMC6545658 DOI: 10.1177/1176934319849071] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/12/2019] [Accepted: 04/12/2019] [Indexed: 11/16/2022] Open
Abstract
Computing similarity between two nucleotide sequences is one of the fundamental problems in bioinformatics. Current methods are based mainly on two major approaches: (1) sequence alignment, which is computationally expensive, and (2) faster, but less accurate, alignment-free methods based on various statistical summaries, for example, short word counts. We propose a new distance measure based on mathematical transforms from the domain of signal processing. To tolerate large-scale rearrangements in the sequences, the transform is computed across sliding windows. We compare our method on several data sets with current state-of-the-art alignment-free methods. Our method compares favorably in terms of accuracy and outperforms other methods in running time and memory requirements. In addition, it is massively scalable up to dozens of processing units without the loss of performance due to communication overhead. Source files and sample data are available at https://bitbucket.org/fiitstubioinfo/swspm/src.
Affiliation(s)
- Tomáš Farkaš
- Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, Bratislava, Slovakia
| | - Jozef Sitarčík
- Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, Bratislava, Slovakia
| | - Broňa Brejová
- Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
| | - Mária Lucká
- Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, Bratislava, Slovakia
| |
|
15
|
Dong R, He L, He RL, Yau SST. A Novel Approach to Clustering Genome Sequences Using Inter-nucleotide Covariance. Front Genet 2019; 10:234. [PMID: 31024610 PMCID: PMC6465635 DOI: 10.3389/fgene.2019.00234] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 09/07/2018] [Accepted: 03/04/2019] [Indexed: 11/30/2022] Open
Abstract
Classification of DNA sequences is an important problem in bioinformatics, yet most existing methods for phylogenetic analysis, including Multiple Sequence Alignment (MSA), are time-consuming and computationally expensive. Alignment-free methods are now popular, but the manual intervention they require usually decreases accuracy. In addition, most methods neglect the interactions among nucleotides. Here we propose a new Accumulated Natural Vector (ANV) method, which represents each DNA sequence by a point in ℝ^18. By calculating the Accumulated Indicator Functions of the nucleotides, we obtain an Accumulated Natural Vector for each sequence. This vector not only captures the distribution of each nucleotide but also provides the covariance among nucleotides. Thus global comparison of DNA sequences or genomes can be done easily in ℝ^18. Tests of ANV on datasets of different sizes and types demonstrate the accuracy and time-efficiency of the proposed method.
Affiliation(s)
- Rui Dong
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
| | - Lily He
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
| | - Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, IL, United States
| | - Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
| |
|
16
|
An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes. PLoS One 2018; 13:e0206409. [PMID: 30427878 PMCID: PMC6235296 DOI: 10.1371/journal.pone.0206409] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Received: 07/24/2018] [Accepted: 10/14/2018] [Indexed: 01/11/2023] Open
Abstract
For many disease-causing virus species, global diversity is clustered into a taxonomy of subtypes with clinical significance. In particular, the classification of infections among the subtypes of human immunodeficiency virus type 1 (HIV-1) is a routine component of clinical management, and there are now many classification algorithms available for this purpose. Although several of these algorithms are similar in accuracy and speed, the majority are proprietary and require laboratories to transmit HIV-1 sequence data over the network to remote servers. This potentially exposes sensitive patient data to unauthorized access, and makes it impossible to determine how classifications are made and to maintain the data provenance of clinical bioinformatic workflows. We propose an open-source supervised and alignment-free subtyping method (Kameris) that operates on k-mer frequencies in HIV-1 sequences. We performed a detailed study of the accuracy and performance of subtype classification in comparison to four state-of-the-art programs. Based on our testing data set of manually curated real-world HIV-1 sequences (n = 2,784), Kameris obtained an overall accuracy of 97%, which matches or exceeds all other tested software, with a processing rate of over 1,500 sequences per second. Furthermore, our fully standalone general-purpose software provides key advantages in terms of data security and privacy, transparency and reproducibility. Finally, we show that our method is readily adaptable to subtype classification of other viruses including dengue, influenza A, and hepatitis B and C virus.
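As an illustration of the k-mer frequency representation mentioned in this abstract, the following is a minimal sketch and not the Kameris implementation. The function names `kmer_frequencies` and `euclidean`, the default `k=3`, and the choice of Euclidean distance are assumptions made for the example; the abstract does not specify them.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k=3, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector of seq.

    The vector has len(alphabet)**k entries in lexicographic k-mer order.
    k-mers containing letters outside the alphabet (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy sequences, for illustration only.
d = euclidean(kmer_frequencies("ACGTACGTAC"), kmer_frequencies("ACGTTTGTAC"))
```

A supervised subtyping method would feed such vectors to a classifier; that step is omitted here because the abstract does not say which classifier is used.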
|
17
|
Phylogenetic analysis of protein sequences based on a novel k-mer natural vector method. Genomics 2018; 111:1298-1305. [PMID: 30195069 DOI: 10.1016/j.ygeno.2018.08.010] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 07/25/2018] [Revised: 08/19/2018] [Accepted: 08/27/2018] [Indexed: 11/22/2022]
Abstract
Based on the k-mer model for protein sequences, a novel k-mer natural vector method is proposed to characterize the features of k-mers in a protein sequence, taking into account both the numbers and the distributions of the k-mers. It is proved that the relationship between a protein sequence and its k-mer natural vector is one-to-one. Phylogenetic analysis of protein sequences can therefore be performed easily without requiring evolutionary models or human intervention. However, there is no criterion for choosing a suitable k, and k strongly influences both the results and the computational complexity. In this paper, a compound k-mer natural vector is therefore used to quantify each protein sequence. The results of phylogenetic analysis on three protein datasets demonstrate that the new method describes the evolutionary relationships of proteins precisely and greatly improves computing efficiency.
|
18
|
Adetiba E, Olugbara OO, Taiwo TB, Adebiyi MO, Badejo JA, Akanle MB, Matthews VO. Alignment-Free Z-Curve Genomic Cepstral Coefficients and Machine Learning for Classification of Viruses. Bioinformatics and Biomedical Engineering 2018. [PMCID: PMC7120486 DOI: 10.1007/978-3-319-78723-7_25] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
Abstract
Accurate detection of pathogenic viruses has become imperative because viral diseases constitute a huge threat to human health and wellbeing on a global scale. However, both traditional and recent techniques for viral detection suffer from various setbacks. In addition, some of the existing alignment-free methods are limited with respect to viral detection accuracy. In this paper, we present the development of an alignment-free, digital signal processing based method for pathogenic viral detection named Z-Curve Genomic Cepstral Coefficients (ZCGCC). To evaluate the method, ZCGCC were computed from twenty-six pathogenic viral strains extracted from the ViPR corpus. A naïve Bayesian classifier, a popular machine learning method, was experimentally trained and validated using the extracted ZCGCC and other alignment-free methods in the literature. Comparative results show that the proposed ZCGCC gives good accuracy (93.0385%) and improved performance over existing alignment-free methods.
|
19
|
Dong R, Zheng H, Tian K, Yau SC, Mao W, Yu W, Yin C, Yu C, He RL, Yang J, Yau SS. Virus Database and Online Inquiry System Based on Natural Vectors. Evol Bioinform Online 2017; 13:1176934317746667. [PMID: 29308007 PMCID: PMC5751915 DOI: 10.1177/1176934317746667] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 07/30/2017] [Accepted: 10/05/2017] [Indexed: 01/09/2023] Open
Abstract
We construct a virus database called VirusDB (http://yaulab.math.tsinghua.edu.cn/VirusDB/) and an online inquiry system to serve people who are interested in viral classification and prediction. The database stores all viral genomes, their corresponding natural vectors, and the classification information of the single/multiple-segmented viral reference sequences downloaded from the National Center for Biotechnology Information. The online inquiry system computes natural vectors and their distances for submitted genomes, provides an online interface for accessing and using the database for viral classification and prediction, and runs back-end processes for automatic and manual updating of database content to synchronize with GenBank. Submitted genome data in FASTA format are processed, and the prediction results, with the 5 closest neighbors and their classifications, are returned by email. Considering the one-to-one correspondence between sequence and natural vector, time efficiency, and high accuracy, the natural vector is a significant advance compared with alignment methods, which makes VirusDB a useful database for further research.
Affiliation(s)
- Rui Dong
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
| | - Hui Zheng
- Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, IL, USA
| | - Kun Tian
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
| | - Shek-Chung Yau
- Information Technology Services Center, The Hong Kong University of Science and Technology, Kowloon, Hong Kong
| | - Weiguang Mao
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
| | - Wenping Yu
- College of Computer and Control Engineering, Nankai University, Tianjin, China
| | - Changchuan Yin
- Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, IL, USA
| | - Chenglong Yu
- Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA, Australia; School of Medicine, Flinders University, Adelaide, SA, Australia
| | - Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, IL, USA
| | - Jie Yang
- Department of Mathematics, Statistics, and Computer Science, The University of Illinois at Chicago, Chicago, IL, USA
| | - Stephen St Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
| |
|
20
|
Abstract
With the sharp increase in biological sequence data, traditional sequence alignment methods become unsuitable and infeasible. This has motivated a surge of fast alignment-free techniques for sequence analysis. Among these, many kinds of feature vector methods have been established and applied to the reconstruction of species phylogeny. The vectors consist of typical numerical features for certain biological problems. The features may come from the primary sequences or from the secondary or three-dimensional structures of macromolecules. In this study, we propose a novel numerical vector, based only on the primary sequences of organisms, to build their phylogeny. Three chemical and physical properties of primary sequences (purine, pyrimidine and keto) are also incorporated into the vector. Using each property, we convert the nucleotide sequence into a new sequence consisting of only two kinds of letters, so that three sequences are constructed according to the three properties. For each letter of each sequence we calculate the number of occurrences of the letter, the average position of the letter, and the variation of the positions at which the letter appears in the sequence. Tested on several datasets related to mammals, viruses and bacteria, this new tool is fast and accurate for inferring the phylogeny of organisms.
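The construction described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the membership sets for the three properties (purine = A/G, pyrimidine = C/T, keto = G/T), the "1"/"0" two-letter encoding, the 1-based positions, and the use of plain population variance for the "variation" are all guesses, since the abstract does not define them.

```python
# Assumed membership sets; the abstract only names the three properties.
PROPERTY_SETS = {
    "purine": set("AG"),
    "pyrimidine": set("CT"),
    "keto": set("GT"),
}


def letter_features(binary_seq, letter):
    """Count, mean position and position variance of one letter (1-based)."""
    positions = [i + 1 for i, c in enumerate(binary_seq) if c == letter]
    n = len(positions)
    if n == 0:
        return [0, 0.0, 0.0]
    mean = sum(positions) / n
    var = sum((p - mean) ** 2 for p in positions) / n  # population variance (assumption)
    return [n, mean, var]


def feature_vector(seq):
    """Return 3 properties x 2 letters x 3 features = 18 numbers."""
    seq = seq.upper()
    vec = []
    for members in PROPERTY_SETS.values():
        # Two-letter sequence: "1" if the base has the property, else "0".
        binary = "".join("1" if c in members else "0" for c in seq)
        for letter in "10":
            vec.extend(letter_features(binary, letter))
    return vec


vec = feature_vector("AGCT")  # hypothetical toy input
```

Vectors built this way can be compared with any vector distance to produce a distance matrix for tree building; which distance the study used is not stated in the abstract.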
|
21
|
He L, Li Y, He RL, Yau SST. A novel alignment-free vector method to cluster protein sequences. J Theor Biol 2017; 427:41-52. [DOI: 10.1016/j.jtbi.2017.06.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 02/17/2017] [Revised: 05/04/2017] [Accepted: 06/02/2017] [Indexed: 11/29/2022]
|
22
|
Li Y, He L, He RL, Yau SST. Zika and Flaviviruses Phylogeny Based on the Alignment-Free Natural Vector Method. DNA Cell Biol 2016; 36:109-116. [PMID: 27977308 DOI: 10.1089/dna.2016.3532] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022] Open
Abstract
Zika virus (ZIKV) is a mosquito-borne flavivirus. It was first isolated in Uganda in 1947 and has been emerging since 2007. However, because of the inconsistency of alignment methods, the evolution of ZIKV remains poorly understood. In this study, we first use the complete protein and an alignment-free method to build a phylogenetic tree of 87 Zika strains in which Asian, East African, and West African lineages are characterized. We also use the NS5 protein to construct the genetic relationship among 44 Zika strains. For the first time, these strains are divided into two clades: African 1 and African 2. This result suggests that ZIKV originated in Africa and then spread to Asia, the Pacific islands, and throughout the Americas. We also perform phylogenetic analysis of 53 viruses in the genus Flavivirus, to which ZIKV belongs, using complete proteins. Our conclusion is consistent with the classification by hosts and transmission vectors.
Affiliation(s)
- Yongkun Li
- Department of Mathematical Sciences, Tsinghua University, Beijing, People's Republic of China
| | - Lily He
- Department of Mathematical Sciences, Tsinghua University, Beijing, People's Republic of China
| | - Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, Illinois
| | - Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing, People's Republic of China
| |
|
23
|
Hou W, Pan Q, Peng Q, He M. A new method to analyze protein sequence similarity using Dynamic Time Warping. Genomics 2016; 109:123-130. [PMID: 27974244 PMCID: PMC7125777 DOI: 10.1016/j.ygeno.2016.12.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 08/12/2016] [Revised: 12/06/2016] [Accepted: 12/10/2016] [Indexed: 12/05/2022]
Abstract
Sequence similarity analysis is one of the major topics in bioinformatics. It helps researchers reveal the evolutionary relationships of different species. In this paper, we outline a new method to analyze the similarity of proteins using the Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW). The original symbol sequences are converted to numerical sequences according to their physicochemical properties. We obtain the power spectra of the sequences from the DFT and extend the spectra to the same length to calculate the distance between different sequences by DTW. Our method is tested on different datasets and the results are compared with those of other software. In the comparison we find that our scheme can correct some wrong classifications that appear in other software. The comparison shows that our approach is reasonable and effective.
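The two building blocks named in this abstract, a DFT power spectrum and a DTW distance, can be sketched as below. This is a minimal illustration, not the authors' pipeline: it starts from sequences that are already numerical (the property used for the conversion is not given in the abstract), it uses a naive O(n^2) DFT, and it skips the spectrum-extension step because the abstract does not say how it is done and classic DTW already accepts inputs of different lengths. The function names and the absolute-difference local cost are assumptions.

```python
import cmath
import math


def power_spectrum(x):
    """Naive O(n^2) DFT power spectrum |X[k]|^2 for k = 0..n-1."""
    n = len(x)
    spectrum = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(s) ** 2)
    return spectrum


def dtw_distance(a, b):
    """Classic dynamic time warping with absolute-difference local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


# Hypothetical numerical sequences of different lengths, for illustration only.
dist = dtw_distance(power_spectrum([1.0, 2.0, 0.5, 3.0]),
                    power_spectrum([1.0, 2.5, 0.5, 3.0, 1.5]))
```

For real protein lengths an FFT routine would replace the naive DFT; the quadratic version is used here only to keep the sketch dependency-free.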
Affiliation(s)
- Wenbing Hou
- School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
| | - Qiuhui Pan
- School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian 116024, PR China; School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
| | - Qianying Peng
- Department of Academics, Dalian Naval Academy, Dalian 116001, PR China
| | - Mingfeng He
- School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
| |
|