1
Cahuantzi R, Lythgoe KA, Hall I, Pellis L, House T. Unsupervised identification of significant lineages of SARS-CoV-2 through scalable machine learning methods. Proc Natl Acad Sci U S A 2024; 121:e2317284121. [PMID: 38478692 PMCID: PMC10962941 DOI: 10.1073/pnas.2317284121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 02/05/2024] [Indexed: 03/21/2024]
Abstract
Since its emergence in late 2019, SARS-CoV-2 has diversified into a large number of lineages and caused multiple waves of infection globally. Novel lineages have the potential to spread rapidly and internationally if they have higher intrinsic transmissibility and/or can evade host immune responses, as has been seen with the Alpha, Delta, and Omicron variants of concern. They can also cause increased mortality and morbidity if they have increased virulence, as was seen for Alpha and Delta. Phylogenetic methods provide the "gold standard" for representing the global diversity of SARS-CoV-2 and to identify newly emerging lineages. However, these methods are computationally expensive, struggle when datasets get too large, and require manual curation to designate new lineages. These challenges provide a motivation to develop complementary methods that can incorporate all of the genetic data available without down-sampling to extract meaningful information rapidly and with minimal curation. In this paper, we demonstrate the utility of using algorithmic approaches based on word-statistics to represent whole sequences, bringing speed, scalability, and interpretability to the construction of genetic topologies. While not serving as a substitute for current phylogenetic analyses, the proposed methods can be used as a complementary, and fully automatable, approach to identify and confirm new emerging variants.
Affiliation(s)
- Roberto Cahuantzi
  - Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom
  - United Kingdom Health Security Agency, University of Oxford, Oxford OX3 7LF, United Kingdom
- Katrina A. Lythgoe
  - Department of Biology, University of Oxford, Oxford OX1 3SZ, United Kingdom
  - Big Data Institute, University of Oxford, Oxford OX3 7LF, United Kingdom
  - Pandemic Sciences Institute, University of Oxford, Oxford OX3 7LF, United Kingdom
- Ian Hall
  - Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom
- Lorenzo Pellis
  - Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom
- Thomas House
  - Department of Mathematics, The University of Manchester, Manchester M13 9PL, United Kingdom
2
Ji G, Hu G, Liu G, Bai Z, Li B, Li D, L H, Cui G. Response of soil microbes to Carex meyeriana meadow degeneration caused by overgrazing in Inner Mongolia. Acta Oecologica 2022. [DOI: 10.1016/j.actao.2022.103860] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
3
Liu J, Xia KL, Wu J, Yau SST, Wei GW. Biomolecular Topology: Modelling and Analysis. Acta Mathematica Sinica, English Series 2022; 38:1901-1938. [PMID: 36407804 PMCID: PMC9640850 DOI: 10.1007/s10114-022-2326-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/27/2022] [Accepted: 07/12/2022] [Indexed: 05/25/2023]
Abstract
With the great advancement of experimental tools, a tremendous amount of biomolecular data has been generated and accumulated in various databases. The high dimensionality, structural complexity, nonlinearity, and entanglements of biomolecular data, ranging from DNA knots, RNA secondary structures, protein folding configurations, chromosomes, DNA origami, molecular assembly, to others at the macromolecular level, pose a severe challenge in their analysis and characterization. In the past few decades, mathematical concepts, models, algorithms, and tools from algebraic topology, combinatorial topology, computational topology, and topological data analysis have demonstrated great power and begun to play an essential role in tackling the biomolecular data challenge. In this work, we introduce biomolecular topology, which concerns the topological problems and models originating from biomolecular systems. More specifically, biomolecular topology encompasses topological structures, properties and relations that emerge from biomolecular structures, dynamics, interactions, and functions. We discuss the various types of biomolecular topology from structures (of proteins, DNAs, and RNAs), protein folding, and protein assembly. A brief discussion of databanks (and databases), theoretical models, and computational algorithms is presented. Further, we systematically review related topological models, including graphs, simplicial complexes, persistent homology, persistent Laplacians, de Rham-Hodge theory, Yau-Hausdorff distance, and topology-based machine learning models.
Affiliation(s)
- Jian Liu
  - School of Mathematical Sciences, Hebei Normal University, Shijiazhuang, 050024 P. R. China
  - Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, 101408 P. R. China
- Ke-Lin Xia
  - School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, 639798 Singapore
- Jie Wu
  - Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, 101408 P. R. China
  - Department of Mathematical Sciences, Tsinghua University, Beijing, 100084 P. R. China
- Stephen Shing-Toung Yau
  - Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, 101408 P. R. China
  - Department of Mathematical Sciences, Tsinghua University, Beijing, 100084 P. R. China
- Guo-Wei Wei
  - Department of Mathematics & Department of Biochemistry and Molecular Biology & Department of Electrical and Computer Engineering, Michigan State University, Wells Hall 619 Red Cedar Road, East Lansing, MI 48824-1027 USA
4
Zhang B, Li X, Saldanha-da-Gama F. Free-Floating Bike-Sharing Systems: New Repositioning Rules, Optimization Models and Solution Algorithms. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.03.028] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/05/2022]
5
Wan X, Tan X. A protein structural study based on the centrality analysis of protein sequence feature networks. PLoS One 2021; 16:e0248861. [PMID: 33780482 PMCID: PMC8006989 DOI: 10.1371/journal.pone.0248861] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2020] [Accepted: 03/05/2021] [Indexed: 11/19/2022]
Abstract
In this paper, we use network approaches to analyze the relations between protein sequence features for the top hierarchical classes of CATH and SCOP. We use fundamental connectivity measures such as correlation (CR), normalized mutual information rate (nMIR), and transfer entropy (TE) to analyze the pairwise relationships between the protein sequence features, and use centrality measures to analyze weighted networks constructed from the relationship matrices. In the centrality analysis, we find both commonalities and differences between the different protein 3D structural classes. Results show that all top hierarchical classes of CATH and SCOP present strong non-deterministic interactions for the composition and arrangement features of Cysteine (C), Methionine (M), Tryptophan (W), and also for the arrangement features of Histidine (H). The different protein 3D structural classes present different preferences in terms of their centrality distributions and significant features.
Affiliation(s)
- Xiaogeng Wan
  - College of Mathematics and Physics, Beijing University of Chemical Technology, Beijing, China
- Xinying Tan
  - The Fourth Center of PLA General Hospital, Beijing, China
6
Wan X, Tan X. A Simple Protein Evolutionary Classification Method Based on the Mutual Relations Between Protein Sequences. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200305090055] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/22/2022]
Abstract
Background: Proteins are important organic molecules in living systems, and they vary in their sequences, structures and functions. Protein evolutionary classification is one of the popular research topics in computational bioinformatics. Many studies have used protein sequence information to classify the evolutionary relationships of proteins. As the amount of protein sequence data increases, efficient computational tools are needed to make accurate protein evolutionary classifications in the big data paradigm.
Methods: In this study, we propose a new simple and efficient computational approach based on the normalized mutual information rates to compute the relationship between protein sequences; we then use the “distances” defined on the relationships to perform the evolutionary classifications of proteins. The new method is computationally efficient, model-free and unsupervised, and does not require training data when performing classifications.
Result: Simulation studies on various examples demonstrate the efficiency of the new method. We use precision-recall curves to compare the efficiency of our new method with traditional methods; the results show that the new method outperforms the traditional methods in most cases when performing evolutionary classifications.
Conclusion: The new method is simple and proved to be efficient in protein evolutionary classifications, which is useful in future evolutionary analysis, particularly in the big data paradigm.
Affiliation(s)
- Xiaogeng Wan
  - Department of Mathematics, College of Mathematics and Physics, Beijing University of Chemical Technology, Beijing, 100029, China
- Xinying Tan
  - The Fourth Center of PLA General Hospital, Beijing, 100037, China
7
Tian K, Zhao X, Wan X, Yau SST. Amino acid torsion angles enable prediction of protein fold classification. Sci Rep 2020; 10:21773. [PMID: 33303802 PMCID: PMC7729947 DOI: 10.1038/s41598-020-78465-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 07/15/2020] [Accepted: 11/23/2020] [Indexed: 11/29/2022]
Abstract
Protein structure can provide insights that help biologists to predict and understand protein functions and interactions. However, the number of known protein structures has not kept pace with the number of protein sequences determined by high-throughput sequencing. Current techniques used to determine the structure of proteins are complex and require a lot of time to analyze the experimental results, especially for large protein molecules. The limitations of these methods have motivated us to create a new approach for protein structure prediction. Here we describe a new approach to predict protein structures and structure classes from amino acid sequences. Our prediction model performs well in comparison with previous methods when applied to the structural classification of two CATH datasets with more than 5000 protein domains. The average accuracy is 92.5% for structure classification, which is higher than that of previous research. We also used our model to predict four known protein structures with a single amino acid sequence, while many other existing methods could only obtain one possible structure for a given sequence. The results show that our method provides a new effective and reliable tool for protein structure prediction research.
Affiliation(s)
- Kun Tian
  - School of Mathematics, Renmin University of China, Beijing, 100872, People's Republic of China
- Xin Zhao
  - Department of Cryptography and Technology, Beijing Electronic Science and Technology Institute, Beijing, 100070, People's Republic of China
- Xiaogeng Wan
  - Department of Mathematical Sciences, Tsinghua University, Beijing, 100084, People's Republic of China
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, 100084, People's Republic of China
8
Qiangrong J, Guang Q. Graph kernels combined with the neural network on protein classification. J Bioinform Comput Biol 2019; 17:1950030. [PMID: 31856667 DOI: 10.1142/s0219720019500306] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
At present, most research on protein classification is based on graph kernels. The essence of graph kernels is to extract substructures and use the similarity of substructures as the kernel values. In this paper, we propose a novel graph kernel named the vertex-edge similarity kernel (VES kernel), based on a mixed matrix. Its innovation is to take the adjacency matrix of the graph as the sample vector of each vertex and to calculate kernel values by finding the most similar vertex pair of two graphs. In addition, we combine the novel kernel with a neural network, and the experimental results show that the combination is better than the existing advanced methods.
Affiliation(s)
- Jiang Qiangrong
  - Department of Computer Science, Beijing University of Technology, Beijing, P. R. China
- Qiu Guang
  - Department of Computer Science, Beijing University of Technology, Beijing, P. R. China
9
Zhao X, Tian K, He RL, Yau SST. Convex hull principle for classification and phylogeny of eukaryotic proteins. Genomics 2019; 111:1777-1784. [DOI: 10.1016/j.ygeno.2018.11.033] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 09/02/2018] [Revised: 11/25/2018] [Accepted: 11/30/2018] [Indexed: 12/11/2022]
10
Mu Z, Yu T, Qi E, Liu J, Li G. DCGR: feature extractions from protein sequences based on CGR via remodeling multiple information. BMC Bioinformatics 2019; 20:351. [PMID: 31221087 PMCID: PMC6587251 DOI: 10.1186/s12859-019-2943-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 02/01/2019] [Accepted: 06/10/2019] [Indexed: 12/01/2022]
Abstract
BACKGROUND Protein feature extraction plays an important role in the areas of similarity analysis of protein sequences and prediction of protein structures, functions and interactions. The feature extraction based on graphical representation is one of the most effective and efficient ways. However, most existing methods suffer limitations from their method design. RESULTS We introduce DCGR, a novel method for extracting features from protein sequences based on the chaos game representation, which is developed by constructing CGR curves of protein sequences according to physicochemical properties of amino acids, followed by converting the CGR curves into multi-dimensional feature vectors by using the distributions of points in CGR images. Tested on five data sets, DCGR was significantly superior to the state-of-the-art feature extraction methods. CONCLUSION The DCGR is practically powerful for extracting effective features from protein sequences, and therefore important in similarity analysis of protein sequences, study of protein-protein interactions and prediction of protein functions. It is freely available at https://sourceforge.net/projects/transcriptomeassembly/files/Feature%20Extraction .
Affiliation(s)
- Zengchao Mu
  - School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
- Ting Yu
  - School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
- Enfeng Qi
  - College of Mathematics and Statistics, Guangxi Normal University, Guilin, 541001 China
- Juntao Liu
  - School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
- Guojun Li
  - School of Mathematics, Shandong University, Jinan, 250100 Shandong Province China
11
Tian K, Zhao X, Zhang Y, Yau S. Comparing protein structures and inferring functions with a novel three-dimensional Yau-Hausdorff method. J Biomol Struct Dyn 2018; 37:4151-4160. [PMID: 30518311 DOI: 10.1080/07391102.2018.1540359] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 01/14/2023]
Abstract
Structures and functions of proteins play various essential roles in biological processes. The functions of newly discovered proteins can be predicted by comparing their structures with those of proteins of known function. Many approaches have been proposed for measuring protein structure similarity, such as the template-modeling (TM)-score method, the GRaphlet (GR)-Align method, and the commonly used root-mean-square deviation (RMSD) measures. However, alignment-based comparisons of protein structure similarity are time-consuming on large datasets, and their accuracy still has room to improve. In this study, we introduce a new three-dimensional (3D) Yau-Hausdorff distance between any two 3D objects. The (3D) Yau-Hausdorff distance can be used in particular to measure the similarity/dissimilarity of two proteins of any size and does not require aligning and superimposing the two structures. We apply structural similarity to study function similarity and perform phylogenetic analysis on several datasets. The results show that the (3D) Yau-Hausdorff distance could serve as a more precise and effective method to discover biological relationships between proteins than other structure comparison methods. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Kun Tian
  - Department of Mathematical Sciences, Tsinghua University, Beijing, P.R. China
- Xin Zhao
  - Department of Mathematical Sciences, Tsinghua University, Beijing, P.R. China
- Yuning Zhang
  - School of Life Sciences, Tsinghua University, Beijing, P.R. China
- Stephen Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, P.R. China
12
Tian K, Zhao X, Yau SST. Convex hull analysis of evolutionary and phylogenetic relationships between biological groups. J Theor Biol 2018; 456:34-40. [DOI: 10.1016/j.jtbi.2018.07.035] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 06/26/2018] [Revised: 07/23/2018] [Accepted: 07/25/2018] [Indexed: 11/28/2022]
13
Tahir M, Hayat M, Khan SA. iNuc-ext-PseTNC: an efficient ensemble model for identification of nucleosome positioning by extending the concept of Chou's PseAAC to pseudo-tri-nucleotide composition. Mol Genet Genomics 2018; 294:199-210. [PMID: 30291426 DOI: 10.1007/s00438-018-1498-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 04/30/2018] [Accepted: 09/28/2018] [Indexed: 10/28/2022]
Abstract
The nucleosome is a central element of eukaryotic chromatin, composed of histone proteins and DNA molecules. It performs vital roles in many eukaryotic intra-nuclear processes, for instance, chromatin structure and transcriptional regulation formation. Identification of nucleosome positioning in the wet lab is difficult, so attention has turned towards accurate, intelligent automated prediction. In this regard, a novel intelligent automated model, "iNuc-ext-PseTNC", is developed to identify nucleosome positioning in genomes accurately. In this predictor, DNA sequences are mathematically represented by two different discrete feature extraction techniques, namely pseudo-tri-nucleotide composition (PseTNC) and pseudo-di-nucleotide composition. Several contemporary machine learning algorithms were examined. Further, the predictions of individual classifiers were integrated through an evolutionary genetic algorithm. The success rates of the ensemble model are higher than those of the individual classifiers. Analysis of the prediction results shows that the iNuc-ext-PseTNC model achieved its best performance in combination with the PseTNC feature space, with accuracies of 94.3%, 93.14%, and 88.60% using a six-fold cross-validation test on the three benchmark datasets S1, S2, and S3, respectively. These outcomes indicate that the iNuc-ext-PseTNC model compares favorably with existing methods reported in the literature. The proposed model may be a useful and practical tool for basic academic research.
Affiliation(s)
- Muhammad Tahir
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, KP, Pakistan
- Maqsood Hayat
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, KP, Pakistan
- Sher Afzal Khan
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, KP, Pakistan
14
Cang Z, Mu L, Wei GW. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS Comput Biol 2018; 14:e1005929. [PMID: 29309403 PMCID: PMC5774846 DOI: 10.1371/journal.pcbi.1005929] [Citation(s) in RCA: 150] [Impact Index Per Article: 21.4] [Received: 09/01/2017] [Revised: 01/19/2018] [Accepted: 12/15/2017] [Indexed: 12/05/2022]
Abstract
This work introduces a number of algebraic topology approaches, including multi-component persistent homology, multi-level persistent homology, and electrostatic persistence for the representation, characterization, and description of small molecules and biomolecular complexes. In contrast to the conventional persistent homology, multi-component persistent homology retains critical chemical and biological information during the topological simplification of biomolecular geometric complexity. Multi-level persistent homology enables a tailored topological description of inter- and/or intra-molecular interactions of interest. Electrostatic persistence incorporates partial charge information into topological invariants. These topological methods are paired with Wasserstein distance to characterize similarities between molecules and are further integrated with a variety of machine learning algorithms, including k-nearest neighbors, ensemble of trees, and deep convolutional neural networks, to manifest their descriptive and predictive powers for protein-ligand binding analysis and virtual screening of small molecules. Extensive numerical experiments involving 4,414 protein-ligand complexes from the PDBBind database and 128,374 ligand-target and decoy-target pairs in the DUD database are performed to test respectively the scoring power and the discriminatory power of the proposed topological learning strategies. It is demonstrated that the present topological learning outperforms other existing methods in protein-ligand binding affinity prediction and ligand-decoy discrimination.
Affiliation(s)
- Zixuan Cang
  - Department of Mathematics, Michigan State University, East Lansing, Michigan, United States of America
- Lin Mu
  - Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, United States of America
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, Michigan, United States of America
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, United States of America
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan, United States of America
15
Zhao X, Tian K, He RL, Yau SST. Establishing the phylogeny of Prochlorococcus with a new alignment-free method. Ecol Evol 2017; 7:11057-11065. [PMID: 29299281 PMCID: PMC5743538 DOI: 10.1002/ece3.3535] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 06/07/2017] [Revised: 09/04/2017] [Accepted: 09/14/2017] [Indexed: 11/11/2022]
Abstract
Prochlorococcus marinus, one of the most abundant marine cyanobacteria in the global ocean, is classified into low-light (LL) and high-light (HL) adapted ecotypes. These two adapted ecotypes differ in their ecophysiological characteristics, especially whether adapted for growth at high-light or low-light intensities. However, some evolutionary relationships of Prochlorococcus phylogeny remain to be resolved, such as whether the strains SS120 and MIT9211 form a monophyletic group. We use the Natural Vector (NV) method to represent the sequence in order to identify the phylogeny of the Prochlorococcus. The natural vector method is alignment free without any model assumptions. This study added the covariances of amino acids in protein sequence to the natural vector method. Based on these new natural vectors, we can compute the Hausdorff distance between the two clades which represents the dissimilarity. This method enables us to systematically analyze both the dataset of ribosomal proteomes and the dataset of 16s-23s rRNA sequences in order to reconstruct the phylogeny of Prochlorococcus. Furthermore, we apply classification to inspect the relationship of SS120 and MIT9211. From the reconstructed phylogenetic trees and classification results, we may conclude that the SS120 does not cluster with MIT9211. This study demonstrates a new method for performing phylogenetic analysis. The results confirm that these two strains do not form a monophyletic clade in the phylogeny of Prochlorococcus.
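As a rough illustration of the two ingredients this abstract names, the sketch below computes a basic natural vector (count, mean position, and normalized second central moment per letter) and the ordinary Hausdorff distance between two sets of such vectors. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `natural_vector` and `hausdorff` are hypothetical, the default DNA alphabet is chosen only for brevity, and the amino-acid covariance terms that the study adds are not included.

```python
import numpy as np

def natural_vector(seq, alphabet="ACGT"):
    """Basic natural vector: for each letter, its count, mean position,
    and second central moment normalized by (count * sequence length).
    Positions are 1-based. The alphabet default is an assumption."""
    n = len(seq)
    counts, means, moments = [], [], []
    for letter in alphabet:
        pos = np.array([i + 1 for i, c in enumerate(seq) if c == letter],
                       dtype=float)
        n_k = len(pos)
        mu_k = pos.mean() if n_k else 0.0
        d2_k = ((pos - mu_k) ** 2).sum() / (n_k * n) if n_k else 0.0
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return np.array(counts + means + moments, dtype=float)

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two sets of row vectors,
    using Euclidean distance between individual vectors."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Pairwise distance matrix: d[i, j] = ||A[i] - B[j]||
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # Largest nearest-neighbour distance in either direction
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Two hypothetical "clades", each a small set of toy sequences
clade_1 = [natural_vector(s) for s in ["AACG", "AACGT"]]
clade_2 = [natural_vector(s) for s in ["TTGC", "TTGCA"]]
print(hausdorff(clade_1, clade_2))
```

Because every sequence maps to a fixed-length vector, no alignment is needed, and the distance between two groups reduces to a computation on two point sets.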
Affiliation(s)
- Xin Zhao
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Kun Tian
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Rong L He
  - Department of Biological Sciences, Chicago State University, Chicago, IL, USA
- Stephen S-T Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
16
Tahir M, Hayat M, Kabir M. Sequence based predictor for discrimination of enhancer and their types by applying general form of Chou's trinucleotide composition. Computer Methods and Programs in Biomedicine 2017; 146:69-75. [PMID: 28688491 DOI: 10.1016/j.cmpb.2017.05.008] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 08/04/2016] [Revised: 05/05/2017] [Accepted: 05/19/2017] [Indexed: 06/07/2023]
Abstract
BACKGROUND AND OBJECTIVES Enhancers are pivotal DNA elements, which are widely used in eukaryotes for activation of transcription genes. On the basis of enhancer strength, they are further classified into two groups: strong enhancers and weak enhancers. Given the large amount of available DNA sequence data, fast, reliable and robust intelligent computational methods are needed that not only identify enhancers but also determine their strength. Considerable progress has been achieved in this regard; however, timely and precise identification of enhancers is still a challenging task. METHODS A two-level intelligent computational model for identification of enhancers and their subgroups is proposed. Two different feature extraction techniques, di-nucleotide composition and tri-nucleotide composition, were adopted for extraction of numerical descriptors. Four classification methods, namely probabilistic neural network, support vector machine, k-nearest neighbor and random forest, were utilized for classification. RESULTS Using the jackknife cross-validation test, the proposed method yielded an accuracy of 77.25% for dataset S1, which contains enhancers and non-enhancers, and 64.70% for dataset S2, which comprises strong and weak enhancer sequences. CONCLUSION The predictive results indicate that the proposed method performs better than existing approaches reported in the literature. The developed method is expected to be useful for basic research and academia.
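The di-nucleotide and tri-nucleotide composition descriptors mentioned in the methods can be sketched as normalized k-mer frequencies. The code below is an illustrative sketch only, assuming plain overlapping k-mer counts over the alphabet A, C, G, T; the names `kmer_composition` and `feature_vector` are hypothetical and are not taken from the paper, whose exact feature formulation may differ.

```python
from itertools import product

def kmer_composition(seq, k):
    """Normalized frequencies of all 4**k DNA k-mers in seq,
    counted over overlapping windows."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
            total += 1
    return {w: (c / total if total else 0.0) for w, c in counts.items()}

def feature_vector(seq):
    """Concatenate di-nucleotide (16) and tri-nucleotide (64) frequencies."""
    di = kmer_composition(seq, 2)
    tri = kmer_composition(seq, 3)
    return list(di.values()) + list(tri.values())

features = feature_vector("ACGTAC")
print(len(features))  # 16 + 64 = 80 descriptors
```

A fixed-length vector of this kind can then be passed to any of the classifiers the abstract lists, regardless of the length of the input sequence.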
Affiliation(s)
- Muhammad Tahir
  - Department of Computer Science, Abdul Wali Khan University Mardan, KP Pakistan
- Maqsood Hayat
  - Department of Computer Science, Abdul Wali Khan University Mardan, KP Pakistan
- Muhammad Kabir
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
17
Wan X, Zhao X, Yau SST. An information-based network approach for protein classification. PLoS One 2017; 12:e0174386. [PMID: 28350835 PMCID: PMC5370107 DOI: 10.1371/journal.pone.0174386] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/23/2016] [Accepted: 03/08/2017] [Indexed: 11/25/2022]
Abstract
Protein classification is one of the critical problems in bioinformatics. Early studies used geometric distances and phylogenetic trees to classify proteins. These methods use binary trees to present protein classification. In this paper, we propose a new protein classification method, whereby theories of information and networks are used to classify the multivariate relationships of proteins. In this study, the protein universe is modeled as an undirected network, where proteins are classified according to their connections. Our method is unsupervised, multivariate, and alignment-free. It can be applied to the classification of both protein sequences and structures. Nine examples are used to demonstrate the efficiency of our new method.
Affiliation(s)
- Xiaogeng Wan
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Xin Zhao
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
- Stephen S. T. Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing, China
18
Zhang G, Dai M, Yang L, Li W, Li H, Xu C, Shi X, Dong X, Fu F. Fast detection and data compensation for electrodes disconnection in long-term monitoring of dynamic brain electrical impedance tomography. Biomed Eng Online 2017; 16:7. [PMID: 28086909 PMCID: PMC5234124 DOI: 10.1186/s12938-016-0294-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 08/28/2016] [Accepted: 12/04/2016] [Indexed: 11/18/2022]
Abstract
Background: Electrode disconnection is a common occurrence during long-term monitoring of brain electrical impedance tomography (EIT) in clinical settings. The data acquisition system then suffers substantial data loss, which causes image reconstruction to fail. The aim of this study was to (1) detect disconnected electrodes and (2) compensate for the invalid data. Methods: A weighted correlation coefficient was calculated for each electrode based on the measurement differences between well-connected and disconnected electrodes. Disconnected electrodes were identified by filtering out abnormal coefficients with discrete wavelet transforms. Previously valid measurements were then used to establish a grey model, and the invalid frames after electrode disconnection were replaced with the data estimated by the grey model. The proposed approach was evaluated on a resistor phantom and with eight patients in clinical settings. Results: The proposed method detected 1 or 2 disconnected electrodes with an accuracy of 100%, and 3 and 4 disconnected electrodes with accuracies of 92% and 84%, respectively. The time cost of electrode detection was within 0.018 s. The method was also able to compensate at least 60 subsequent frames of data and restore normal image reconstruction within 0.4 s, with a mean relative error smaller than 0.01%. Conclusions: We proposed a two-step approach to detect multiple disconnected electrodes and to compensate the invalid frames of data after disconnection. Our method can detect more disconnected electrodes with higher accuracy than methods proposed in previous studies. It also provides estimates during the faulty measurement period until medical staff reconnect the electrodes. This work should improve the clinical practicability of dynamic brain EIT and contribute to its wider adoption.
Affiliation(s)
- Ge Zhang, Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Meng Dai, Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Lin Yang, Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Weichen Li, Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Haoting Li, Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Canhua Xu, Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Xuetao Shi, Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Xiuzhen Dong, Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
- Feng Fu, Department of Biomedical Engineering, Fourth Military Medical University, Xi'an, China
|
19
|
Li Y, Tian K, Yin C, He RL, Yau SST. Virus classification in 60-dimensional protein space. Mol Phylogenet Evol 2016; 99:53-62. [PMID: 26988414 DOI: 10.1016/j.ympev.2016.03.009] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 06/14/2015] [Revised: 01/24/2016] [Accepted: 03/10/2016] [Indexed: 10/22/2022]
Abstract
Due to vast sequence divergence among different viral groups, sequence alignment is not directly applicable to genome-wide comparative analysis of viruses. More and more attention has been paid to alignment-free methods for whole genome comparison and phylogenetic tree reconstruction. Among alignment-free methods, the recently proposed "Natural Vector (NV) representation" has successfully been used to study the phylogeny of multi-segmented viruses based on a 12-dimensional genome space derived from the nucleotide sequence structure. However, whether proteomes are preferable to genomes for determining viral phylogeny has not been investigated in depth. As the translated products of genes, proteins directly form the shape of viral structure and are vital for all metabolic pathways. In this study, using the NV representation of a protein sequence along with the Hausdorff distance, which is suitable for comparing point sets, we construct a 60-dimensional protein space to analyze the evolutionary relationships of 4021 viruses by whole proteomes in the current NCBI Reference Sequence Database (RefSeq). We also take advantage of the previously developed natural graphical representation to recover viral phylogeny. Our results demonstrate that the proposed method is efficient and accurate for classifying viruses. The accuracy rates of our predictions, for example for Baltimore II viruses, are as high as 95.9% for family labels, 95.7% for subfamily labels and 96.5% for genus labels. Finally, we find that proteomes lead to better viral classification when reliable protein sequences are abundant. In other cases, the accuracy rates using proteomes are still comparable to those of genomes.
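As an illustration only, the 60-dimensional natural vector of one protein sequence might be computed as in the sketch below. It assumes the commonly used NV definition: for each of the 20 amino acids, the count, the mean position, and the second central moment of the positions normalized by count times sequence length. The function name `natural_vector`, the alphabet ordering, the 1-based positions, and the zero-fill for absent amino acids are choices made for this sketch and are not taken from the abstract.

```python
# Hypothetical helper: a sketch of a 60-dimensional natural vector,
# assuming the (count, mean position, normalized 2nd moment) definition.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering is an arbitrary choice here


def natural_vector(seq, alphabet=AMINO_ACIDS):
    """Return counts + mean positions + normalized second central moments."""
    n = len(seq)
    counts, means, moments = [], [], []
    for letter in alphabet:
        # 1-based positions at which this letter occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == letter]
        n_k = len(positions)
        if n_k == 0:
            # Assumption: absent letters contribute zeros to all three parts
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


if __name__ == "__main__":
    v = natural_vector("ACA")
    print(len(v), v[0], v[20], v[40])
```

Under this definition a proteome becomes a set of such 60-dimensional points, one per protein, which is the kind of object a point-set distance can then compare.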
Affiliation(s)
- Yongkun Li, Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
- Kun Tian, Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
- Changchuan Yin, Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, Chicago, IL 60607-7045, USA
- Rong Lucy He, Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
- Stephen S-T Yau, Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
|
20
|
Zhao X, Wan X, He RL, Yau SST. A new method for studying the evolutionary origin of the SAR11 clade marine bacteria. Mol Phylogenet Evol 2016; 98:271-9. [PMID: 26926946 DOI: 10.1016/j.ympev.2016.02.015] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 10/03/2015] [Revised: 02/18/2016] [Accepted: 02/18/2016] [Indexed: 12/14/2022]
Abstract
The free-living SAR11 clade is a globally abundant group of oceanic Alphaproteobacteria, with small genome sizes and rich genomic A+T content. However, the taxonomy of SAR11 has become controversial recently. Some researchers argue that the position of SAR11 is a sister group to Rickettsiales. Other researchers advocate that SAR11 is located within free-living lineages of Alphaproteobacteria. Here, we use the natural vector representation method to identify the evolutionary origin of the SAR11 clade. This alignment-free method does not depend on any model assumptions. With this approach, the correspondence between proteome sequences and their natural vectors is one-to-one. After fixing a set of proteins, each bacterium is represented by a set of vectors. The Hausdorff distance is then used to compute the dissimilarity distance between two bacteria. The phylogenetic tree can be reconstructed based on these distances. Using our method, we systematically analyze four data sets of alphaproteobacterial proteomes in order to reconstruct the phylogeny of Alphaproteobacteria. From this we can see that the phylogenetic position of the SAR11 group is within a group of other free-living lineages of Alphaproteobacteria.
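The distance step described in this abstract, in which each organism is a set of vectors and two sets are compared with the Hausdorff distance, can be sketched as follows. This is a minimal brute-force version that assumes Euclidean distance between vectors and non-empty sets. The function name `hausdorff` and the toy points are illustrative only and do not reproduce the authors' implementation or data.

```python
import math


def hausdorff(set_a, set_b):
    """Symmetric Hausdorff distance between two non-empty sets of points.

    Assumes Euclidean distance between points; the metric actually used
    in a given study may differ.
    """
    def directed(xs, ys):
        # Largest distance from a point in xs to its nearest point in ys
        return max(min(math.dist(x, y) for y in ys) for x in xs)

    return max(directed(set_a, set_b), directed(set_b, set_a))


if __name__ == "__main__":
    # Toy 2-D points standing in for per-protein vectors
    organism_a = [(0.0, 0.0), (1.0, 0.0)]
    organism_b = [(0.0, 0.0), (4.0, 0.0)]
    print(hausdorff(organism_a, organism_b))
```

Computing this value for every pair of organisms gives a distance matrix from which a tree can be built. The brute-force loop costs on the order of |A| x |B| distance evaluations per pair, which is acceptable for a sketch but may need optimizing for large proteomes.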
Affiliation(s)
- Xin Zhao, Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
- Xiaogeng Wan, Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
- Rong L He, Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
- Stephen S-T Yau, Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China
|