1
Huang J, Dai Q, Yao Y, He PA. A Generalized Iterative Map for Analysis of Protein Sequences. Comb Chem High Throughput Screen 2020; 25:381-391. [PMID: 33045963 DOI: 10.2174/1386207323666201012142318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2020] [Revised: 07/30/2020] [Accepted: 08/09/2020] [Indexed: 11/22/2022]
Abstract
AIM AND OBJECTIVE: Similarity comparison of biological sequences is an important task in bioinformatics. Methods for comparing biological sequences fall into two classes: sequence alignment methods and alignment-free methods. Graphical representation is an alignment-free method that provides a tool for analyzing and visualizing biological sequences. In this article, a generalized iterative map of protein sequences is proposed for analyzing the similarities of biological sequences.
MATERIALS AND METHODS: Based on the normalized physicochemical indexes of the 20 amino acids, each amino acid is mapped to a point in 5D space. A generalized iterative function system is introduced to define a generalized iterative map of protein sequences, which not only reflects various physicochemical properties of amino acids but also allows different compression ratios for the components of the map. Several properties are proved to illustrate the advantages of the generalized iterative map. A mathematical description of the generalized iterative map is used to compare the similarities and dissimilarities of protein sequences. Based on this method, similarities/dissimilarities were compared among ND5 protein sequences, as well as ND6 protein sequences, of ten different species.
RESULTS: Using correlation analysis, the ClustalW results were compared with our similarity/dissimilarity results and with the results of other graphical representations. The comparison shows that our approach correlates better with ClustalW for all species than the other approaches, which illustrates its effectiveness.
CONCLUSION: Two examples show that our method performs well in the similarity/dissimilarity analysis of protein sequences and does not require complex computation.
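The correlation analysis mentioned in this abstract can be illustrated with a minimal sketch. It assumes the comparison is a Pearson correlation between the off-diagonal entries of two pairwise distance matrices, one from the method under test and one from ClustalW; the abstract does not state which correlation coefficient was used, and the helper names `upper_triangle` and `pearson` as well as the toy matrices are hypothetical.

```python
import math


def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Made-up 3x3 distance matrices, for illustration only (not data from the paper).
method_distances = [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]]
reference_distances = [[0.0, 2.0, 4.0], [2.0, 0.0, 6.0], [4.0, 6.0, 0.0]]

r = pearson(upper_triangle(method_distances), upper_triangle(reference_distances))
```

Only the strict upper triangle is used because a distance matrix is symmetric with a zero diagonal, so the remaining entries carry no extra information.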
Affiliation(s)
- Jiahe Huang: School of Science, Zhejiang Sci-Tech University, Hangzhou, China
- Qi Dai: College of Life Science, Zhejiang Sci-Tech University, Hangzhou, China
- Yuhua Yao: School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Ping-An He: School of Science, Zhejiang Sci-Tech University, Hangzhou, China
2
Zhao Y, Xue X, Xie X. An alignment-free measure based on physicochemical properties of amino acids for protein sequence comparison. Comput Biol Chem 2019; 80:10-15. [PMID: 30851619 DOI: 10.1016/j.compbiolchem.2019.01.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/08/2018] [Revised: 12/30/2018] [Accepted: 01/17/2019] [Indexed: 01/21/2023]
Abstract
Sequence comparison is an important topic in bioinformatics. With the exponential growth in the number of biological sequences, traditional alignment-based methods for protein sequence comparison have become limiting, and many alignment-free methods have been proposed over the past two decades. In this paper, we consider not only six typical physicochemical properties of amino acids but also their frequency and positional distribution, which yields a 51-dimensional vector describing each protein sequence. We compute a pairwise distance matrix using the standardized Euclidean distance, on which discriminant analysis and phylogenetic analysis can be performed. The results on the Influenza A virus and ND5 datasets indicate that our method is accurate and efficient for classifying proteins and inferring the phylogeny of species.
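As a rough illustration of the distance step, the sketch below computes a standardized Euclidean distance matrix, in which each squared coordinate difference is divided by that coordinate's variance across the dataset. It assumes the sample variance (divisor n - 1) and skips zero-variance coordinates; the abstract specifies neither choice. The function names and the toy 2-dimensional vectors are hypothetical stand-ins for the paper's 51-dimensional descriptors.

```python
import math


def component_variances(vectors):
    """Sample variance (divisor n - 1) of each coordinate across all vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[k] for v in vectors) / n for k in range(dim)]
    return [
        sum((v[k] - means[k]) ** 2 for v in vectors) / (n - 1)
        for k in range(dim)
    ]


def standardized_euclidean(x, y, variances):
    """Euclidean distance with each squared difference scaled by 1/variance."""
    total = 0.0
    for xk, yk, var in zip(x, y, variances):
        if var > 0:  # assumption: constant coordinates carry no information
            total += (xk - yk) ** 2 / var
    return math.sqrt(total)


def distance_matrix(vectors):
    """Symmetric pairwise distance matrix over a list of feature vectors."""
    variances = component_variances(vectors)
    return [
        [standardized_euclidean(x, y, variances) for y in vectors]
        for x in vectors
    ]


# Toy feature vectors, for illustration only.
features = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
matrix = distance_matrix(features)
```

Scaling by the variance keeps a coordinate with a large numeric range from dominating the distance, which matters when a descriptor mixes frequencies, positions, and physicochemical values.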
Affiliation(s)
- Yunxiu Zhao: College of Science, Northwest A&F University, Yangling, Shaanxi 712100, PR China
- Xiaolong Xue: College of Science, Northwest A&F University, Yangling, Shaanxi 712100, PR China
- Xiaoli Xie: College of Science, Northwest A&F University, Yangling, Shaanxi 712100, PR China
3
Identifying anticancer peptides by using a generalized chaos game representation. J Math Biol 2018; 78:441-463. [PMID: 30291366 DOI: 10.1007/s00285-018-1279-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 09/25/2017] [Revised: 08/01/2018] [Indexed: 10/28/2022]
Abstract
We generalize chaos game representation (CGR) to higher-dimensional spaces while preserving its bijectivity, keeping the method sufficiently representative and mathematically rigorous compared with previous attempts. We first state and prove the asymptotic property of CGR and of our generalized chaos game representation (GCGR) method. It follows that the dissimilarity of sequences that contain identical subsequences at distinct positions is lowered exponentially in the length of the identical subsequence; this effect had gone unnoticed by researchers. By drawing attention to it, we show that the effect fundamentally supports (G)CGR as a similarity measure or feature extraction technique. We develop two feature extraction techniques: GCGR-Centroid and GCGR-Variance. We use GCGR-Centroid to analyze the similarity between protein sequences on three datasets: 9 ND5, 24 TF and 50 beta-globin proteins. The results are consistent with those of previous studies, which supports the validity of the approach. Finally, using support vector machines, we train an anticancer peptide prediction model on both GCGR-Centroid and GCGR-Variance features and achieve significantly higher prediction performance on 3 well-studied anticancer peptide datasets.
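For background, here is a minimal sketch of the classical 2D chaos game representation for DNA, not the paper's higher-dimensional GCGR for proteins. It assumes the common corner assignment A(0,0), C(0,1), G(1,1), T(1,0) and a start at the centre of the unit square. The names `cgr_points` and `centroid` are hypothetical, and the centroid here is only loosely analogous to the GCGR-Centroid feature. The example also shows the simplest case of the exponential effect the abstract describes: two sequences that end in the same k symbols finish within 2^-k of each other in each coordinate.

```python
# Assumed corner assignment for the four nucleotides (a common convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return the CGR trajectory: each step moves halfway to the symbol's corner."""
    x, y = start
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def centroid(points):
    """Mean of the trajectory points, a simple fixed-length feature."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


# Two sequences with different prefixes but the same 4-symbol suffix.
end_a = cgr_points("AAAA" + "ACGT")[-1]
end_b = cgr_points("TTTT" + "ACGT")[-1]
gap = max(abs(end_a[0] - end_b[0]), abs(end_a[1] - end_b[1]))  # at most 2**-4
```

Each shared symbol halves the gap between the two trajectories, which is why a common suffix of length k bounds the final gap by 2^-k regardless of what came before.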
4
Hou W, Pan Q, Peng Q, He M. A new method to analyze protein sequence similarity using Dynamic Time Warping. Genomics 2016; 109:123-130. [PMID: 27974244 PMCID: PMC7125777 DOI: 10.1016/j.ygeno.2016.12.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 08/12/2016] [Revised: 12/06/2016] [Accepted: 12/10/2016] [Indexed: 12/05/2022]
Abstract
Sequence similarity analysis is one of the major topics in bioinformatics. It helps researchers reveal the evolutionary relationships of different species. In this paper, we outline a new method to analyze the similarity of proteins using the Discrete Fourier Transform (DFT) and Dynamic Time Warping (DTW). The original symbol sequences are converted to numerical sequences according to their physicochemical properties. We obtain the power spectra of the sequences from the DFT and extend the spectra to the same length to calculate the distance between different sequences by DTW. Our method is tested on different datasets and the results are compared with those of other software algorithms. In the comparison, we find that our scheme can correct some wrong classifications that appear in other software. The comparison shows that our approach is reasonable and effective. We propose a novel method to extract the features of the sequences based on the physicochemical properties of proteins. We apply the DFT and DTW to analyze the similarity of proteins. Different datasets are used to demonstrate our model's effectiveness.
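The pipeline described in this abstract (numeric encoding, DFT power spectrum, DTW distance) can be sketched as follows. The sketch makes several assumptions: it encodes residues with the Kyte-Doolittle hydropathy scale because the abstract does not name the physicochemical properties used, it uses absolute difference as the local DTW cost, and it omits the paper's step of extending spectra to a common length, since plain DTW already accepts inputs of different lengths. All function names are hypothetical.

```python
import cmath

# Kyte-Doolittle hydropathy values, used here only as an assumed encoding.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def to_numeric(seq):
    """Map a protein string to a list of hydropathy values."""
    return [HYDROPATHY[aa] for aa in seq]


def power_spectrum(signal):
    """Squared magnitude of each DFT coefficient (direct O(n^2) evaluation)."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def dtw_distance(a, b):
    """Classic dynamic-programming DTW with absolute-difference local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[len(a)][len(b)]


def protein_distance(seq_a, seq_b):
    """DTW distance between the power spectra of two encoded proteins."""
    return dtw_distance(
        power_spectrum(to_numeric(seq_a)),
        power_spectrum(to_numeric(seq_b)),
    )
```

For example, `protein_distance("MKTAYIAK", "MKTAYIAKQR")` compares two short made-up peptides of different lengths. For real sequences a fast Fourier transform would replace the quadratic DFT loop.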
Affiliation(s)
- Wenbing Hou: School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
- Qiuhui Pan: School of Innovation and Entrepreneurship, Dalian University of Technology, Dalian 116024, PR China; School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
- Qianying Peng: Department of Academics, Dalian Naval Academy, Dalian 116001, PR China
- Mingfeng He: School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
5
El-Lakkani A, Lashin M. An efficient method for measuring the similarity of protein sequences. SAR QSAR Environ Res 2016; 27:363-370. [PMID: 27103219 DOI: 10.1080/1062936x.2016.1174735] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2016] [Accepted: 04/01/2016] [Indexed: 06/05/2023]
Abstract
An accurate numerical descriptor for protein sequences is introduced. It consists of the set of all triplets of successive amino acids in the sequence, read from left to right, together with the distances between each two successive amino acids in the triplet, such that the sum of these distances does not exceed 8. This numerical descriptor combines two features: the amino acid composition and the position of each amino acid relative to other nearby amino acids. The descriptor is used to measure the similarity between protein sequences in three sets: NADH dehydrogenase subunit 5 (ND5) proteins of different species, 24 transferrin proteins from vertebrates and 12 proteins of baculoviruses. High correlation coefficients between our results and those of the ClustalW program are obtained. These values are higher than those obtained in many other related works.
Affiliation(s)
- A El-Lakkani: Faculty of Science, Department of Biophysics, Cairo University, Giza, Egypt
- M Lashin: Faculty of Science, Department of Biophysics, Cairo University, Giza, Egypt
6
Wang L, Peng H, Zheng J, Qiu Y. A new 2D graphical representation of protein sequence and its application. Int J Biomath 2015. [DOI: 10.1142/s1793524515500631] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Graphical representation is a very efficient tool for visual analysis of protein sequences. In this paper, a novel 2D graphical representation scheme is proposed on the basis of a newly introduced concept, the characteristic model of protein sequences. From the 2D graphs of protein sequences, two numerical characterizations are designed as descriptors to analyze nine ND5 protein sequences. Simulation and analysis results show that, compared with existing methods, our method is not only visual, intuitive, and simple, but also free of circuits and degeneracy. More importantly, because the storage space required by our method is constant and independent of the length of the protein sequence, it supports good visual inspection even for long protein sequences.
Affiliation(s)
- Lei Wang: College of Information Engineering, Xiangtan University, Xiangtan 411105, P. R. China
- Hui Peng: Key Laboratory of Intelligent Computing Information Processing, Ministry of Education, Xiangtan 411105, P. R. China
- Jinhua Zheng: Key Laboratory of Intelligent Computing Information Processing, Ministry of Education, Xiangtan 411105, P. R. China
- Yanzi Qiu: Key Laboratory of Intelligent Computing Information Processing, Ministry of Education, Xiangtan 411105, P. R. China
7
The Graph, Geometry and Symmetries of the Genetic Code with Hamming Metric. Symmetry (Basel) 2015. [DOI: 10.3390/sym7031211] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022] Open
8
El-Lakkani A, Mahran H. An efficient numerical method for protein sequences similarity analysis based on a new two-dimensional graphical representation. SAR QSAR Environ Res 2015; 26:125-137. [PMID: 25650529 DOI: 10.1080/1062936x.2014.995700] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 06/04/2023]
Abstract
A new two-dimensional graphical representation of protein sequences is introduced. Twenty concentric evenly spaced circles divided by n radial lines into equal divisions are selected to represent any protein sequence of length n. Each circle represents one of the different 20 amino acids, and each radial line represents a single amino acid of the protein sequence. An efficient numerical method based on the graph is proposed to measure the similarity between two protein sequences. To prove the accuracy of our approach, the method is applied to NADH dehydrogenase subunit 5 (ND5) proteins of nine different species and 24 transferrin sequences from vertebrates. High values of correlation coefficient between our results and the results of ClustalW are obtained (approximately perfect correlations). These values are higher than the values obtained in many other related works.
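The geometry described in this abstract can be sketched as a set of plane coordinates: residue i of a length-n sequence is placed where radial line i meets the circle assigned to its amino acid. The sketch assumes unit spacing between circles, evenly spaced radial lines starting at angle 0, and an alphabetical ordering of one-letter codes from the innermost circle outward; the abstract does not state the paper's ordering, and its numerical similarity measure is not reproduced here. The name `circle_points` is hypothetical.

```python
import math

# Assumed circle order (innermost first); the paper's actual order is not given.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def circle_points(seq):
    """Return one (x, y) point per residue on the 20-circle, n-line grid."""
    n = len(seq)
    points = []
    for i, aa in enumerate(seq):
        radius = AMINO_ACIDS.index(aa) + 1   # circle 1..20 for this amino acid
        angle = 2 * math.pi * i / n          # radial line i of n
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    return points


# Toy peptide, for illustration only.
example = circle_points("ACDY")
```

Because each position has its own radial line, no two residues share a point, so the plot has no overlapping segments for any sequence.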
Affiliation(s)
- A El-Lakkani: Department of Biophysics, Faculty of Science, Cairo University, Giza, Egypt
9
Qi ZH, Jin MZ, Li SL, Feng J. A protein mapping method based on physicochemical properties and dimension reduction. Comput Biol Med 2014; 57:1-7. [PMID: 25486446 DOI: 10.1016/j.compbiomed.2014.11.012] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 07/24/2014] [Revised: 11/15/2014] [Accepted: 11/19/2014] [Indexed: 01/11/2023]
Abstract
BACKGROUND: The graphical mapping of a protein sequence is more difficult than the graphical mapping of a DNA sequence because there are twenty amino acids with complicated physicochemical properties. Nevertheless, graphical mapping of protein sequences has attracted many researchers, who have proposed mapping methods based on several physicochemical properties. In this article, a simple and effective mapping method for protein sequences is developed by considering additional physicochemical properties.
METHODS: Based on 12 major physicochemical properties of amino acids and the PCA method, we propose a simple and intuitive 2D graphical mapping method for protein sequences. Next, we extract a 20D vector from the graphical mapping, which is used to characterize a protein sequence.
RESULTS: The proposed graphical mapping has three important properties: it is one-to-one, has no circuits, and visualizes well. The mapping also contains more physicochemical information. The proposed method is applied to two separate applications, and the results illustrate its utility.
DISCUSSION: To validate the proposed method, we first compare nine ND6 protein sequences. The similarity/dissimilarity matrix for the nine ND6 proteins correctly reveals their evolutionary relationships. Next, we apply the method to the cluster analysis of HA genes of influenza A (H1N1) isolates. The results are consistent with the known evolution of the H1N1 virus. The two applications further illustrate the utility of the proposed method.
Affiliation(s)
- Zhao-Hui Qi: College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei, 050043, People's Republic of China
- Meng-Zhe Jin: College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei, 050043, People's Republic of China
- Su-Li Li: College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei, 050043, People's Republic of China
- Jun Feng: College of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, Hebei, 050043, People's Republic of China
10
Bai Y, Ma T, Yao Y, Dai Q, He PA. Phylogenetic analysis of H7N9 avian influenza virus based on a novel mathematical descriptor. Biomed Res Int 2014; 2014:519787. [PMID: 25019083 PMCID: PMC4082857 DOI: 10.1155/2014/519787] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2014] [Revised: 05/13/2014] [Accepted: 05/23/2014] [Indexed: 11/23/2022]
Abstract
A new mathematical descriptor is proposed based on a 3D graphical representation. Using this method, we construct phylogenetic trees for nine proteins of the H7N9 influenza virus to analyze the origin of H7N9. The results show that the evolutionary route of H7N9 avian influenza runs from America through Europe to Asia. Furthermore, two samples collected from the environment in Nanjing and Zhejiang and one sample collected from a chicken are the sources of the H7N9 influenza virus that infected humans in China.
Affiliation(s)
- Yusheng Bai: School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Tingting Ma: School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Yuhua Yao: School of Life Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Qi Dai: School of Life Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
- Ping-an He: School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
11
El-Lakkani A, El-Sherif S. Similarity analysis of protein sequences based on 2D and 3D amino acid adjacency matrices. Chem Phys Lett 2013. [DOI: 10.1016/j.cplett.2013.10.032] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 10/26/2022]
12
Zhang YP, Ruan JS, He PA. Analyzes of the similarities of protein sequences based on the pseudo amino acid composition. Chem Phys Lett 2013. [DOI: 10.1016/j.cplett.2013.10.076] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/26/2022]
13
Jafarzadeh N, Iranmanesh A. C-curve: A novel 3D graphical representation of DNA sequence based on codons. Math Biosci 2013; 241:217-24. [DOI: 10.1016/j.mbs.2012.11.009] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Received: 06/12/2012] [Revised: 11/18/2012] [Accepted: 11/26/2012] [Indexed: 01/17/2023]