1
|
Hayes WB. Exact p-values for global network alignments via combinatorial analysis of shared GO terms : REFANGO: Rigorous Evaluation of Functional Alignments of Networks using Gene Ontology. J Math Biol 2024; 88:50. [PMID: 38551701 PMCID: PMC10980677 DOI: 10.1007/s00285-024-02058-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2020] [Revised: 01/21/2024] [Accepted: 02/05/2024] [Indexed: 04/01/2024]
Abstract
Network alignment aims to uncover topologically similar regions in the protein-protein interaction (PPI) networks of two or more species under the assumption that topologically similar regions tend to perform similar functions. Although there exist a plethora of both network alignment algorithms and measures of topological similarity, currently no "gold standard" exists for evaluating how well either is able to uncover functionally similar regions. Here we propose a formal, mathematically and statistically rigorous method for evaluating the statistical significance of shared GO terms in a global, 1-to-1 alignment between two PPI networks. Given an alignment in which k aligned protein pairs share a particular GO term g, we use a combinatorial argument to precisely quantify the p-value of that alignment with respect to g compared to a random alignment. The p-value of the alignment with respect to all GO terms, including their inter-relationships, is approximated using the Empirical Brown's Method. We note that, just as with BLAST's p-values, this method is not designed to guide an alignment algorithm towards a solution; instead, just as with BLAST, an alignment is guided by a scoring matrix or function; the p-values herein are computed after the fact, providing independent feedback to the user on the biological quality of the alignment that was generated by optimizing the scoring function. Importantly, we demonstrate that among all GO-based measures of network alignments, ours is the only one that correlates with the precision of GO annotation predictions, paving the way for network alignment-based protein function prediction.
Collapse
Affiliation(s)
- Wayne B Hayes
- Department of Computer Science, UC Irvine, Irvine, USA.
| |
Collapse
|
2
|
Cross-attention PHV: Prediction of human and virus protein-protein interactions using cross-attention-based neural networks. Comput Struct Biotechnol J 2022; 20:5564-5573. [PMID: 36249566 PMCID: PMC9546503 DOI: 10.1016/j.csbj.2022.10.012] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2022] [Revised: 10/05/2022] [Accepted: 10/05/2022] [Indexed: 11/30/2022] Open
Abstract
Cross-attention PHV implements two key technologies: cross-attention mechanism and 1D-CNN. It accurately predicts PPIs between human and unknown influenza viruses/SARS-CoV-2. It extracts critical taxonomic and evolutionary differences responsible for PPI prediction.
Viral infections represent a major health concern worldwide. The alarming rate at which SARS-CoV-2 spreads, for example, led to a worldwide pandemic. Viruses incorporate genetic material into the host genome to hijack host cell functions such as the cell cycle and apoptosis. In these viral processes, protein–protein interactions (PPIs) play critical roles. Therefore, the identification of PPIs between humans and viruses is crucial for understanding the infection mechanism and host immune responses to viral infections and for discovering effective drugs. Experimental methods including mass spectrometry-based proteomics and yeast two-hybrid assays are widely used to identify human-virus PPIs, but these experimental methods are time-consuming, expensive, and laborious. To overcome this problem, we developed a novel computational predictor, named cross-attention PHV, by implementing two key technologies of the cross-attention mechanism and a one-dimensional convolutional neural network (1D-CNN). The cross-attention mechanisms were very effective in enhancing prediction and generalization abilities. Application of 1D-CNN to the word2vec-generated feature matrices reduced computational costs, thus extending the allowable length of protein sequences to 9000 amino acid residues. Cross-attention PHV outperformed existing state-of-the-art models using a benchmark dataset and accurately predicted PPIs for unknown viruses. Cross-attention PHV also predicted human–SARS-CoV-2 PPIs with area under the curve values >0.95. The Cross-attention PHV web server and source codes are freely available at https://kurata35.bio.kyutech.ac.jp/Cross-attention_PHV/ and https://github.com/kuratahiroyuki/Cross-Attention_PHV, respectively.
Collapse
Key Words
- 1D-CNN, One-dimensional-CNN
- AC, Accuracy
- AUC, Area under the curve
- CNN, Convolutional neural network
- Convolutional neural network
- DT, Decision tree
- F1, F1-score
- HV-PPIs, Human-virus PPIs
- HuV-PPI, Human–unknown virus PPI
- Human
- LR, Linear regression
- MCC, Matthews correlation coefficient
- PPIs, Protein-protein interactions
- Protein–protein interaction
- RF, Random forest
- SARS-CoV-2
- SARS-CoV-2, Severe acute respiratory syndrome coronavirus 2
- SN, Sensitivity
- SP, Specificity
- SVM, Support vector machine
- T-SNE, T-distributed stochastic neighbor embedding
- Virus
- W2V, Word2vec
- Word2vec
Collapse
|
3
|
Wang S, Chen X, Frederisy BJ, Mbakogu BA, Kanne AD, Khosravi P, Hayes WB. On the current failure-but bright future-of topology-driven biological network alignment. ADVANCES IN PROTEIN CHEMISTRY AND STRUCTURAL BIOLOGY 2022; 131:1-44. [PMID: 35871888 DOI: 10.1016/bs.apcsb.2022.05.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]
Abstract
Since the function of a protein is defined by its interaction partners, and since we expect similar interaction patterns across species, the alignment of protein-protein interaction (PPI) networks between species, based on network topology alone, should uncover functionally related proteins across species. Surprisingly, despite the publication of more than fifty algorithms aimed at performing PPI network alignment, few have demonstrated a statistically significant link between network topology and functional similarity, and none have demonstrated that orthologs can be recovered using network topology alone. We find that the major contributing factors to this surprising failure are: (i) edge densities in most currently available experimental PPI networks are demonstrably too low to expect topological network alignment to succeed; (ii) in the few cases where the edge densities are high enough, some measures of topological similarity easily uncover functionally similar proteins while others do not; and (iii) most network alignment algorithms to date perform poorly at optimizing even their own topological objective functions, hampering their ability to use topology effectively. We demonstrate that SANA-the Simulated Annealing Network Aligner-significantly outperforms existing aligners at optimizing their own objective functions, even achieving near-optimal solutions when the optimal solution is known. We offer the first demonstration of global network alignments based on topology alone that align functionally similar proteins with p-values in some cases below 10-300. We predict that topological network alignment has a bright future as edge densities increase toward the value where good alignments become possible. We demonstrate that when enough common topology is present at high enough edge densities-for example in the recent, partly synthetic networks of the Integrated Interaction Database-topological network alignment easily recovers most orthologs, paving the way toward high-throughput functional prediction based on topology-driven network alignment.
Collapse
Affiliation(s)
- Siyue Wang
- Department of Computer Science, University of California, Irvine, CA, United States
| | - Xiaoyin Chen
- Department of Computer Science, University of California, Irvine, CA, United States
| | - Brent J Frederisy
- Department of Computer Science, University of California, Irvine, CA, United States
| | - Benedict A Mbakogu
- Department of Computer Science, University of California, Irvine, CA, United States
| | - Amy D Kanne
- Department of Computer Science, University of California, Irvine, CA, United States
| | - Pasha Khosravi
- Department of Computer Science, University of California, Irvine, CA, United States
| | - Wayne B Hayes
- Department of Computer Science, University of California, Irvine, CA, United States.
| |
Collapse
|
4
|
Pairwise Biological Network Alignment Based on Discrete Bat Algorithm. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:5548993. [PMID: 34777564 PMCID: PMC8580637 DOI: 10.1155/2021/5548993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/19/2021] [Revised: 09/29/2021] [Accepted: 10/13/2021] [Indexed: 11/18/2022]
Abstract
The development of high-throughput technology has provided a reliable technical guarantee for an increased amount of available data on biological networks. Network alignment is used to analyze these data to identify conserved functional network modules and understand evolutionary relationships across species. Thus, an efficient computational network aligner is needed for network alignment. In this paper, the classic bat algorithm is discretized and applied to the network alignment. The bat algorithm initializes the population randomly and then searches for the optimal solution iteratively. Based on the bat algorithm, the global pairwise alignment algorithm BatAlign is proposed. In BatAlign, the individual velocity and the position are represented by a discrete code. BatAlign uses a search algorithm based on objective function that uses the number of conserved edges as the objective function. The similarity between the networks is used to initialize the population. The experimental results showed that the algorithm was able to match proteins with high functional consistency and reach a relatively high topological quality.
Collapse
|
5
|
Mahdipour E, Ghasemzadeh M. The protein-protein interaction network alignment using recurrent neural network. Med Biol Eng Comput 2021; 59:2263-2286. [PMID: 34529185 DOI: 10.1007/s11517-021-02428-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2021] [Accepted: 08/05/2021] [Indexed: 11/29/2022]
Abstract
The main challenge of biological network alignment is that the problem of finding the alignments in two graphs is NP-hard. The discovery of protein-protein interaction (PPI) networks is of great importance in bioinformatics due to their utilization in identifying the cellular pathways, finding new medicines, and disease recognition. In this regard, we describe the network alignment method in the form of a classification problem for the very first time and introduce a deep network that finds the alignment of nodes present in the two networks. We call this method RENA, which means Network Alignment using REcurrent neural network. The proposed solution consists of three steps; in the first phase, we obtain the sequence and topological similarities from the networks' structure. For the second phase, the dataset needed for the transformation of the problem into a classification problem is created from obtained features. In the third phase, we predict the nodes' alignment between two networks using deep learning. We used Biogrid dataset for RENA evaluation. The RENA method is compared with three classification approaches of support vector machine, K-nearest neighbors, and linear discriminant analysis. The experimental results demonstrate the efficiency of the RENA method and 100% accuracy in PPI network alignment prediction.
Collapse
Affiliation(s)
- Elham Mahdipour
- Computer Engineering Department at Khavaran Institute of Higher Education, Mashhad, Iran.
| | | |
Collapse
|
6
|
SSA: Subset sum approach to protein β-sheet structure prediction. Comput Biol Chem 2021; 94:107552. [PMID: 34390958 DOI: 10.1016/j.compbiolchem.2021.107552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2020] [Revised: 07/21/2021] [Accepted: 07/27/2021] [Indexed: 11/22/2022]
Abstract
The three-dimensional structures of proteins provide their functions and incorrect folding of its β-strands can be the cause of many diseases. There are two major approaches for determining protein structures: computational prediction and experimental methods that employ technologies such as Cryo-electron microscopy. Due to experimental methods's high costs, extended wait times for its lengthy processes, and incompleteness of results, computational prediction is an attractive alternative. As the focus of the present paper, β-sheet structure prediction is a major portion of overall protein structure prediction. Prediction of other substructures, such as α-helices, is simpler with lower computational time complexities. Brute force methods are the most common approach and dynamic programming is also utilized to generate all possible conformations. The current study introduces the Subset Sum Approach (SSA) for the direct search space generation method, which is shown to outperform the dynamic programming approach in terms of both time and space. For the first time, the present work has calculated both the state space cardinality of the dynamic programming approach and the search space cardinality of the general brute force approaches. In regard to a set of pruning rules, SSA has demonstrated higher efficiency with respect to both time and accuracy in comparison to state-of-the-art methods.
Collapse
|
7
|
Behkamal B, Naghibzadeh M, Pagnani A, Saberi MR, Al Nasr K. Solving the α-helix correspondence problem at medium-resolution Cryo-EM maps through modeling and 3D matching. J Mol Graph Model 2020; 103:107815. [PMID: 33338845 DOI: 10.1016/j.jmgm.2020.107815] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2020] [Revised: 11/09/2020] [Accepted: 11/18/2020] [Indexed: 11/30/2022]
Abstract
Cryo-electron microscopy (cryo-EM) has recently emerged as a prominent biophysical method for macromolecular structure determination. Many research efforts have been devoted to produce cryo-EM images, density maps, at near-atomic resolution. Despite many advances in technology, the resolution of the generated density maps may not be sufficiently adequate and informative to directly construct the atomic structure of proteins. At medium-resolution (∼4-10 Å), secondary structure elements (α-helices and β-sheets) are discernible, whereas finding the correspondence of secondary structure elements detected in the density map with those on the sequence remains a challenging problem. In this paper, an automatic framework is proposed to solve α-helix correspondence problem in three-dimensional space. Through modeling of the sequence with the aid of a novel strategy, the α-helix correspondence problem is initially transformed into a complete weighted bipartite graph matching problem. An innovative correlation-based scoring function based on a well-known and robust statistical method is proposed for weighting the graph. Moreover, two local optimization algorithms, which are Greedy and Improved Greedy algorithms, have been presented to find α-helix correspondence. A widely used data set including 16 reconstructed and 4 experimental cryo-EM maps were chosen to verify the accuracy and reliability of the proposed automatic method. The experimental results demonstrate that the automatic method is highly efficient (86.25% accuracy), robust (11.3% error rate), fast (∼1.4 s), and works independently from cryo-EM skeleton.
Collapse
Affiliation(s)
- Bahareh Behkamal
- Department of Computer Engineering, Faculty of Engineering, Ferdowsi University of Mashhad, Mashhad, 9177948944, Iran.
| | - Mahmoud Naghibzadeh
- Department of Computer Engineering, Faculty of Engineering, Ferdowsi University of Mashhad, Mashhad, 9177948944, Iran.
| | - Andrea Pagnani
- Department of Applied Science and Technology (DISAT), Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino, Italy; Italian Institute for Genomic Medicine (IIGM), IRCC-Candiolo, Candiolo, TO, Italy; INFN Sezione di Torino, Via P. Giuria 1, Torino, Italy
| | - Mohammad Reza Saberi
- Medicinal Chemistry Department, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad, Iran; Bioinformatics Research group, Mashhad University of Medical Sciences, Mashhad, Iran
| | - Kamal Al Nasr
- Department of Computer Science, Tennessee State University, Nashville, TN, 37209, USA
| |
Collapse
|
8
|
Khorsand B, Savadi A, Naghibzadeh M. SARS-CoV-2-human protein-protein interaction network. INFORMATICS IN MEDICINE UNLOCKED 2020; 20:100413. [PMID: 32838020 PMCID: PMC7425553 DOI: 10.1016/j.imu.2020.100413] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2020] [Revised: 07/11/2020] [Accepted: 08/10/2020] [Indexed: 12/13/2022] Open
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the novel coronavirus which caused the coronavirus disease 2019 pandemic and infected more than 12 million victims and resulted in over 560,000 deaths in 213 countries around the world. Having no symptoms in the first week of infection increases the rate of spreading the virus. The increasing rate of the number of infected individuals and its high mortality necessitates an immediate development of proper diagnostic methods and effective treatments. SARS-CoV-2, similar to other viruses, needs to interact with the host proteins to reach the host cells and replicate its genome. Consequently, virus-host protein-protein interaction (PPI) identification could be useful in predicting the behavior of the virus and the design of antiviral drugs. Identification of virus-host PPIs using experimental approaches are very time consuming and expensive. Computational approaches could be acceptable alternatives for many preliminary investigations. In this study, we developed a new method to predict SARS-CoV-2-human PPIs. Our model is a three-layer network in which the first layer contains the most similar Alphainfluenzavirus proteins to SARS-CoV-2 proteins. The second layer contains protein-protein interactions between Alphainfluenzavirus proteins and human proteins. The last layer reveals protein-protein interactions between SARS-CoV-2 proteins and human proteins by using the clustering coefficient network property on the first two layers. To further analyze the results of our prediction network, we investigated human proteins targeted by SARS-CoV-2 proteins and reported the most central human proteins in human PPI network. Moreover, differentially expressed genes of previous researches were investigated and PPIs of SARS-CoV-2-human network, the human proteins of which were related to upregulated genes, were reported.
Collapse
Affiliation(s)
- Babak Khorsand
- Department of Computer Engineering, Faculty of Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Abdorreza Savadi
- Department of Computer Engineering, Faculty of Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
| | - Mahmoud Naghibzadeh
- Department of Computer Engineering, Faculty of Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
| |
Collapse
|
9
|
Developing an ultra-efficient microsatellite discoverer to find structural differences between SARS-CoV-1 and Covid-19. INFORMATICS IN MEDICINE UNLOCKED 2020; 19:100356. [PMID: 32501423 PMCID: PMC7241407 DOI: 10.1016/j.imu.2020.100356] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2020] [Revised: 05/20/2020] [Accepted: 05/20/2020] [Indexed: 12/24/2022] Open
Abstract
Motivation Recently, the outbreak of Coronavirus-Covid-19 has forced the World Health Organization to declare a pandemic status. A genome sequence is the core of this virus which interferes with the normal activities of its counterparts within humans. Analysis of its genome may provide clues toward the proper treatment of patients and the design of new drugs and vaccines. Microsatellites are composed of short genome subsequences which are successively repeated many times in the same direction. They are highly variable in terms of their building blocks, number of repeats, and their locations in the genome sequences. This mutability property has been the source of many diseases. Usually the host genome is analyzed to diagnose possible diseases in the victim. In this research, the focus is concentrated on the attacker's genome for discovery of its malicious properties. Results The focus of this research is the microsatellites of both SARS and Covid-19. An accurate and highly efficient computer method for identifying all microsatellites in the genome sequences is discovered and implemented, and it is used to find all microsatellites in the Coronavirus-Covid-19 and SARS2003. The Microsatellite discovery is based on an efficient indexing technique called K-Mer Hash Indexing. The method is called Fast Microsatellite Discovery (FMSD) and it is used for both SARS and Covid-19. A table composed of all microsatellites is reported. There are many differences between SARS and Covid-19, but there is an outstanding difference which requires further investigation. Availability FMSD is freely available at https://gitlab.com/FUM_HPCLab/fmsd_project, implemented in C on Linux-Ubuntu system. Software related contact: hossein_savari@mail.um.ac.ir.
Collapse
|