1
Zhang J, Deng L, Deng L. Protein structural domain-disease association prediction based on heterogeneous networks. BMC Genomics 2025; 23:869. [PMID: 40211147 PMCID: PMC11987217 DOI: 10.1186/s12864-024-11117-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2021] [Accepted: 12/02/2024] [Indexed: 04/12/2025] Open
Abstract
BACKGROUND: Domains can be viewed as portable units of protein structure, folding, function, evolution, and design. Small proteins are often composed of a single domain, while most large proteins consist of multiple domains that together achieve composite cellular functions. Dysfunction in a domain may impair protein function in some diseases. Inferring disease-related domains will aid our understanding of the mechanisms of complex human diseases. RESULTS: In this study, we first build a global heterogeneous information network from structure-based domains, proteins, and diseases. The topological features of the network are then extracted according to the meta-paths between domain and disease nodes. Finally, we train a binary classifier based on the XGBoost (eXtreme Gradient Boosting) algorithm to predict potential associations between domains and diseases. The results show that the XGBoost-based binary classification model performs significantly better than models using other machine learning algorithms, achieving an AUC (area under the curve) score of 0.94 in the leave-one-out cross-validation experiment. CONCLUSIONS: We develop a method that builds a binary classifier on meta-path-based topological features to predict potential associations between domains and diseases. Its predictive performance on independent test sets indicates that the method is effective. Moreover, representing domains and diseases by integrating more multi-omic data will further improve predictive performance.
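To make the idea of meta-path features concrete, the following is a minimal sketch that counts path instances between a domain node and a disease node in a toy heterogeneous network. The node identifiers, the two meta-paths chosen, and the function name `metapath_counts` are assumptions made for illustration; the abstract does not specify which meta-paths the authors use, and the classifier step is omitted here.

```python
# Toy heterogeneous network; every identifier below is made up for illustration.
domain_protein = {"d1": {"p1", "p2"}, "d2": {"p3"}}
protein_protein = {"p1": {"p3"}, "p2": set(), "p3": {"p1"}}
protein_disease = {"p1": {"disA"}, "p2": {"disA"}, "p3": {"disB"}}


def metapath_counts(domain, disease):
    """Count instances of two hypothetical meta-paths from a domain to a disease.

    Returns a tuple (n_dps, n_dpps):
      n_dps  = number of paths domain -> protein -> disease
      n_dpps = number of paths domain -> protein -> protein -> disease
    """
    n_dps = 0
    n_dpps = 0
    for p in domain_protein.get(domain, ()):
        # Direct path through a protein that contains the domain.
        if disease in protein_disease.get(p, ()):
            n_dps += 1
        # One extra hop through an interacting protein.
        for q in protein_protein.get(p, ()):
            if disease in protein_disease.get(q, ()):
                n_dpps += 1
    return (n_dps, n_dpps)


features = metapath_counts("d1", "disA")
```

Each (domain, disease) pair yields one such count vector, which could then serve as the input row of a binary classifier.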
Affiliation(s)
- Jingpu Zhang
  - School of Computer and Data Science, Henan University of Urban Construction, 467000, Pingdingshan, China
- Lianping Deng
  - School of Computer Science and Engineering, Central South University, 410075, Changsha, China
- Lei Deng
  - School of Computer Science and Engineering, Central South University, 410075, Changsha, China
2
Mughal F, Caetano-Anollés G. Evolution of intrinsic disorder in the structural domains of viral and cellular proteomes. Sci Rep 2025; 15:2878. [PMID: 39843714 PMCID: PMC11754631 DOI: 10.1038/s41598-025-86045-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2024] [Accepted: 01/07/2025] [Indexed: 01/24/2025] Open
Abstract
Intrinsically disordered regions are flexible regions that complement the typical structured regions of proteins. Little is known however about their evolution. Here we leverage a comparative and evolutionary genomics approach to analyze intrinsic disorder in the structural domains of thousands of proteomes. Our analysis revealed that viral and cellular proteomes employ similar strategies to increase disorder but achieve different goals. Viral proteomes evolve disorder for economy of genomic material and multifunctionality. On the other hand, cellular proteomes evolve disorder to advance functionality with increasing genomic complexity. Remarkably, phylogenomic analysis of intrinsic disorder showed that ancient domains were ordered and that disorder evolved as a benefit acquired later in evolution. Evolutionary chronologies of domains indexed with disorder levels and distributions across Archaea, Bacteria, Eukarya and viruses revealed six evolutionary phases, the oldest two harboring only ordered and moderate disorder domains. A biphasic spectrum of disorder versus proteome makeup captured the dichotomy in the evolutionary trajectories of viral and cellular ancestors, one following reductive evolution driven by viral spread of molecular wealth and the other following expansive evolutionary trends to advance functionality through massive domain-forming co-option of disordered loop regions.
Affiliation(s)
- Fizza Mughal
  - Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois, Urbana, IL, 61801, USA
- Gustavo Caetano-Anollés
  - Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois, Urbana, IL, 61801, USA
  - C.R. Woese Institute for Genomic Biology, University of Illinois, Urbana, IL, 61801, USA
3
Mughal F, Caetano-Anollés G. Evolution of Intrinsic Disorder in Protein Loops. Life (Basel) 2023; 13:2055. [PMID: 37895436 PMCID: PMC10608553 DOI: 10.3390/life13102055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2023] [Revised: 10/08/2023] [Accepted: 10/10/2023] [Indexed: 10/29/2023] Open
Abstract
Intrinsic disorder accounts for the flexibility of protein loops, molecular building blocks that are largely responsible for the processes and molecular functions of the living world. While loops likely represent early structural forms that served as intermediates in the emergence of protein structural domains, their origin and evolution remain poorly understood. Here, we conduct a phylogenomic survey of disorder in loop prototypes sourced from the ArchDB classification. Tracing prototypes associated with protein fold families along an evolutionary chronology revealed that ancient prototypes tended to be more disordered than their derived counterparts, with ordered prototypes developing later in evolution. This highlights the central evolutionary role of disorder and flexibility. While mean disorder increased with time, a minority of ordered prototypes exist that emerged early in evolutionary history, possibly driven by the need to preserve specific molecular functions. We also revealed the percolation of evolutionary constraints from higher to lower levels of organization. Percolation resulted in trade-offs between flexibility and rigidity that impacted prototype structure and geometry. Our findings provide a deep evolutionary view of the link between structure, disorder, flexibility, and function, as well as insights into the evolutionary role of intrinsic disorder in loops and their contribution to protein structure and function.
Affiliation(s)
- Fizza Mughal
  - Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois, Urbana, IL 61801, USA
- Gustavo Caetano-Anollés
  - Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois, Urbana, IL 61801, USA
  - C.R. Woese Institute for Genomic Biology, University of Illinois, Urbana, IL 61801, USA
4
Mughal F, Nasir A, Caetano-Anollés G. The origin and evolution of viruses inferred from fold family structure. Arch Virol 2020; 165:2177-2191. [PMID: 32748179 PMCID: PMC7398281 DOI: 10.1007/s00705-020-04724-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 04/02/2020] [Accepted: 05/30/2020] [Indexed: 12/16/2022]
Abstract
The canonical frameworks of viral evolution describe viruses as cellular predecessors, reduced forms of cells, or entities that escaped cellular control. The discovery of giant viruses has changed these standard paradigms. Their genetic, proteomic and structural complexities resemble those of cells, prompting a redefinition and reclassification of viruses. In a previous genome-wide analysis of the evolution of structural domains in proteomes, with domains defined at the fold superfamily level, we found the origins of viruses intertwined with those of ancient cells. Here, we extend these data-driven analyses to the study of fold families confirming the co-evolution of viruses and ancient cells and the genetic ability of viruses to foster molecular innovation. The results support our suggestion that viruses arose by genomic reduction from ancient cells and validate a co-evolutionary ‘symbiogenic’ model of viral origins.
Affiliation(s)
- Fizza Mughal
  - Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Illinois Informatics Institute, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Arshan Nasir
  - Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM, USA
  - Department of Biosciences, COMSATS University Islamabad, Islamabad, Pakistan
- Gustavo Caetano-Anollés
  - Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois at Urbana-Champaign, Urbana, IL, USA
  - Illinois Informatics Institute, University of Illinois at Urbana-Champaign, Urbana, IL, USA
5
Deng L, Zhong G, Liu C, Luo J, Liu H. MADOKA: an ultra-fast approach for large-scale protein structure similarity searching. BMC Bioinformatics 2019; 20:662. [PMID: 31870277 PMCID: PMC6929402 DOI: 10.1186/s12859-019-3235-1] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 11/11/2019] [Accepted: 11/14/2019] [Indexed: 01/22/2023] Open
Abstract
Background: Protein comparative analysis and similarity searches play essential roles in structural bioinformatics. A couple of algorithms for protein structure alignment have been developed in recent years. However, given the rapid growth of protein structure data, improving overall comparison performance and running efficiency with massive sequences remains challenging. Results: Here, we propose MADOKA, an ultra-fast approach for massive structural neighbor searching using a novel two-phase algorithm. Initially, we apply a fast alignment between pairwise structures. Then, we employ a score to select the pairs with higher similarity and carry out a more accurate fragment-based residue-level alignment on them. MADOKA performs about 6–100 times faster than existing methods, including TM-align and SAL, in massive alignments. Moreover, the quality of structural alignment of MADOKA is better than that of the existing algorithms in terms of TM-score and number of aligned residues. We also develop a web server to search structural neighbors in the PDB database (about 360,000 protein chains in total), with additional features such as 3D structure alignment visualization. The MADOKA web server is freely available at: http://madoka.denglab.org/ Conclusions: MADOKA is an efficient approach to search for protein structure similarity. In addition, we provide a parallel implementation of MADOKA that exploits the massive power of multi-core CPUs.
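The two-phase idea described above (a cheap filter followed by a costlier comparison on the surviving pairs) can be sketched generically as follows. This is not MADOKA's algorithm: the structures are represented as hypothetical secondary-structure strings, the coarse score is a 3-letter fragment overlap, and the fine score is a longest-common-subsequence ratio, all chosen here only as simple stand-ins.

```python
def coarse_score(a, b):
    """Cheap filter (a stand-in): Jaccard overlap of 3-letter fragments."""
    ka = {a[i:i + 3] for i in range(len(a) - 2)}
    kb = {b[i:i + 3] for i in range(len(b) - 2)}
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def fine_score(a, b):
    """Costlier comparison (a stand-in): LCS length divided by the longer length."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / max(len(a), len(b))


def two_phase_search(query, database, threshold=0.2):
    """Run fine_score only on entries that pass the coarse filter, then rank them."""
    hits = []
    for name, seq in database.items():
        if coarse_score(query, seq) >= threshold:
            hits.append((name, fine_score(query, seq)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical secondary-structure strings (H = helix, E = strand, L = loop, C = coil).
database = {"same": "HHHHEEEELLLL", "near": "HHHHEEEELLHH", "far": "CCCCCCCCCCCC"}
results = two_phase_search("HHHHEEEELLLL", database)
```

The speed-up in such a scheme comes from the filter discarding most candidates before the expensive step runs; the `threshold` value here is arbitrary.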
Affiliation(s)
- Lei Deng
  - School of Computer Science and Engineering, Central South University, Changsha, 410075, China
- Guolun Zhong
  - School of Computer Science and Engineering, Central South University, Changsha, 410075, China
- Chenzhe Liu
  - School of Computer Science and Engineering, Central South University, Changsha, 410075, China
- Judong Luo
  - Department of Radiation Oncology, the Affiliated Changzhou No.2 People's Hospital of Nanjing Medical University, Changzhou, China
- Hui Liu
  - Lab of Information Management, Changzhou University, Changzhou, 213164, China
6
Abstract
BACKGROUND: A collection of disease-associated data contributes to the study of associations between diseases. Discovering closely related diseases plays a crucial role in revealing their common pathogenic mechanisms. This might further suggest treatments that can be repurposed from one disease to another. Over the past decades, a number of approaches for calculating disease similarity have been developed. However, most of them are designed to take advantage of a single or a few data sources, which limits their accuracy. METHODS: In this paper, we propose a novel method, called MultiSourcDSim, to calculate disease similarity by integrating multiple data sources, namely, gene-disease associations, GO biological process-disease associations and symptom-disease associations. First, we establish three disease similarity networks, one for each of the three disease-related data sources. Second, the representation of each node is obtained by integrating the three small disease similarity networks. Finally, the learned representations are used to calculate the similarity between diseases. RESULTS: Our approach shows the best performance compared with three other popular methods. In addition, the similarity network built by MultiSourcDSim suggests that our method can also uncover latent relationships between diseases. CONCLUSIONS: MultiSourcDSim is an efficient approach to predict similarity between diseases.
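As a rough illustration of combining several data sources into one disease similarity, the sketch below computes a cosine similarity per source from association sets and averages the three scores. The averaging step is a deliberately naive stand-in for the representation learning described in the abstract, and the disease, gene, process, and symptom identifiers are made up.

```python
import math

# Hypothetical associations per data source (all identifiers are made up).
sources = {
    "gene": {"disA": {"g1", "g2"}, "disB": {"g2", "g3"}},
    "process": {"disA": {"bp1"}, "disB": {"bp1"}},
    "symptom": {"disA": {"s1"}, "disB": {"s2"}},
}


def cosine(a, b):
    """Cosine similarity between two sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def integrated_similarity(d1, d2):
    """Average the per-source cosine similarities (a naive integration stand-in)."""
    scores = [
        cosine(assoc.get(d1, set()), assoc.get(d2, set()))
        for assoc in sources.values()
    ]
    return sum(scores) / len(scores)


similarity = integrated_similarity("disA", "disB")
```

In this toy data the two diseases share one of two genes, share their only biological process, and share no symptom, so the per-source scores differ and the average reflects all three views.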
Affiliation(s)
- Lei Deng
  - School of Computer Science and Engineering, Central South University, Changsha, 410075 China
- Danyi Ye
  - School of Computer Science and Engineering, Central South University, Changsha, 410075 China
- Junmin Zhao
  - School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan, 467000 China
- Jingpu Zhang
  - School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan, 467000 China
7
Deng L, Wang J, Zhang J. Predicting Gene Ontology Function of Human MicroRNAs by Integrating Multiple Networks. Front Genet 2019; 10:3. [PMID: 30761178 PMCID: PMC6361788 DOI: 10.3389/fgene.2019.00003] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 11/02/2018] [Accepted: 01/07/2019] [Indexed: 12/15/2022] Open
Abstract
MicroRNAs (miRNAs) have been demonstrated to play significant biological roles in many human biological processes. Inferring the functions of miRNAs is an important strategy for understanding disease pathogenesis at the molecular level. In this paper, we propose an integrated model, PmiRGO, to infer the gene ontology (GO) functions of miRNAs by integrating multiple data sources, including the expression profiles of miRNAs, miRNA-target interactions, and protein-protein interactions (PPI). PmiRGO starts by building a global network consisting of three networks. Then, it employs DeepWalk to learn latent representations as network features of the global heterogeneous network. Finally, the SVM-based models are applied to label the GO terms of miRNAs. The experimental results show that PmiRGO has a significantly better performance than existing state-of-the-art methods in terms of Fmax. A case study further demonstrates the feasibility of PmiRGO to annotate the potential functions of miRNAs.
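DeepWalk learns node representations from truncated random walks over a graph. The sketch below shows only that first step, generating a uniform random walk on a small graph; the embedding and SVM stages are omitted. The graph, node names, and the function name `random_walk` are hypothetical and are not taken from PmiRGO.

```python
import random

# Toy undirected graph mixing miRNA and gene nodes (identifiers are made up).
graph = {
    "mirA": {"geneX", "geneY"},
    "mirB": {"geneY"},
    "geneX": {"mirA", "geneY"},
    "geneY": {"mirA", "mirB", "geneX"},
}


def random_walk(graph, start, length, rng):
    """Return a uniform random walk with at most `length` nodes, starting at `start`."""
    walk = [start]
    while len(walk) < length:
        # Sort neighbors so a seeded generator gives reproducible walks.
        neighbors = sorted(graph[walk[-1]])
        if not neighbors:
            break  # Dead end: stop the walk early.
        walk.append(rng.choice(neighbors))
    return walk


walk = random_walk(graph, "mirA", 5, random.Random(0))
```

Many such walks, started from every node, form the "sentences" from which a skip-gram style model would then learn the latent node vectors.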
Affiliation(s)
- Lei Deng
  - School of Software, Central South University, Changsha, China
- Jiacheng Wang
  - School of Software, Central South University, Changsha, China
- Jingpu Zhang
  - School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan, China
8
Integrating Multiple Interaction Networks for Gene Function Inference. Molecules 2018; 24:molecules24010030. [PMID: 30577643 PMCID: PMC6337127 DOI: 10.3390/molecules24010030] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 11/21/2018] [Revised: 12/19/2018] [Accepted: 12/20/2018] [Indexed: 01/17/2023] Open
Abstract
In the past few decades, the amount and variety of available genomic and proteomic data have increased dramatically. Molecular or functional interaction networks are usually constructed from high-throughput data, and the topological structure of these interaction networks provides a wealth of information for inferring the function of genes or proteins. Analyzing such association networks is a widely used way to mine functional information about genes or proteins. However, how to combine multiple heterogeneous networks to achieve more accurate predictions remains an urgent and unresolved challenge. In this paper, we present a method named ReprsentConcat to improve function inference by integrating multiple interaction networks. The low-dimensional representation of each node in each network is extracted; these representations from multiple networks are then concatenated and fed to gcForest, which augments feature vectors by cascading and automatically determines the number of cascade levels. We experimentally compare ReprsentConcat with a state-of-the-art method, showing that it achieves competitive results on the yeast and human datasets. Moreover, it is robust to hyperparameters, including the number of dimensions.
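The concatenation step described above can be sketched in a few lines. The embeddings here are hand-written toy vectors, the function name `concat_representations` is hypothetical, and zero-filling for a gene missing from one network is an assumption of this sketch rather than something the abstract states; the gcForest stage is omitted.

```python
def concat_representations(node, embeddings, dims):
    """Concatenate one node's vectors from several networks into one feature vector.

    embeddings: list of dicts mapping node -> vector, one dict per network.
    dims: vector length used by each network, in the same order.
    A node absent from a network contributes zeros (an assumption of this sketch).
    """
    vec = []
    for net_emb, dim in zip(embeddings, dims):
        vec.extend(net_emb.get(node, [0.0] * dim))
    return vec


# Toy per-network embeddings for two hypothetical genes.
network_a = {"g1": [0.1, 0.2]}
network_b = {"g1": [0.3], "g2": [0.9]}

g1_features = concat_representations("g1", [network_a, network_b], [2, 1])
g2_features = concat_representations("g2", [network_a, network_b], [2, 1])
```

Every gene ends up with a vector of the same total length (here 3), which is what a downstream classifier needs.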