Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Hou J, Adhikari B, Cheng J. DeepSF: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics 2019;34:1295-1303. [PMID: 29228193 PMCID: PMC5905591 DOI: 10.1093/bioinformatics/btx780] [Citation(s) in RCA: 92] [Impact Index Per Article: 15.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2017] [Accepted: 12/07/2017] [Indexed: 11/30/2022] Open

For:	Hou J, Adhikari B, Cheng J. DeepSF: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics 2019;34:1295-1303. [PMID: 29228193 PMCID: PMC5905591 DOI: 10.1093/bioinformatics/btx780] [Citation(s) in RCA: 92] [Impact Index Per Article: 15.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2017] [Accepted: 12/07/2017] [Indexed: 11/30/2022] Open

Number

Cited by Other Article(s)

Asim MN, Asif T, Hassan F, Dengel A. Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models. Database (Oxford) 2025;2025:baaf027. [PMID: 40448683 DOI: 10.1093/database/baaf027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2024] [Revised: 02/06/2025] [Accepted: 03/26/2025] [Indexed: 06/02/2025]

Abstract

Protein sequence analysis examines the order of amino acids within protein sequences to unlock diverse types of a wealth of knowledge about biological processes and genetic disorders. It helps in forecasting disease susceptibility by finding unique protein signatures, or biomarkers that are linked to particular disease states. Protein Sequence analysis through wet-lab experiments is expensive, time-consuming and error prone. To facilitate large-scale proteomics sequence analysis, the biological community is striving for utilizing AI competence for transitioning from wet-lab to computer aided applications. However, Proteomics and AI are two distinct fields and development of AI-driven protein sequence analysis applications requires knowledge of both domains. To bridge the gap between both fields, various review articles have been written. However, these articles focus revolves around few individual tasks or specific applications rather than providing a comprehensive overview about wide tasks and applications. Following the need of a comprehensive literature that presents a holistic view of wide array of tasks and applications, contributions of this manuscript are manifold: It bridges the gap between Proteomics and AI fields by presenting a comprehensive array of AI-driven applications for 63 distinct protein sequence analysis tasks. It equips AI researchers by facilitating biological foundations of 63 protein sequence analysis tasks. It enhances development of AI-driven protein sequence analysis applications by providing comprehensive details of 68 protein databases. It presents a rich data landscape, encompassing 627 benchmark datasets of 63 diverse protein sequence analysis tasks. It highlights the utilization of 25 unique word embedding methods and 13 language models in AI-driven protein sequence analysis applications. It accelerates the development of AI-driven applications by facilitating current state-of-the-art performances across 63 protein sequence analysis tasks.

Collapse

Kim S, Kim MA, Kim B, Lee J, Jung SK, Kim J, Chung HY, Lee CY, Jeong S. Machine learning assessment of zoonotic potential in avian influenza viruses using PB2 segment. BMC Genomics 2025;26:395. [PMID: 40269678 PMCID: PMC12020041 DOI: 10.1186/s12864-025-11589-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2025] [Accepted: 04/09/2025] [Indexed: 04/25/2025] Open

Abstract

BACKGROUND

Influenza A virus (IAV) is a major global health threat, causing seasonal epidemics and occasional pandemics. Particularly, Influenza A viruses from avian species pose significant zoonotic threats, with PB2 adaptation serving as a critical first step in cross-species transmission. A comprehensive risk assessment framework based on PB2 sequences is necessary, which should encompass detailed analyses of specific residues and mutations while maintaining sufficient generality for application to non-PB2 segments.

RESULTS

In this study, we developed two complementary approaches: a regression-based model for accurately distinguishing among risk groups, and a SHAP-based risk assessment model for more meaningful risk analyses. For the regression-based risk models, we compared various methodologies, including tree ensemble methods, conventional regression models, and deep learning architectures. The optimized regression model, combined with SHAP value analysis, identified and ranked individual residues contributing to zoonotic potential. The SHAP-based risk model enabled intra-class analyses within the zoonotic risk assessment framework and quantified risk yields from specific mutations.

CONCLUSION

Experimental analyses demonstrated that the Random Forest regression model outperformed other models in most cases, and we validated the target value settings for risk regression through ablation studies. Our SHAP-based analysis identified key residues (271A, 627K, 591R, 588A, 292I, 684S, 684A, 81M, 199S, and 368Q) and mutations (T271A, Q368R/K, E627K, Q591R, A588T/I/V, and I292V/T) critical for zoonotic risk assessment. Using the SHAP-based risk assessment model, we found that influenza A viruses from Phasianidae showed elevated zoonotic risk scores compared to those from other avian species. Additionally, mutations I292V/T, Q368R, A588T/I, V598A/I/T, and E/V627K were identified as significant mutations in the Phasianidae. These PB2-focused quantitative methods provide a robust and generalizable framework for both rapid screening of avians' zoonotic potential and analytical quantification of risks associated with specific residues or mutations.

Collapse

Li J, Chen X, Huang H, Zeng M, Yu J, Gong X, Ye Q. $\mathcal{S}$ able: bridging the gap in protein structure understanding with an empowering and versatile pre-training paradigm. Brief Bioinform 2025;26:bbaf120. [PMID: 40163822 PMCID: PMC11957296 DOI: 10.1093/bib/bbaf120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2024] [Revised: 01/23/2025] [Accepted: 02/23/2025] [Indexed: 04/02/2025] Open

Yang D, Peng X, Zheng S, Peng S. Deep learning-based prediction of autoimmune diseases. Sci Rep 2025;15:4576. [PMID: 39920178 PMCID: PMC11806040 DOI: 10.1038/s41598-025-88477-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2024] [Accepted: 01/28/2025] [Indexed: 02/09/2025] Open

Wang D, Pourmirzaei M, Abbas UL, Zeng S, Manshour N, Esmaili F, Poudel B, Jiang Y, Shao Q, Chen J, Xu D. S-PLM: Structure-Aware Protein Language Model via Contrastive Learning Between Sequence and Structure. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2025;12:e2404212. [PMID: 39665266 PMCID: PMC11791933 DOI: 10.1002/advs.202404212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/20/2024] [Revised: 08/21/2024] [Indexed: 12/13/2024]

Wang M, Wang J, Ji J, Ma C, Wang H, He J, Song Y, Zhang X, Cao Y, Dai Y, Hua M, Qin R, Li K, Cao L. Improving compound-protein interaction prediction by focusing on intra-modality and inter-modality dynamics with a multimodal tensor fusion strategy. Comput Struct Biotechnol J 2024;23:3714-3729. [PMID: 39525082 PMCID: PMC11544084 DOI: 10.1016/j.csbj.2024.10.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2024] [Revised: 10/01/2024] [Accepted: 10/01/2024] [Indexed: 11/16/2024] Open

Qin X, Zhang L, Liu M, Liu G. PRFold-TNN: Protein Fold Recognition With an Ensemble Feature Selection Method Using PageRank Algorithm Based on Transformer. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024;21:1740-1751. [PMID: 38875077 DOI: 10.1109/tcbb.2024.3414497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/16/2024]

Satalkar V, Degaga GD, Li W, Pang YT, McShan AC, Gumbart JC, Mitchell JC, Torres MP. Generative β-hairpin design using a residue-based physicochemical property landscape. Biophys J 2024;123:2790-2806. [PMID: 38297834 PMCID: PMC11393682 DOI: 10.1016/j.bpj.2024.01.029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2023] [Revised: 12/20/2023] [Accepted: 01/25/2024] [Indexed: 02/02/2024] Open

Tan Y, Li M, Zhou Z, Tan P, Yu H, Fan G, Hong L. PETA: evaluating the impact of protein transfer learning with sub-word tokenization on downstream applications. J Cheminform 2024;16:92. [PMID: 39095917 PMCID: PMC11297785 DOI: 10.1186/s13321-024-00884-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2024] [Accepted: 07/13/2024] [Indexed: 08/04/2024] Open

Affiliation(s)

Yang Tan School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China
Mingchen Li School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China
Ziyi Zhou Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China
Pan Tan Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China
Huiqun Yu School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China.
Guisheng Fan School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China.
Liang Hong Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Science, Shanghai Jiao Tong University, Shanghai, 200240, China. Shanghai Artificial Intelligence Laboratory, Shanghai, 200240, China. Chongqing Artificial Intelligence Research Institute of Shanghai Jiao Tong University, Chongqing, 200240, China.

Collapse

Jamasb AR, Morehead A, Joshi CK, Zhang Z, Didi K, Mathis S, Harris C, Tang J, Cheng J, Liò P, Blundell TL. Evaluating Representation Learning on the Protein Structure Universe. ARXIV 2024:arXiv:2406.13864v1. [PMID: 38947934 PMCID: PMC11213157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 07/02/2024]

Nguyen VTD, Hy TS. Multimodal pretraining for unsupervised protein representation learning. Biol Methods Protoc 2024;9:bpae043. [PMID: 38983679 PMCID: PMC11233121 DOI: 10.1093/biomethods/bpae043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2024] [Revised: 05/30/2024] [Accepted: 06/12/2024] [Indexed: 07/11/2024] Open

Dotan E, Jaschek G, Pupko T, Belinkov Y. Effect of tokenization on transformers for biological sequences. Bioinformatics 2024;40:btae196. [PMID: 38608190 PMCID: PMC11055402 DOI: 10.1093/bioinformatics/btae196] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2023] [Revised: 02/20/2024] [Accepted: 04/11/2024] [Indexed: 04/14/2024] Open

Abstract

MOTIVATION

Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families.

RESULTS

We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data.

AVAILABILITY AND IMPLEMENTATION

Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.

Collapse

Ektefaie Y, Shen A, Bykova D, Marin M, Zitnik M, Farhat M. Evaluating generalizability of artificial intelligence models for molecular datasets. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.25.581982. [PMID: 38464295 PMCID: PMC10925170 DOI: 10.1101/2024.02.25.581982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/12/2024]

Valentini G, Malchiodi D, Gliozzo J, Mesiti M, Soto-Gomez M, Cabri A, Reese J, Casiraghi E, Robinson PN. The promises of large language models for protein design and modeling. FRONTIERS IN BIOINFORMATICS 2023;3:1304099. [PMID: 38076030 PMCID: PMC10701588 DOI: 10.3389/fbinf.2023.1304099] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2023] [Accepted: 11/07/2023] [Indexed: 10/16/2024] Open

Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023;24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023] Open

Yang M, Chen S, Huang Z, Gao S, Yu T, Du T, Zhang H, Li X, Liu CM, Chen S, Li H. Deep learning-enabled discovery and characterization of HKT genes in Spartina alterniflora. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2023;116:690-705. [PMID: 37494542 DOI: 10.1111/tpj.16397] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/10/2023] [Revised: 07/03/2023] [Accepted: 07/11/2023] [Indexed: 07/28/2023]

Abstract

Spartina alterniflora is a halophyte that can survive in high-salinity environments, and it is phylogenetically close to important cereal crops, such as maize and rice. It is of scientific interest to understand why S. alterniflora can live under such extremely stressful conditions. The molecular mechanism underlying its high-saline tolerance is still largely unknown. Here we investigated the possibility that high-affinity K+ transporters (HKTs), which function in salt tolerance and maintenance of ion homeostasis in plants, are responsible for salt tolerance in S. alterniflora. To overcome the imprecision and unstable of the gene screening method caused by the conventional sequence alignment, we used a deep learning method, DeepGOPlus, to automatically extract sequence and protein characteristics from our newly assemble S. alterniflora genome to identify SaHKTs. Results showed that a total of 16 HKT genes were identified. The number of S. alterniflora HKTs (SaHKTs) is larger than that in all other investigated plant species except wheat. Phylogenetically related SaHKT members had similar gene structures, conserved protein domains and cis-elements. Expression profiling showed that most SaHKT genes are expressed in specific tissues and are differentially expressed under salt stress. Yeast complementation expression analysis showed that type I members SaHKT1;2, SaHKT1;3 and SaHKT1;8 and type II members SaHKT2;1, SaHKT2;3 and SaHKT2;4 had low-affinity K+ uptake ability and that type II members showed stronger K+ affinity than rice and Arabidopsis HKTs, as well as most SaHKTs showed preference for Na+ transport. We believe the deep learning-based methods are powerful approaches to uncovering new functional genes, and the SaHKT genes identified are important resources for breeding new varieties of salt-tolerant crops.

Collapse

Affiliation(s)

Maogeng Yang State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China Nanfan Research Institute, CAAS, Sanya, Hainan, China Key Laboratory of Plant Molecular & Developmental Biology, College of Life Sciences, Yantai University, Yantai, Shandong, China
Shoukun Chen State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China Nanfan Research Institute, CAAS, Sanya, Hainan, China Hainan Yazhou Bay Seed Laboratory, Sanya, Hainan, China
Zhangping Huang State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China Nanfan Research Institute, CAAS, Sanya, Hainan, China
Shang Gao State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China Nanfan Research Institute, CAAS, Sanya, Hainan, China
Tingxi Yu State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China Nanfan Research Institute, CAAS, Sanya, Hainan, China
Tingting Du State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China Nanfan Research Institute, CAAS, Sanya, Hainan, China
Hao Zhang State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China Nanfan Research Institute, CAAS, Sanya, Hainan, China
Xiang Li State Key Laboratory of Plant Genomics and National Center for Plant Gene Research, Institute of Genetics and Developmental Biology, Innovation Academy for Seed Design, Chinese Academy of Sciences, Beijing, China
Chun-Ming Liu State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China Key Laboratory of Plant Molecular Physiology, Institute of Botany, Chinese Academy of Sciences, Beijing, China College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China School of Advanced Agricultural Sciences, Peking University, Beijing, China
Shihua Chen Key Laboratory of Plant Molecular & Developmental Biology, College of Life Sciences, Yantai University, Yantai, Shandong, China
Huihui Li State Key Laboratory of Crop Gene Resources and Breeding, Institute of Crop Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, China Nanfan Research Institute, CAAS, Sanya, Hainan, China

Collapse

Hui T, Descoteaux ML, Miao J, Lin YS. Training Neural Network Models Using Molecular Dynamics Simulation Results to Efficiently Predict Cyclic Hexapeptide Structural Ensembles. J Chem Theory Comput 2023. [PMID: 37236147 PMCID: PMC10373485 DOI: 10.1021/acs.jctc.3c00154] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]

Wu F, Courty N, Jin S, Li SZ. Improving molecular representation learning with metric learning-enhanced optimal transport. PATTERNS (NEW YORK, N.Y.) 2023;4:100714. [PMID: 37123438 PMCID: PMC10140620 DOI: 10.1016/j.patter.2023.100714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/30/2022] [Revised: 12/29/2022] [Accepted: 03/01/2023] [Indexed: 05/02/2023]

Sanderson T, Bileschi ML, Belanger D, Colwell LJ. ProteInfer, deep neural networks for protein functional inference. eLife 2023;12:e80942. [PMID: 36847334 PMCID: PMC10063232 DOI: 10.7554/elife.80942] [Citation(s) in RCA: 54] [Impact Index Per Article: 27.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2022] [Accepted: 02/24/2023] [Indexed: 03/01/2023] Open

Wang F, Feng X, Kong R, Chang S. Generating new protein sequences by using dense network and attention mechanism. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023;20:4178-4197. [PMID: 36899622 DOI: 10.3934/mbe.2023195] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/18/2023]

Wu L, Yin C, Zhu J, Wu Z, He L, Xia Y, Xie S, Qin T, Liu TY. SPRoBERTa: protein embedding learning with local fragment modeling. Brief Bioinform 2022;23:6711410. [PMID: 36136367 DOI: 10.1093/bib/bbac401] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2022] [Revised: 07/18/2022] [Accepted: 08/18/2022] [Indexed: 12/14/2022] Open

An J, Weng X. Collectively encoding protein properties enriches protein language models. BMC Bioinformatics 2022;23:467. [DOI: 10.1186/s12859-022-05031-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2022] [Accepted: 10/31/2022] [Indexed: 11/10/2022] Open

Wu F, Jin S, Jiang Y, Jin X, Tang B, Niu Z, Liu X, Zhang Q, Zeng X, Li SZ. Pre-Training of Equivariant Graph Matching Networks with Conformation Flexibility for Drug Binding. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2022;9:e2203796. [PMID: 36202759 PMCID: PMC9685463 DOI: 10.1002/advs.202203796] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/01/2022] [Revised: 09/07/2022] [Indexed: 05/16/2023]

Controllable protein design with language models. NAT MACH INTELL 2022. [DOI: 10.1038/s42256-022-00499-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]

Bileschi ML, Belanger D, Bryant DH, Sanderson T, Carter B, Sculley D, Bateman A, DePristo MA, Colwell LJ. Using deep learning to annotate the protein universe. Nat Biotechnol 2022;40:932-937. [PMID: 35190689 DOI: 10.1038/s41587-021-01179-w] [Citation(s) in RCA: 127] [Impact Index Per Article: 42.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2021] [Accepted: 12/02/2021] [Indexed: 12/30/2022]

Feng G, Yao H, Li C, Liu R, Huang R, Fan X, Ge R, Miao Q. ME-ACP: Multi-view neural networks with ensemble model for identification of anticancer peptides. Comput Biol Med 2022;145:105459. [DOI: 10.1016/j.compbiomed.2022.105459] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2022] [Revised: 03/22/2022] [Accepted: 03/24/2022] [Indexed: 12/26/2022]

Pang Y, Liu B. SelfAT-Fold: Protein Fold Recognition Based on Residue-Based and Motif-Based Self-Attention Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022;19:1861-1869. [PMID: 33090951 DOI: 10.1109/tcbb.2020.3031888] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]

Hakala K, Kaewphan S, Bjorne J, Mehryary F, Moen H, Tolvanen M, Salakoski T, Ginter F. Neural Network and Random Forest Models in Protein Function Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022;19:1772-1781. [PMID: 33306472 DOI: 10.1109/tcbb.2020.3044230] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]

Villegas-Morcillo A, Gomez AM, Sanchez V. An analysis of protein language model embeddings for fold prediction. Brief Bioinform 2022;23:6571527. [PMID: 35443054 DOI: 10.1093/bib/bbac142] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2022] [Revised: 03/21/2022] [Accepted: 03/28/2022] [Indexed: 11/13/2022] Open

Zhang J, Yan K, Chen Q, Liu B. PreRBP-TL: prediction of species-specific RNA-binding proteins based on transfer learning. Bioinformatics 2022;38:2135-2143. [PMID: 35176130 DOI: 10.1093/bioinformatics/btac106] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2021] [Revised: 11/18/2021] [Accepted: 02/15/2022] [Indexed: 02/03/2023] Open

Detlefsen NS, Hauberg S, Boomsma W. Learning meaningful representations of protein sequences. Nat Commun 2022;13:1914. [PMID: 35395843 PMCID: PMC8993921 DOI: 10.1038/s41467-022-29443-w] [Citation(s) in RCA: 61] [Impact Index Per Article: 20.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2020] [Accepted: 03/15/2022] [Indexed: 01/27/2023] Open

Dee W. LMPred: predicting antimicrobial peptides using pre-trained language models and deep learning. BIOINFORMATICS ADVANCES 2022;2:vbac021. [PMID: 36699381 PMCID: PMC9710646 DOI: 10.1093/bioadv/vbac021] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/29/2021] [Revised: 03/01/2022] [Accepted: 03/29/2022] [Indexed: 01/28/2023]

Roethel A, Biliński P, Ishikawa T. BioS2Net: Holistic Structural and Sequential Analysis of Biomolecules Using a Deep Neural Network. Int J Mol Sci 2022;23:2966. [PMID: 35328384 PMCID: PMC8954277 DOI: 10.3390/ijms23062966] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2022] [Revised: 03/05/2022] [Accepted: 03/08/2022] [Indexed: 01/07/2023] Open

A deep learning model to detect novel pore-forming proteins. Sci Rep 2022;12:2013. [PMID: 35132124 PMCID: PMC8821639 DOI: 10.1038/s41598-022-05970-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2021] [Accepted: 01/12/2022] [Indexed: 11/09/2022] Open

Jin X, Luo X, Liu B. PHR-search: a search framework for protein remote homology detection based on the predicted protein hierarchical relationships. Brief Bioinform 2022;23:6520306. [PMID: 35134113 DOI: 10.1093/bib/bbab609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2021] [Revised: 12/14/2021] [Accepted: 12/30/2021] [Indexed: 11/13/2022] Open

Rorabaugh AK, Caíno-Lores S, Johnston T, Taufer M. High frequency accuracy and loss data of random neural networks trained on image datasets. Data Brief 2022;40:107780. [PMID: 35036484 PMCID: PMC8749157 DOI: 10.1016/j.dib.2021.107780] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2021] [Revised: 12/15/2021] [Accepted: 12/29/2021] [Indexed: 11/15/2022] Open

Abstract

Neural Networks (NNs) are increasingly used across scientific domains to extract knowledge from experimental or computational data. An NN is composed of natural or artificial neurons that serve as simple processing units and are interconnected into a model architecture; it acquires knowledge from the environment through a learning process and stores this knowledge in its connections. The learning process is conducted by training. During NN training, the learning process can be tracked by periodically validating the NN and calculating its fitness. The resulting sequence of fitness values (i.e., validation accuracy or validation loss) is called the NN learning curve. The development of tools for NN design requires knowledge of diverse NNs and their complete learning curves. Generally, only final fully-trained fitness values for highly accurate NNs are made available to the community, hampering efforts to develop tools for NN design and leaving unaddressed aspects such as explaining the generation of an NN and reproducing its learning process. Our dataset fills this gap by fully recording the structure, metadata, and complete learning curves for a wide variety of random NNs throughout their training. Our dataset captures the lifespan of 6000 NNs throughout generation, training, and validation stages. It consists of a suite of 6000 tables, each table representing the lifespan of one NN. We generate each NN with randomized parameter values and train it for 40 epochs on one of three diverse image datasets (i.e., CIFAR-100, FashionMNIST, SVHN). We calculate and record each NN's fitness with high frequency-every half epoch-to capture the evolution of the training and validation process. As a result, for each NN, we record the generated parameter values describing the structure of that NN, the image dataset on which the NN trained, and all loss and accuracy values for the NN every half epoch. We put our dataset to the service of researchers studying NN performance and its evolution throughout training and validation. Statistical methods can be applied to our dataset to analyze the shape of learning curves in diverse NNs, and the relationship between an NN's structure and its fitness. Additionally, the structural data and metadata that we record enable the reconstruction and reproducibility of the associated NN.

Collapse

Cui F, Li S, Zhang Z, Sui M, Cao C, El-Latif Hesham A, Zou Q. DeepMC-iNABP: Deep learning for multiclass identification and classification of nucleic acid-binding proteins. Comput Struct Biotechnol J 2022;20:2020-2028. [PMID: 35521556 PMCID: PMC9065708 DOI: 10.1016/j.csbj.2022.04.029] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Revised: 04/06/2022] [Accepted: 04/20/2022] [Indexed: 11/29/2022] Open

Madani M, Lin K, Tarakanova A. DSResSol: A Sequence-Based Solubility Predictor Created with Dilated Squeeze Excitation Residual Networks. Int J Mol Sci 2021;22:13555. [PMID: 34948354 PMCID: PMC8704505 DOI: 10.3390/ijms222413555] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2021] [Revised: 12/13/2021] [Accepted: 12/14/2021] [Indexed: 11/16/2022] Open

Abstract

Protein solubility is an important thermodynamic parameter that is critical for the characterization of a protein's function, and a key determinant for the production yield of a protein in both the research setting and within industrial (e.g., pharmaceutical) applications. Experimental approaches to predict protein solubility are costly, time-consuming, and frequently offer only low success rates. To reduce cost and expedite the development of therapeutic and industrially relevant proteins, a highly accurate computational tool for predicting protein solubility from protein sequence is sought. While a number of in silico prediction tools exist, they suffer from relatively low prediction accuracy, bias toward the soluble proteins, and limited applicability for various classes of proteins. In this study, we developed a novel deep learning sequence-based solubility predictor, DSResSol, that takes advantage of the integration of squeeze excitation residual networks with dilated convolutional neural networks and outperforms all existing protein solubility prediction models. This model captures the frequently occurring amino acid k-mers and their local and global interactions and highlights the importance of identifying long-range interaction information between amino acid k-mers to achieve improved accuracy, using only protein sequence as input. DSResSol outperforms all available sequence-based solubility predictors by at least 5% in terms of accuracy when evaluated by two different independent test sets. Compared to existing predictors, DSResSol not only reduces prediction bias for insoluble proteins but also predicts soluble proteins within the test sets with an accuracy that is at least 13% higher than existing models. We derive the key amino acids, dipeptides, and tripeptides contributing to protein solubility, identifying glutamic acid and serine as critical amino acids for protein solubility prediction. Overall, DSResSol can be used for the fast, reliable, and inexpensive prediction of a protein's solubility to guide experimental design.

Collapse

Villegas-Morcillo A, Gomez AM, Morales-Cordovilla JA, Sanchez V. Protein Fold Recognition From Sequences Using Convolutional and Recurrent Neural Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021;18:2848-2854. [PMID: 32750896 DOI: 10.1109/tcbb.2020.3012732] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]

Sandaruwan PD, Wannige CT. An improved deep learning model for hierarchical classification of protein families. PLoS One 2021;16:e0258625. [PMID: 34669708 PMCID: PMC8528337 DOI: 10.1371/journal.pone.0258625] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2020] [Accepted: 10/01/2021] [Indexed: 12/28/2022] Open

Villegas-Morcillo A, Sanchez V, Gomez AM. FoldHSphere: deep hyperspherical embeddings for protein fold recognition. BMC Bioinformatics 2021;22:490. [PMID: 34641786 PMCID: PMC8507389 DOI: 10.1186/s12859-021-04419-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2021] [Accepted: 09/29/2021] [Indexed: 12/01/2022] Open

Carrillo-Cabada H, Benson J, Razavi AM, Mulligan B, Cuendet MA, Weinstein H, Taufer M, Estrada T. A Graphic Encoding Method for Quantitative Classification of Protein Structure and Representation of Conformational Changes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021;18:1336-1349. [PMID: 31603792 PMCID: PMC9119144 DOI: 10.1109/tcbb.2019.2945291] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]

Zhang J, Chen Q, Liu B. DeepDRBP-2L: A New Genome Annotation Predictor for Identifying DNA-Binding Proteins and RNA-Binding Proteins Using Convolutional Neural Network and Long Short-Term Memory. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021;18:1451-1463. [PMID: 31722485 DOI: 10.1109/tcbb.2019.2952338] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]

Sanner MF, Dieguez L, Forli S, Lis E. Improving Docking Power for Short Peptides Using Random Forest. J Chem Inf Model 2021;61:3074-3090. [PMID: 34124893 PMCID: PMC8543977 DOI: 10.1021/acs.jcim.1c00573] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022]

Li G, Zrimec J, Ji B, Geng J, Larsbrink J, Zelezniak A, Nielsen J, Engqvist MK. Performance of Regression Models as a Function of Experiment Noise. Bioinform Biol Insights 2021;15:11779322211020315. [PMID: 34262264 PMCID: PMC8243133 DOI: 10.1177/11779322211020315] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2021] [Accepted: 04/29/2021] [Indexed: 11/21/2022] Open

Jin X, Liao Q, Liu B. S2L-PSIBLAST: a supervised two-layer search framework based on PSI-BLAST for protein remote homology detection. Bioinformatics 2021;37:4321-4327. [PMID: 34170287 DOI: 10.1093/bioinformatics/btab472] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2020] [Revised: 05/29/2021] [Accepted: 06/24/2021] [Indexed: 01/26/2023] Open

Jiang H, Fan X. The Two-Step Clustering Approach for Metastable States Learning. Int J Mol Sci 2021;22:6576. [PMID: 34205252 PMCID: PMC8233889 DOI: 10.3390/ijms22126576] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2021] [Revised: 06/14/2021] [Accepted: 06/14/2021] [Indexed: 01/20/2023] Open

Sequence-Based Prediction of Transmembrane Protein Crystallization Propensity. Interdiscip Sci 2021;13:693-702. [PMID: 34143353 DOI: 10.1007/s12539-021-00448-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2020] [Revised: 05/31/2021] [Accepted: 06/04/2021] [Indexed: 10/21/2022]

New evaluation methods of read mapping by 17 aligners on simulated and empirical NGS data: an updated comparison of DNA- and RNA-Seq data from Illumina and Ion Torrent technologies. Neural Comput Appl 2021;33:15669-15692. [PMID: 34155424 PMCID: PMC8208613 DOI: 10.1007/s00521-021-06188-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2020] [Accepted: 06/02/2021] [Indexed: 12/13/2022]

Osadchy M, Kolodny R. How Deep Learning Tools Can Help Protein Engineers Find Good Sequences. J Phys Chem B 2021;125:6440-6450. [PMID: 34105961 DOI: 10.1021/acs.jpcb.1c02449] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]