Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Schütze K, Heinzinger M, Steinegger M, Rost B. Nearest neighbor search on embeddings rapidly identifies distant protein relations. Front Bioinform 2022;2:1033775. [PMID: 36466147 PMCID: PMC9714024 DOI: 10.3389/fbinf.2022.1033775] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2022] [Accepted: 10/31/2022] [Indexed: 11/29/2023] Open

For:	Schütze K, Heinzinger M, Steinegger M, Rost B. Nearest neighbor search on embeddings rapidly identifies distant protein relations. Front Bioinform 2022;2:1033775. [PMID: 36466147 PMCID: PMC9714024 DOI: 10.3389/fbinf.2022.1033775] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2022] [Accepted: 10/31/2022] [Indexed: 11/29/2023] Open

Number

Cited by Other Article(s)

Olson D, Colligan T, Demekas D, Roddy JW, Youens-Clark K, Wheeler TJ. NEAR: Neural Embeddings for Amino acid Relationships. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2024.01.25.577287. [PMID: 39896534 PMCID: PMC11785008 DOI: 10.1101/2024.01.25.577287] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/04/2025]

Heinzinger M, Rost B. Teaching AI to speak protein. Curr Opin Struct Biol 2025;91:102986. [PMID: 39985945 DOI: 10.1016/j.sbi.2025.102986] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2024] [Revised: 12/30/2024] [Accepted: 01/02/2025] [Indexed: 02/24/2025]

Weissenow K, Rost B. Are protein language models the new universal key? Curr Opin Struct Biol 2025;91:102997. [PMID: 39921962 DOI: 10.1016/j.sbi.2025.102997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2024] [Revised: 12/20/2024] [Accepted: 01/16/2025] [Indexed: 02/10/2025]

Luo J, Luo Y. Learning maximally spanning representations improves protein function annotation. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.02.13.638156. [PMID: 40027840 PMCID: PMC11870436 DOI: 10.1101/2025.02.13.638156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]

Abstract

Automated protein function annotation is a fundamental problem in computational biology, crucial for understanding the functional roles of proteins in biological processes, with broad implications in medicine and biotechnology. A persistent challenge in this problem is the imbalanced, long-tail distribution of available function annotations: a small set of well-studied function classes account for most annotated proteins, while many other classes have few annotated proteins, often due to investigative bias, experimental limitations, or intrinsic biases in protein evolution. As a result, existing machine learning models for protein function prediction tend to only optimize the prediction accuracy for well-studied function classes overrepresented in the training data, leading to poor accuracy for understudied functions. In this work, we develop MSRep, a novel deep learning-based protein function annotation framework designed to address this imbalance issue and improve annotation accuracy. MSRep is inspired by an intriguing phenomenon, called neural collapse (NC), commonly observed in high-accuracy deep neural networks used for classification tasks, where hidden representations in the final layer collapse to class-specific mean embeddings, while maintaining maximal inter-class separation. Given that NC consistently emerges across diverse architectures and tasks for high-accuracy models, we hypothesize that inducing NC structure in models trained on imbalanced data can enhance both prediction accuracy and generalizability. To achieve this, MSRep refines a pre-trained protein language model to produce NC-like representations by optimizing an NC-inspired loss function, which ensures that minority functions are equally represented in the embedding space as majority functions, in contrast to conventional classification methods whose embedding spaces are dominated by overrepresented classes. In evaluations across four protein function annotation tasks on the prediction of Enzyme Commission numbers, Gene3D codes, Pfam families, and Gene Ontology terms, MSRep demonstrates superior predictive performance for both well- and underrepresented classes, outperforming several state-of-the-art annotation tools. We anticipate that MSRep will enhance the annotation of understudied functions and novel, uncharacterized proteins, advancing future protein function studies and accelerating the discovery of new functional proteins. The source code of MSRep is available at https://github.com/luo-group/MSRep.

Collapse

Kilinc M, Jia K, Jernigan RL. Major advances in protein function assignment by remote homolog detection with protein language models - A review. Curr Opin Struct Biol 2025;90:102984. [PMID: 39864241 DOI: 10.1016/j.sbi.2025.102984] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2024] [Revised: 12/23/2024] [Accepted: 01/02/2025] [Indexed: 01/28/2025]

Kulikova AV, Parker JK, Davies BW, Wilke CO. Semantic search using protein large language models detects class II microcins in bacterial genomes. mSystems 2024;9:e0104424. [PMID: 39291976 PMCID: PMC11494933 DOI: 10.1128/msystems.01044-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2024] [Accepted: 08/28/2024] [Indexed: 09/19/2024] Open

Iovino BG, Tang H, Ye Y. Protein domain embeddings for fast and accurate similarity search. Genome Res 2024;34:1434-1444. [PMID: 39237301 PMCID: PMC11529836 DOI: 10.1101/gr.279127.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2024] [Accepted: 09/03/2024] [Indexed: 09/07/2024]

Erckert K, Rost B. Assessing the role of evolutionary information for enhancing protein language model embeddings. Sci Rep 2024;14:20692. [PMID: 39237735 PMCID: PMC11377704 DOI: 10.1038/s41598-024-71783-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2024] [Accepted: 08/30/2024] [Indexed: 09/07/2024] Open

Heinzinger M, Rost B. Artificial Intelligence Learns Protein Prediction. Cold Spring Harb Perspect Biol 2024;16:a041458. [PMID: 38858069 PMCID: PMC11368192 DOI: 10.1101/cshperspect.a041458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/12/2024]

Clark CM, Kwan JC. Creating and leveraging bespoke large-scale knowledge graphs for comparative genomics and multi-omics drug discovery with SocialGene. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.08.16.608329. [PMID: 39229008 PMCID: PMC11370487 DOI: 10.1101/2024.08.16.608329] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 09/05/2024]

Dickson A, Mofrad MRK. Fine-tuning protein embeddings for functional similarity evaluation. Bioinformatics 2024;40:btae445. [PMID: 38985218 PMCID: PMC11299545 DOI: 10.1093/bioinformatics/btae445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2024] [Revised: 06/25/2024] [Accepted: 07/09/2024] [Indexed: 07/11/2024] Open

Johnson SR, Peshwa M, Sun Z. Sensitive remote homology search by local alignment of small positional embeddings from protein language models. eLife 2024;12:RP91415. [PMID: 38488154 PMCID: PMC10942778 DOI: 10.7554/elife.91415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/17/2024] Open

Iovino BG, Ye Y. Protein embedding based alignment. BMC Bioinformatics 2024;25:85. [PMID: 38413857 PMCID: PMC10900708 DOI: 10.1186/s12859-024-05699-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2023] [Accepted: 02/12/2024] [Indexed: 02/29/2024] Open

Abstract

PURPOSE

Despite the many progresses with alignment algorithms, aligning divergent protein sequences with less than 20-35% pairwise identity (so called "twilight zone") remains a difficult problem. Many alignment algorithms have been using substitution matrices since their creation in the 1970's to generate alignments, however, these matrices do not work well to score alignments within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Similar to the traditional Smith-Waterman algorithm, PEbA uses a dynamic programming algorithm but the matching score of amino acids is based on the similarity of their embeddings from a protein language model.

METHODS

We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each one extracted from one of their multiple sequence alignments. Five different BAliBASE references were used, each with different sequence identities, motifs, and lengths, allowing PEbA to showcase how well it aligns under different circumstances.

RESULTS

PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments, achieving different levels of improvements of the alignment quality for pairs of sequences with different levels of similarity (over four times as well for pairs of sequences with <10% identity). We also compared PEbA with embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods.

CONCLUSION

Our results suggested that general purpose protein language models provide useful contextual information for generating more accurate protein alignments than typically used methods.

Collapse

Roddy JW, Rich DH, Wheeler TJ. nail: software for high-speed, high-sensitivity protein sequence annotation. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.27.577580. [PMID: 38352323 PMCID: PMC10862755 DOI: 10.1101/2024.01.27.577580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/22/2024]

Pantolini L, Studer G, Pereira J, Durairaj J, Tauriello G, Schwede T. Embedding-based alignment: combining protein language models with dynamic programming alignment to detect structural similarities in the twilight-zone. Bioinformatics 2024;40:btad786. [PMID: 38175775 PMCID: PMC10792726 DOI: 10.1093/bioinformatics/btad786] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2023] [Revised: 10/27/2023] [Accepted: 12/29/2023] [Indexed: 01/06/2024] Open

Kulikova AV, Parker JK, Davies BW, Wilke CO. Semantic search using protein large language models detects class II microcins in bacterial genomes. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.15.567263. [PMID: 38014091 PMCID: PMC10680697 DOI: 10.1101/2023.11.15.567263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/29/2023]

Kaminski K, Ludwiczak J, Pawlicki K, Alva V, Dunin-Horkawicz S. pLM-BLAST: distant homology detection based on direct comparison of sequence representations from protein language models. Bioinformatics 2023;39:btad579. [PMID: 37725369 PMCID: PMC10576641 DOI: 10.1093/bioinformatics/btad579] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2022] [Revised: 07/09/2023] [Accepted: 09/15/2023] [Indexed: 09/21/2023] Open

Sala D, Engelberger F, Mchaourab HS, Meiler J. Modeling conformational states of proteins with AlphaFold. Curr Opin Struct Biol 2023;81:102645. [PMID: 37392556 DOI: 10.1016/j.sbi.2023.102645] [Citation(s) in RCA: 76] [Impact Index Per Article: 38.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2023] [Revised: 05/16/2023] [Accepted: 06/01/2023] [Indexed: 07/03/2023]