1
Zhang C, Wang Q, Li Y, Teng A, Hu G, Wuyun Q, Zheng W. The Historical Evolution and Significance of Multiple Sequence Alignment in Molecular Structure and Function Prediction. Biomolecules 2024; 14:1531. [PMID: 39766238 PMCID: PMC11673352 DOI: 10.3390/biom14121531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Revised: 11/24/2024] [Accepted: 11/27/2024] [Indexed: 01/11/2025] Open
Abstract
Multiple sequence alignment (MSA) has evolved into a fundamental tool in the biological sciences, playing a pivotal role in predicting molecular structures and functions. With broad applications in protein and nucleic acid modeling, MSAs continue to underpin advancements across a range of disciplines. MSAs are not only foundational for traditional sequence comparison techniques but also increasingly important in the context of artificial intelligence (AI)-driven advancements. Recent breakthroughs in AI, particularly in protein and nucleic acid structure prediction, rely heavily on the accuracy and efficiency of MSAs to enhance remote homology detection and guide spatial restraints. This review traces the historical evolution of MSA, highlighting its significance in molecular structure and function prediction. We cover the methodologies used for protein monomers, protein complexes, and RNA, while also exploring emerging AI-based alternatives, such as protein language models, as complementary or replacement approaches to traditional MSAs in application tasks. By discussing the strengths, limitations, and applications of these methods, this review aims to provide researchers with valuable insights into MSA's evolving role, equipping them to make informed decisions in structural prediction research.
Affiliation(s)
- Chenyue Zhang
  - NITFID, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China; (C.Z.); (Y.L.); (G.H.)
- Qinxin Wang
  - Suzhou New & High-Tech Innovation Service Center, Suzhou 215011, China
- Yiyang Li
  - NITFID, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China; (C.Z.); (Y.L.); (G.H.)
- Anqi Teng
  - Bioscience and Biomedical Engineering Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511453, China
- Gang Hu
  - NITFID, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China; (C.Z.); (Y.L.); (G.H.)
- Qiqige Wuyun
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Wei Zheng
  - NITFID, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China; (C.Z.); (Y.L.); (G.H.)
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
2
Nandigrami P, Fiser A. Assessing the functional impact of protein binding site definition. Protein Sci 2024; 33:e5026. [PMID: 38757384 PMCID: PMC11099757 DOI: 10.1002/pro.5026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 05/01/2024] [Accepted: 05/03/2024] [Indexed: 05/18/2024]
Abstract
Many biomedical applications, such as classification of binding specificities or bioengineering, depend on the accurate definition of protein binding interfaces. Depending on the choice of method used, substantially different sets of residues can be classified as belonging to the interface of a protein. A typical approach used to verify these definitions is to mutate residues and measure the impact of these changes on binding. Besides the lack of exhaustive data, this approach also suffers from the fundamental problem that a mutation introduces an unknown amount of alteration into an interface, which potentially alters the binding characteristics of the interface. In this study we explore the impact of alternative binding site definitions on the ability of a protein to recognize its cognate ligand using a pharmacophore approach, which does not affect the interface. The study also shows that methods for protein binding interface predictions should perform above approximately F-score = 0.7 accuracy level to capture the biological function of a protein.
Affiliation(s)
- Prithviraj Nandigrami
  - Departments of Systems and Computational Biology, and Biochemistry, Albert Einstein College of Medicine, Bronx, New York, USA
- Andras Fiser
  - Departments of Systems and Computational Biology, and Biochemistry, Albert Einstein College of Medicine, Bronx, New York, USA
3
Go SR, Lee SJ, Ahn WC, Park KH, Woo EJ. Enhancing the thermostability and activity of glycosyltransferase UGT76G1 via computational design. Commun Chem 2023; 6:265. [PMID: 38057441 DOI: 10.1038/s42004-023-01070-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Accepted: 11/21/2023] [Indexed: 12/08/2023] Open
Abstract
The diterpene glycosyltransferase UGT76G1, derived from Stevia rebaudiana, plays a pivotal role in the biosynthesis of rebaudioside A, a natural sugar substitute. Nevertheless, its potential for industrial application is limited by certain enzymatic characteristics, notably thermostability. To enhance the thermostability and enzymatic activity, we employed a computational design strategy, merging stabilizing mutation scanning with a Rosetta-based protein design protocol. Compared to UGT76G1, the designed variant 76_4 exhibited a 9 °C increase in apparent Tm, a 2.55-fold increase in rebaudioside A production capacity, and a substantial 11% reduction in the undesirable byproduct rebaudioside I. Variant 76_7 also showed a 1.91-fold enhancement in rebaudioside A production capacity, which was maintained up to 55 °C, while the wild-type lost most of its activity. These results underscore the efficacy of structure-based design in introducing multiple mutations simultaneously, which significantly improves the enzymatic properties of UGT76G1. This strategy provides a method for the development of efficient, thermostable enzymes for industrial applications.
Affiliation(s)
- Seong-Ryeong Go
  - Critical Diseases Diagnostics Convergence Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
  - Department of Proteome Structural Biology, KRIBB School of Bioscience, University of Science and Technology (UST), Daejeon, 34113, Republic of Korea
- Su-Jin Lee
  - Critical Diseases Diagnostics Convergence Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
  - Department of Proteome Structural Biology, KRIBB School of Bioscience, University of Science and Technology (UST), Daejeon, 34113, Republic of Korea
- Woo-Chan Ahn
  - Critical Diseases Diagnostics Convergence Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
- Kwang-Hyun Park
  - Critical Diseases Diagnostics Convergence Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
  - Department of Proteome Structural Biology, KRIBB School of Bioscience, University of Science and Technology (UST), Daejeon, 34113, Republic of Korea
- Eui-Jeon Woo
  - Department of Proteome Structural Biology, KRIBB School of Bioscience, University of Science and Technology (UST), Daejeon, 34113, Republic of Korea
  - Disease Target Structure Research Center, Korea Research Institute of Bioscience & Biotechnology (KRIBB), Daejeon, 34141, Republic of Korea
4
Grudman S, Fajardo JE, Fiser A. Optimal selection of suitable templates in protein interface prediction. Bioinformatics 2023; 39:btad510. [PMID: 37603727 PMCID: PMC10491951 DOI: 10.1093/bioinformatics/btad510] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Revised: 07/11/2023] [Accepted: 08/18/2023] [Indexed: 08/23/2023] Open
Abstract
MOTIVATION Molecular-level classification of protein-protein interfaces can greatly assist in functional characterization and rational drug design. The most accurate protein interface predictions rely on finding homologous proteins with known interfaces since most interfaces are conserved within the same protein family. The accuracy of these template-based prediction approaches depends on the correct choice of suitable templates. Choosing the right templates in the immunoglobulin superfamily (IgSF) is challenging because its members share low sequence identity and display a wide range of alternative binding sites despite structural homology. RESULTS We present a new approach to predict protein interfaces. First, template-specific, informative evolutionary profiles are established using a mutual information-based approach. Next, based on the similarity of residue level conservation scores derived from the evolutionary profiles, a query protein is hierarchically clustered with all available template proteins in its superfamily with known interface definitions. Once clustered, a subset of the most closely related templates is selected, and an interface prediction is made. These initial interface predictions are subsequently refined by extensive docking. This method was benchmarked on 51 IgSF proteins and can predict nontrivial interfaces of IgSF proteins with an average and median F-score of 0.64 and 0.78, respectively. We also provide a way to assess the confidence of the results. The average and median F-scores increase to 0.8 and 0.81, respectively, if 27% of low confidence cases and 17% of medium confidence cases are removed. Lastly, we provide residue level interface predictions, protein complexes, and confidence measurements for singletons in the IgSF. AVAILABILITY AND IMPLEMENTATION Source code is freely available at: https://gitlab.com/fiserlab.org/interdct_with_refinement.
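The F-scores quoted in this abstract compare a predicted set of interface residues with a known one. The following is a minimal sketch of that comparison, assuming the balanced F1 form (harmonic mean of precision and recall); the abstract does not state which F-score variant was used, and the function name and residue numbers are hypothetical.

```python
def interface_f_score(predicted, actual):
    """F1 score between predicted and known interface residue sets.

    Assumption: balanced F1 (harmonic mean of precision and recall).
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        # No overlap (or an empty set): define the score as zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical residue numbers, for illustration only.
predicted_residues = {10, 11, 12, 45, 46}
known_residues = {11, 12, 13, 45, 46, 47}
score = interface_f_score(predicted_residues, known_residues)
print(round(score, 3))  # 4 shared residues: precision 4/5, recall 4/6
```

Under this definition, a prediction that recovers four of six true interface residues with one false positive scores about 0.73.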
Affiliation(s)
- Steven Grudman
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
- J Eduardo Fajardo
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
- Andras Fiser
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
5
Krishnan SR, Soares RRG, Madaboosi N, Gromiha MM. AutoPLP: A Padlock Probe Design Pipeline for Zoonotic Pathogens. ACS Infect Dis 2023; 9:459-469. [PMID: 36790094 DOI: 10.1021/acsinfecdis.2c00436] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 02/16/2023]
Abstract
Emergence of novel zoonotic infections among the human population has increased the burden on global healthcare systems to curb their spread. To meet the evolutionary agility of pathogens, it is essential to revamp the existing diagnostic methods for early detection and characterization of the pathogens at the molecular level. Padlock probes (PLPs), which can leverage the power of isothermal nucleic acid amplification techniques (NAAT) such as rolling circle amplification (RCA), are known for their high sensitivity and specificity in detecting a diverse pathogen panel of interest. However, due to the complexity involved in deciding the target regions for PLP design and the need for optimization of multiple experimental parameters, the applicability of RCA has been limited in point-of-care testing for pathogen detection. To address this gap, we have developed a novel and integrated PLP design pipeline named AutoPLP, which can automate the probe design process for a diverse pathogen panel of interest. The pipeline is composed of three modules which can perform sequence data curation, multiple sequence alignment, conservation analysis, filtration based on experimental parameters (Tm, GC content, and secondary structure formation), and in silico probe validation via potential cross-hybridization check with host genome. The modules can also take into account the backbone and restriction site information, appropriate combinations of which are incorporated along with the probe arms to design a complete probe sequence. The potential applications of AutoPLP are showcased through the design of PLPs for the detection of rabies virus and drug-resistant strains of Mycobacterium tuberculosis.
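The filtration step on experimental parameters can be illustrated with a small sketch. This is not the AutoPLP implementation: the Tm estimate uses the Wallace rule (2 °C per A/T, 4 °C per G/C), a rough approximation for short oligonucleotides, and the threshold ranges, function names, and sequences are hypothetical. The secondary-structure check is omitted.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rough melting temperature (deg C) by the Wallace rule.

    Assumption: a crude stand-in for whatever Tm model a real
    pipeline would use.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def passes_filters(seq, gc_range=(0.40, 0.60), tm_range=(50, 65)):
    """Keep a candidate probe arm only if GC content and Tm are in range.

    The default ranges are illustrative, not taken from the paper.
    """
    gc_ok = gc_range[0] <= gc_content(seq) <= gc_range[1]
    tm_ok = tm_range[0] <= wallace_tm(seq) <= tm_range[1]
    return gc_ok and tm_ok


# Hypothetical candidate arms.
candidates = ["ATGCGTACCGTTAGCCTAGG", "ATATATATTTAAATATATTA"]
kept = [arm for arm in candidates if passes_filters(arm)]
print(kept)
```

With these illustrative thresholds, the balanced first candidate is kept and the AT-only second candidate is rejected.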
Affiliation(s)
- Sowmya Ramaswamy Krishnan
  - Protein Bioinformatics Lab, Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai 600036, India
  - TCS Research (Life Sciences Division), Tata Consultancy Services, Hyderabad 500081, India
- Ruben R G Soares
  - Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna SE-17121, Sweden
- Narayanan Madaboosi
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai 600036, India
- M Michael Gromiha
  - Protein Bioinformatics Lab, Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai 600036, India
  - International Research Frontiers Initiative, School of Computing, Tokyo Institute of Technology, Yokohama 226-8501, Japan
6
Nandigrami P, Fiser A. Assessing the functional impact of protein binding site definition. bioRxiv: The Preprint Server for Biology 2023:2023.01.26.525812. [PMID: 36747792 PMCID: PMC9900911 DOI: 10.1101/2023.01.26.525812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2023]
Abstract
Many biomedical applications, such as classification of binding specificities or bioengineering, depend on the accurate definition of protein binding interfaces. Depending on the choice of method used, substantially different sets of residues can be classified as belonging to the interface of a protein. A typical approach used to verify these definitions is to mutate residues and measure the impact of these changes on binding. Besides the lack of exhaustive data this approach generates, it also suffers from the fundamental problem that a mutation introduces an unknown amount of alteration into an interface, which potentially alters the binding characteristics of the interface. In this study we explore the impact of alternative binding site definitions on the ability of a protein to recognize its cognate ligand using a pharmacophore approach, which does not affect the interface. The study also provides guidance on the minimum expected accuracy of interface definition that is required to capture the biological function of a protein.
Affiliation(s)
- Prithviraj Nandigrami
  - Departments of Systems & Computational Biology, and Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Ave, Bronx, NY 10461, USA
- Andras Fiser
  - Departments of Systems & Computational Biology, and Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Ave, Bronx, NY 10461, USA
7
Liu Z, Yu DJ. cpxDeepMSA: A Deep Cascade Algorithm for Constructing Multiple Sequence Alignments of Protein–Protein Interactions. Int J Mol Sci 2022; 23:ijms23158459. [PMID: 35955594 PMCID: PMC9369210 DOI: 10.3390/ijms23158459] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2022] [Revised: 07/18/2022] [Accepted: 07/28/2022] [Indexed: 12/10/2022] Open
Abstract
Protein–protein interactions (PPIs) are fundamental to many biological processes. The coevolution-based prediction of interacting residues has made great strides in protein complexes that are known to interact. A multiple sequence alignment (MSA) is the basis of coevolution analysis. MSAs have recently made significant progress in protein monomer sequence analysis. However, no standard or efficient pipelines are available for sensitive protein complex MSA (cpxMSA) collection. How to generate a cpxMSA is one of the most challenging problems of sequence coevolution analysis. Although several methods have been developed to address this problem, no standalone program exists. Furthermore, the number of built-in properties is limited; hence, it is often difficult for users to analyze sequence coevolution according to their desired cpxMSA. In this article, we developed a novel cpxMSA approach (cpxDeepMSA). We used different protein monomer databases and incorporated three strategies (genomic distance, phylogeny information, and STRING interaction network) to join the monomer MSA results of protein complexes, so that the failure of any single strategy to join the two monomer MSAs does not cause the cpxMSA construction to fail. We anticipate that the cpxDeepMSA algorithm will become a useful high-throughput tool in protein complex structure prediction, inter-protein residue-residue contact prediction, and biological sequence coevolution analysis.
8
Walder M, Edelstein E, Carroll M, Lazarev S, Fajardo JE, Fiser A, Viswanathan R. Integrated structure-based protein interface prediction. BMC Bioinformatics 2022; 23:301. [PMID: 35879651 PMCID: PMC9316365 DOI: 10.1186/s12859-022-04852-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2022] [Accepted: 07/18/2022] [Indexed: 11/29/2022] Open
Abstract
Background Identifying protein interfaces can inform how proteins interact with their binding partners, uncover the regulatory mechanisms that control biological functions and guide the development of novel therapeutic agents. A variety of computational approaches have been developed for predicting a protein’s interfacial residues from its known sequence and structure. Methods using the known three-dimensional structures of proteins can be template-based or template-free. Template-based methods have limited success in predicting interfaces when homologues with known complex structures are not available to use as templates. The prediction performance of template-free methods that rely only upon proteins’ intrinsic properties is limited by the number of biologically relevant features that can be included in an interface prediction model. Results We describe the development of an integrated method for protein interface prediction (ISPIP) to explore the hypothesis that the efficacy of a computational prediction method of protein binding sites can be enhanced by using a combination of methods that rely on orthogonal structure-based properties of a query protein, combining and balancing both template-free and template-based features. ISPIP is a method that integrates these approaches through simple linear or logistic regression models and more complex decision tree models. On a diverse test set of 156 query proteins, ISPIP outperforms each of its individual classifiers in identifying protein binding interfaces. Conclusions The integrated method captures the best performance of individual classifiers and delivers an improved interface prediction. The method is robust and performs well even when one of the individual classifiers performs poorly on a particular query protein. This work demonstrates that integrating orthogonal methods that depend on different structural properties of proteins performs better at interface prediction than any individual classifier alone.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04852-2.
Affiliation(s)
- M Walder
  - Department of Chemistry, Yeshiva College, Yeshiva University, New York, NY, 10033, USA
- E Edelstein
  - Department of Chemistry, Yeshiva College, Yeshiva University, New York, NY, 10033, USA
- M Carroll
  - Department of Chemistry, Yeshiva College, Yeshiva University, New York, NY, 10033, USA
- S Lazarev
  - Department of Chemistry, Yeshiva College, Yeshiva University, New York, NY, 10033, USA
- J E Fajardo
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY, 10461, USA
- A Fiser
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY, 10461, USA
- R Viswanathan
  - Department of Chemistry, Yeshiva College, Yeshiva University, New York, NY, 10033, USA
9
Kostenko DO, Korotkov EV. Application of the MAHDS Method for Multiple Alignment of Highly Diverged Amino Acid Sequences. Int J Mol Sci 2022; 23:ijms23073764. [PMID: 35409125 PMCID: PMC8998981 DOI: 10.3390/ijms23073764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2022] [Revised: 03/23/2022] [Accepted: 03/23/2022] [Indexed: 12/10/2022] Open
Abstract
The aim of this work was to compare the multiple alignment methods MAHDS, T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK in their ability to align highly divergent amino acid sequences. To accomplish this, we created test amino acid sequences with an average number of substitutions per amino acid (x) from 0.6 to 5.6, a total of 81 sets. Comparison of the performance of sequence alignments constructed by MAHDS and previously developed algorithms using the CS and Z score criteria and the benchmark alignment database (BAliBASE) indicated that, although the quality of the alignments built with MAHDS was somewhat lower than that of the other algorithms, it was compensated by greater statistical significance. MAHDS could construct statistically significant alignments of artificial sequences with x ≤ 4.8, whereas the other algorithms (T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK) could not perform that at x > 2.4. The application of MAHDS to align 21 families of highly diverged proteins (identity < 20%) from Pfam and HOMSTRAD databases showed that it could calculate statistically significant alignments in cases when the other methods failed. Thus, MAHDS could be used to construct statistically significant multiple alignments of highly divergent protein sequences, which accumulated multiple mutations during evolution.
10
Grudman S, Fajardo JE, Fiser A. INTERCAAT: identifying interface residues between macromolecules. Bioinformatics 2021; 38:554-555. [PMID: 34499117 PMCID: PMC8722752 DOI: 10.1093/bioinformatics/btab596] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/09/2021] [Revised: 07/21/2021] [Accepted: 09/08/2021] [Indexed: 02/03/2023] Open
Abstract
SUMMARY The Interface Contact definition with Adaptable Atom Types (INTERCAAT) was developed to determine the atomic interactions between molecules that form a known three-dimensional structure. First, INTERCAAT creates a Voronoi tessellation where each atom acts as a seed. Interactions are defined by atoms that share a hyperplane and whose distance is less than the sum of the two atoms' van der Waals radii plus the diameter of a solvent molecule. Interacting atoms are then classified and interactions are filtered based on compatibility. INTERCAAT implements an adaptive atom classification method; therefore, it can explore interfaces between a variety of macromolecules. AVAILABILITY AND IMPLEMENTATION Source code is freely available at: https://gitlab.com/fiserlab.org/intercaat. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
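The distance part of the contact definition above can be sketched as follows. This covers only the cutoff test; the Voronoi shared-face condition, atom classification, and compatibility filtering are not reproduced. The radii are Bondi van der Waals values, and the 2.8 Å solvent diameter (a water probe of radius 1.4 Å) is an assumption, not a value taken from the paper. All names are hypothetical.

```python
import math

# Bondi van der Waals radii in angstroms for a few common elements.
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

# Assumed solvent diameter: water modelled as a sphere of radius 1.4 A.
SOLVENT_DIAMETER = 2.8


def within_contact_distance(atom_a, atom_b, solvent_diameter=SOLVENT_DIAMETER):
    """Return True if two atoms are closer than r_a + r_b + solvent diameter.

    Each atom is an (element, (x, y, z)) pair. Only the distance criterion
    is checked here; the shared Voronoi face test is left out.
    """
    (elem_a, xyz_a), (elem_b, xyz_b) = atom_a, atom_b
    cutoff = VDW_RADII[elem_a] + VDW_RADII[elem_b] + solvent_diameter
    return math.dist(xyz_a, xyz_b) < cutoff


# Hypothetical coordinates in angstroms.
close_pair = within_contact_distance(("N", (0.0, 0.0, 0.0)), ("O", (5.0, 0.0, 0.0)))
far_pair = within_contact_distance(("C", (0.0, 0.0, 0.0)), ("C", (6.5, 0.0, 0.0)))
print(close_pair, far_pair)
```

For the N/O pair the cutoff is 1.55 + 1.52 + 2.8 = 5.87 Å, so a 5.0 Å separation qualifies; for the C/C pair the cutoff is 6.2 Å, so 6.5 Å does not.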
Affiliation(s)
- Steven Grudman
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
- J Eduardo Fajardo
  - Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
11
Scherer M, Fleishman SJ, Jones PR, Dandekar T, Bencurova E. Computational Enzyme Engineering Pipelines for Optimized Production of Renewable Chemicals. Front Bioeng Biotechnol 2021; 9:673005. [PMID: 34211966 PMCID: PMC8239229 DOI: 10.3389/fbioe.2021.673005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/26/2021] [Accepted: 05/06/2021] [Indexed: 11/13/2022] Open
Abstract
To enable a sustainable supply of chemicals, novel biotechnological solutions are required that replace the reliance on fossil resources. One potential solution is to utilize tailored biosynthetic modules for the metabolic conversion of CO2 or organic waste to chemicals and fuel by microorganisms. Currently, it is challenging to commercialize biotechnological processes for renewable chemical biomanufacturing because of a lack of highly active and specific biocatalysts. As experimental methods to engineer biocatalysts are time- and cost-intensive, it is important to establish efficient and reliable computational tools that can speed up the identification or optimization of selective, highly active, and stable enzyme variants for utilization in the biotechnological industry. Here, we review and suggest combinations of effective state-of-the-art software and online tools available for computational enzyme engineering pipelines to optimize metabolic pathways for the biosynthesis of renewable chemicals. Using examples relevant for biotechnology, we explain the underlying principles of enzyme engineering and design and illuminate future directions for automated optimization of biocatalysts for the assembly of synthetic metabolic pathways.
Affiliation(s)
- Marc Scherer
  - Department of Bioinformatics, Julius-Maximilians University of Würzburg, Würzburg, Germany
- Sarel J Fleishman
  - Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, Israel
- Patrik R Jones
  - Department of Life Sciences, Imperial College London, London, United Kingdom
- Thomas Dandekar
  - Department of Bioinformatics, Julius-Maximilians University of Würzburg, Würzburg, Germany
- Elena Bencurova
  - Department of Bioinformatics, Julius-Maximilians University of Würzburg, Würzburg, Germany
12
Valiente-Mullor C, Beamud B, Ansari I, Francés-Cuesta C, García-González N, Mejía L, Ruiz-Hueso P, González-Candelas F. One is not enough: On the effects of reference genome for the mapping and subsequent analyses of short-reads. PLoS Comput Biol 2021; 17:e1008678. [PMID: 33503026 PMCID: PMC7870062 DOI: 10.1371/journal.pcbi.1008678] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Received: 04/27/2020] [Revised: 02/08/2021] [Accepted: 01/05/2021] [Indexed: 12/17/2022] Open
Abstract
Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of a reference may represent a source of errors that may affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legionella pneumophila, Neisseria gonorrhoeae, Pseudomonas aeruginosa and Serratia marcescens). Publicly available whole-genome assemblies encompassing the genomic diversity of these species were selected as reference sequences, and read alignment statistics, SNP calling, recombination rates, dN/dS ratios, and phylogenetic trees were evaluated depending on the mapping reference. The choice of different reference genomes proved to have an impact on almost all the parameters considered in the five species. In addition, these biases had potential epidemiological implications such as including/excluding isolates of particular clades and the estimation of genetic distances. These findings suggest that the single reference approach might introduce systematic errors during mapping that affect subsequent analyses, particularly for data sets with isolates from genetically diverse backgrounds. In any case, exploring the effects of different references on the final conclusions is highly recommended. Mapping consists in the alignment of reads (i.e., DNA fragments) obtained through high-throughput genome sequencing to a previously assembled reference sequence. It is a common practice in genomic studies to use a single reference for mapping, usually the ‘reference genome’ of a species—a high-quality assembly. However, the selection of an optimal reference is hindered by intrinsic intra-species genetic variability, particularly in bacteria. 
It is known that genetic differences between the reference genome and the read sequences may produce incorrect alignments during mapping. Eventually, these errors could lead to misidentification of variants and biased reconstruction of phylogenetic trees (which reflect ancestry between different bacterial lineages). To our knowledge, this is the first work to systematically examine the effect of different references for mapping on the inference of tree topology as well as the impact on recombination and natural selection inferences. Furthermore, the novelty of this work relies on a procedure that guarantees that we are evaluating only the effect of the reference. This effect has proved to be pervasive in the five bacterial species that we have studied and, in some cases, alterations in phylogenetic trees could lead to incorrect epidemiological inferences. Hence, the use of different reference genomes may be prescriptive to assess the potential biases of mapping.
Affiliation(s)
- Carlos Valiente-Mullor
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Beatriz Beamud
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
  - * E-mail: (BB); (FG-C)
- Iván Ansari
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Carlos Francés-Cuesta
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Neris García-González
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Lorena Mejía
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
  - Instituto de Microbiología, Colegio de Ciencias Biológicas y Ambientales, Universidad San Francisco de Quito, Quito, Ecuador
- Paula Ruiz-Hueso
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
- Fernando González-Candelas
  - Joint Research Unit “Infection and Public Health” FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, Spain
  - CIBER in Epidemiology and Public Health, Valencia, Spain
  - * E-mail: (BB); (FG-C)
13
|
Motoyama T, Hiramatsu N, Asano Y, Nakano S, Ito S. Protein Sequence Selection Method That Enables Full Consensus Design of Artificial l-Threonine 3-Dehydrogenases with Unique Enzymatic Properties. Biochemistry 2020; 59:3823-3833. [PMID: 32945652 DOI: 10.1021/acs.biochem.0c00570] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/29/2022]
Abstract
Exponentially increasing protein sequence data enables artificial enzyme design using sequence-based protein design methods, including full-consensus protein design (FCD). The success of artificial enzyme design is strongly dependent on the nature of the sequences used. Hence, sequences must be selected from databases and curated libraries prepared to enable a successful design by FCD. In this study, we proposed a selection approach that treats several key residues as sequence motifs. We used l-threonine 3-dehydrogenase (TDH) as a model to test the validity of this approach. In the classification, four residues (143, 174, 188, and 214) were used as key residues. We classified thousands of TDH homologous sequences into five groups, each containing hundreds of sequences. Using the sequences in these libraries, we designed five artificial TDHs by FCD. Among the five, we successfully expressed four in soluble form. Biochemical analysis of the artificial TDHs indicated that their enzymatic properties vary; half of the maximum measured enzyme activity (t1/2) and activation energies were distributed from 53 to 65 °C and from 38 to 125 kJ/mol, respectively. The artificial TDHs had unique kinetic parameters, distinct from one another. Structural analysis indicates that consensus mutations are mainly introduced in the secondary or outer shell. The functional diversity of the artificial TDHs is due to the accumulation of mutations that affect their physicochemical properties. Taken together, our findings indicate that our proposed approach can help generate artificial enzymes with unique enzymatic properties.
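The selection idea in this abstract (classify homologs by a few key residues, then derive one consensus per group) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the toy sequences, and the use of a plain per-column majority vote are assumptions made for illustration, and the key residue numbers (143, 174, 188, 214) would first have to be mapped to alignment columns, which is not shown here.

```python
from collections import Counter, defaultdict


def group_by_key_residues(aligned_seqs, key_columns):
    """Group aligned sequences by the residues found at the key columns.

    aligned_seqs: equal-length strings taken from one alignment.
    key_columns: 0-based alignment columns (hypothetical mapping of the
    key residue positions onto the alignment).
    """
    groups = defaultdict(list)
    for seq in aligned_seqs:
        motif = "".join(seq[col] for col in key_columns)
        groups[motif].append(seq)
    return dict(groups)


def consensus(aligned_seqs):
    """Return the most frequent residue per column, ignoring gaps.

    Columns that contain only gaps are skipped.
    """
    residues = []
    for column in zip(*aligned_seqs):
        counts = Counter(r for r in column if r != "-")
        if counts:
            residues.append(counts.most_common(1)[0][0])
    return "".join(residues)


# Toy alignment; column 0 plays the role of a single key residue.
toy = ["ACDE", "ACDF", "GCDE", "ACDE"]
groups = group_by_key_residues(toy, key_columns=[0])
designs = {motif: consensus(seqs) for motif, seqs in groups.items()}
```

In this sketch each motif group yields one candidate sequence, which mirrors the one-design-per-library outcome described in the abstract without reproducing its actual selection criteria.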
Affiliation(s)
- Tomoharu Motoyama
- Graduate School of Integrated Pharmaceutical and Nutritional Sciences, University of Shizuoka, 52-1 Yada, Suruga-ku, Shizuoka 422-8526, Japan
- Nozomi Hiramatsu
- Graduate School of Integrated Pharmaceutical and Nutritional Sciences, University of Shizuoka, 52-1 Yada, Suruga-ku, Shizuoka 422-8526, Japan
- Yasuhisa Asano
- Biotechnology Research Center and Department of Biotechnology, Toyama Prefectural University, 5180 Kurokawa, Imizu, Toyama 939-0398, Japan
- Shogo Nakano
- Graduate School of Integrated Pharmaceutical and Nutritional Sciences, University of Shizuoka, 52-1 Yada, Suruga-ku, Shizuoka 422-8526, Japan
- Sohei Ito
- Graduate School of Integrated Pharmaceutical and Nutritional Sciences, University of Shizuoka, 52-1 Yada, Suruga-ku, Shizuoka 422-8526, Japan
|
14
|
Zhang C, Zheng W, Mortuza SM, Li Y, Zhang Y. DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins. Bioinformatics 2020; 36:2105-2112. [PMID: 31738385 DOI: 10.1093/bioinformatics/btz863] [Citation(s) in RCA: 110] [Impact Index Per Article: 22.0] [Received: 07/19/2019] [Revised: 10/17/2019] [Accepted: 11/15/2019] [Indexed: 12/23/2022] Open
Abstract
MOTIVATION: The success of genome sequencing techniques has resulted in a rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information for modeling the structure and function of unknown proteins. There is, however, no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved. RESULTS: We developed DeepMSA, a new open-source method for sensitive MSA construction, in which homologous sequences and alignments are created from multiple sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was used to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which increased the accuracy of long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs were run for homologous structure identification, where the average TM-score of the template alignments increased by over 7.5% with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. It is noted that all these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library. AVAILABILITY AND IMPLEMENTATION: https://zhanglab.ccmb.med.umich.edu/DeepMSA/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- S M Mortuza
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Yang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
|
15
|
Shrestha R, Fajardo E, Gil N, Fidelis K, Kryshtafovych A, Monastyrskyy B, Fiser A. Assessing the accuracy of contact predictions in CASP13. Proteins 2019; 87:1058-1068. [PMID: 31587357 PMCID: PMC6851495 DOI: 10.1002/prot.25819] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Received: 04/25/2019] [Revised: 09/17/2019] [Accepted: 09/17/2019] [Indexed: 01/07/2023]
Abstract
The accuracy of sequence-based tertiary contact predictions was assessed in a blind prediction experiment at the CASP13 meeting. After 4 years of significant improvements in prediction accuracy, another dramatic advance has taken place since CASP12 was held 2 years ago. The precision of predicting the top L/5 contacts in the free modeling category, where L is the corresponding length of the protein in residues, has exceeded 70%. As a comparison, the best-performing group at CASP12 with a 47% precision would have finished below the top 1/3 of the CASP13 groups. Extensively trained deep neural network approaches dominate the top performing algorithms, which appear to efficiently integrate information on coevolving residues and interacting fragments or possibly utilize memories of sequence similarities and sometimes can deliver accurate results even in the absence of virtually any target specific evolutionary information. If the current performance is evaluated by F-score on L contacts, it stands around 24% right now, which, despite the tremendous impact and advance in improving its utility for structure modeling, also suggests that there is much room left for further improvement.
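The "top L/5" precision quoted in this abstract can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the CASP assessors' code: the function name, the dictionary input format, and the default minimum sequence separation of 24 residues (a common convention for long-range contacts, not stated in the abstract) are all assumptions.

```python
def top_fraction_precision(scores, true_contacts, length,
                           fraction=5, min_separation=24):
    """Precision of the top length/fraction predicted contacts.

    scores: dict mapping residue pairs (i, j) with i < j to a predicted
        contact score (higher means more confident).
    true_contacts: set of (i, j) pairs observed in the native structure.
    length: protein length L in residues.
    min_separation: assumed minimum |i - j| for a pair to be considered.
    """
    # Keep only pairs that are far enough apart along the sequence.
    candidates = [(pair, score) for pair, score in scores.items()
                  if pair[1] - pair[0] >= min_separation]
    # Rank by score, most confident first.
    candidates.sort(key=lambda item: item[1], reverse=True)
    k = max(1, length // fraction)
    top_pairs = [pair for pair, _ in candidates[:k]]
    if not top_pairs:
        return 0.0
    hits = sum(1 for pair in top_pairs if pair in true_contacts)
    return hits / len(top_pairs)


# Toy example: L = 10, so L/5 = 2 pairs are evaluated.
toy_scores = {(0, 5): 0.9, (1, 8): 0.8, (2, 9): 0.1}
toy_truth = {(0, 5)}
toy_precision = top_fraction_precision(toy_scores, toy_truth, length=10,
                                       min_separation=3)
```

Here one of the two evaluated pairs is a true contact, so the toy precision is 0.5. Evaluating all L predicted contacts with an F-score, as the abstract also mentions, would additionally require recall over the full set of native contacts.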
Affiliation(s)
- Rojan Shrestha
- Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
- Eduardo Fajardo
- Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
- Nelson Gil
- Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
- Krzysztof Fidelis
- Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis CA 95616-8816, USA
- Andriy Kryshtafovych
- Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis CA 95616-8816, USA
- Bohdan Monastyrskyy
- Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis CA 95616-8816, USA
- Andras Fiser
- Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
|
16
|
Gil N, Fajardo EJ, Fiser A. Discovery of receptor-ligand interfaces in the immunoglobulin superfamily. Proteins 2019; 88:135-142. [PMID: 31298437 DOI: 10.1002/prot.25778] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 01/09/2019] [Revised: 06/21/2019] [Accepted: 07/06/2019] [Indexed: 12/13/2022]
Abstract
Cell-surface-anchored immunoglobulin superfamily (IgSF) proteins are widespread throughout the human proteome, forming crucial components of diverse biological processes including immunity, cell-cell adhesion, and carcinogenesis. IgSF proteins generally function through protein-protein interactions carried out between extracellular, membrane-bound proteins on adjacent cells, known as trans-binding interfaces. These protein-protein interactions constitute a class of pharmaceutical targets important in the treatment of autoimmune diseases, chronic infections, and cancer. A molecular-level understanding of IgSF protein-protein interactions would greatly benefit further drug development. A critical step toward this goal is the reliable identification of IgSF trans-binding interfaces. We propose a novel combination of structure and sequence information to identify trans-binding interfaces in IgSF proteins. We developed a structure-based binding interface prediction approach that can identify broad regions of the protein surface that encompass the binding interfaces; its results suggest that IgSF proteins possess binding supersites. These interfaces could theoretically be pinpointed using sequence-based conservation analysis, with performance approaching the theoretical upper limit of binding interface prediction accuracy, but achieving this in practice is limited by the current ability to identify an appropriate multiple sequence alignment for conservation analysis. However, an important contribution of combining the two orthogonal methods is that agreement between these approaches can estimate the reliability of the predictions. This approach was benchmarked on the set of 22 IgSF proteins with experimentally solved structures in complex with their ligands. Additionally, we provide structure-based predictions and reliability scores for the 62 IgSF proteins with known structure but as yet uncharacterized binding interfaces.
Affiliation(s)
- Nelson Gil
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, New York; Department of Biochemistry, Albert Einstein College of Medicine, Bronx, New York
- Eduardo J Fajardo
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, New York; Department of Biochemistry, Albert Einstein College of Medicine, Bronx, New York
- Andras Fiser
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, New York; Department of Biochemistry, Albert Einstein College of Medicine, Bronx, New York
|