1
|
Grahame DSA, Dupuis JH, Bryksa BC, Tanaka T, Yada RY. Comparative bioinformatic and structural analyses of pepsin and renin. Enzyme Microb Technol 2020; 141:109632. [PMID: 33051007 DOI: 10.1016/j.enzmictec.2020.109632] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2020] [Revised: 06/25/2020] [Accepted: 07/08/2020] [Indexed: 11/16/2022]
Abstract
Pepsin, the archetypal pepsin-like aspartic protease, is irreversibly denatured when exposed to neutral pH conditions whereas renin, a structural homologue of pepsin, is fully stable and optimally active in the same conditions despite sharing highly similar enzyme architecture. To gain insight into the structural determinants of differential aspartic protease pH stability, the present study used comparative bioinformatic and structural analyses. In pepsin, an abundance of polar and aspartic acid residues were identified, a common trait with other acid-stable enzymes. Conversely, renin was shown to have increased levels of basic amino acids. In both pepsin and renin, the solvent exposure of these charged groups was high. Having similar overall acidic residue content, the solvent-exposed basic residues may allow for extensive salt bridge formation in renin, whereas in pepsin, these residues are protonated and serve to form stabilizing hydrogen bonds at low pH. Relative differences in structure and sequence in the turn and joint regions of the β-barrel and ψ-loop in both the N- and C-terminal lobes were identified as regions of interest in defining divergent pH stability. Compared to the structural rigidity of renin, pepsin has more instability associated with the N-terminus, specifically the B/C connector. By contrast, renin exhibits greater C-terminal instability in turn and connector regions. Overall, flexibility differences in connector regions, and amino acid composition, particularly in turn and joint regions of the β-barrel and ψ-loops, likely play defining roles in determining pH stability for renin and pepsin.
Collapse
Affiliation(s)
- Douglas S A Grahame
- Department of Food Science, Ontario Agricultural College, University of Guelph, Guelph, ON, N1G 2W1, Canada
| | - John H Dupuis
- Food, Nutrition, and Health Program, Faculty of Land and Food Systems, University of British Columbia, Vancouver, BC, V6T 1Z4 Canada
| | - Brian C Bryksa
- Department of Food Science, Ontario Agricultural College, University of Guelph, Guelph, ON, N1G 2W1, Canada
| | - Takuji Tanaka
- Department of Food and Bioproduct Sciences, College of Agriculture and Bioresources, University of Saskatchewan, Saskatoon, SK, S7N 5A8 Canada
| | - Rickey Y Yada
- Department of Food Science, Ontario Agricultural College, University of Guelph, Guelph, ON, N1G 2W1, Canada; Food, Nutrition, and Health Program, Faculty of Land and Food Systems, University of British Columbia, Vancouver, BC, V6T 1Z4 Canada.
| |
Collapse
|
2
|
Abstract
Abstract
Next Generation Sequencing (NGS) or deep sequencing technology enables parallel reading of multiple individual DNA fragments, thereby enabling the identification of millions of base pairs in several hours. Recent research has clearly shown that machine learning technologies can efficiently analyse large sets of genomic data and help to identify novel gene functions and regulation regions. A deep artificial neural network consists of a group of artificial neurons that mimic the properties of living neurons. These mathematical models, termed Artificial Neural Networks (ANN), can be used to solve artificial intelligence engineering problems in several different technological fields (e.g., biology, genomics, proteomics, and metabolomics). In practical terms, neural networks are non-linear statistical structures that are organized as modelling tools and are used to simulate complex genomic relationships between inputs and outputs. To date, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNN) have been demonstrated to be the best tools for improving performance in problem solving tasks within the genomic field.
Collapse
|
3
|
Abstract
The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.
Collapse
Affiliation(s)
- Maxwell W Libbrecht
- Department of Computer Science and Engineering, University of Washington, 185 Stevens Way, Seattle, Washington 98195-2350, USA
| | - William Stafford Noble
- 1] Department of Computer Science and Engineering, University of Washington, 185 Stevens Way, Seattle, Washington 98195-2350, USA. [2] Department of Genome Sciences, University of Washington, 3720 15th Ave NE Seattle, Washington 98195-5065, USA
| |
Collapse
|
4
|
Ning L, Liu G, Li G, Hou Y, Tong Y, He J. Current challenges in the bioinformatics of single cell genomics. Front Oncol 2014; 4:7. [PMID: 24478987 PMCID: PMC3902584 DOI: 10.3389/fonc.2014.00007] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2013] [Accepted: 01/12/2014] [Indexed: 11/13/2022] Open
Abstract
Single cell genomics is a rapidly growing field with many new techniques emerging in the past few years. However, few bioinformatics tools specific for single cell genomics analysis are available. Single cell DNA/RNA sequencing data usually have low genome coverage and high amplification bias, which makes bioinformatics analysis challenging. Many current bioinformatics tools developed for bulk cell sequencing do not work well with single cell sequencing data. Here, we summarize current challenges in the bioinformatics analysis of single cell genomic DNA sequencing and single cell transcriptomes. These challenges include calling copy number variations, identifying mutated genes in tumor samples, reconstructing cell lineages, recovering low abundant transcripts, and improving the accuracy of quantitative analysis of transcripts. Development in single cell genomics bioinformatics analysis will promote the application of this technology to basic biology and medical research.
Collapse
Affiliation(s)
- Luwen Ning
- Department of Biology, South University of Science and Technology of China , Shenzhen , China
| | | | | | | | - Yin Tong
- Department of Biology, South University of Science and Technology of China , Shenzhen , China
| | - Jiankui He
- Department of Biology, South University of Science and Technology of China , Shenzhen , China
| |
Collapse
|
5
|
Valentin JB, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J, Frellsen J, Mardia KV, Tian P, Hamelryck T. Formulation of probabilistic models of protein structure in atomic detail using the reference ratio method. Proteins 2013; 82:288-99. [PMID: 23934827 DOI: 10.1002/prot.24386] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2013] [Revised: 07/02/2013] [Accepted: 07/18/2013] [Indexed: 01/10/2023]
Abstract
We propose a method to formulate probabilistic models of protein structure in atomic detail, for a given amino acid sequence, based on Bayesian principles, while retaining a close link to physics. We start from two previously developed probabilistic models of protein structure on a local length scale, which concern the dihedral angles in main chain and side chains, respectively. Conceptually, this constitutes a probabilistic and continuous alternative to the use of discrete fragment and rotamer libraries. The local model is combined with a nonlocal model that involves a small number of energy terms according to a physical force field, and some information on the overall secondary structure content. In this initial study we focus on the formulation of the joint model and the evaluation of the use of an energy vector as a descriptor of a protein's nonlocal structure; hence, we derive the parameters of the nonlocal model from the native structure without loss of generality. The local and nonlocal models are combined using the reference ratio method, which is a well-justified probabilistic construction. For evaluation, we use the resulting joint models to predict the structure of four proteins. The results indicate that the proposed method and the probabilistic models show considerable promise for probabilistic protein structure prediction and related applications.
Collapse
Affiliation(s)
- Jan B Valentin
- The Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | | | | | | | | | | | | | | | | |
Collapse
|
6
|
Mardia KV. Statistical approaches to three key challenges in protein structural bioinformatics. J R Stat Soc Ser C Appl Stat 2013. [DOI: 10.1111/rssc.12003] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
7
|
Philips A, Milanowska K, Lach G, Boniecki M, Rother K, Bujnicki JM. MetalionRNA: computational predictor of metal-binding sites in RNA structures. ACTA ACUST UNITED AC 2011; 28:198-205. [PMID: 22110243 PMCID: PMC3259437 DOI: 10.1093/bioinformatics/btr636] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
Motivation: Metal ions are essential for the folding of RNA molecules into stable tertiary structures and are often involved in the catalytic activity of ribozymes. However, the positions of metal ions in RNA 3D structures are difficult to determine experimentally. This motivated us to develop a computational predictor of metal ion sites for RNA structures. Results: We developed a statistical potential for predicting positions of metal ions (magnesium, sodium and potassium), based on the analysis of binding sites in experimentally solved RNA structures. The MetalionRNA program is available as a web server that predicts metal ions for RNA structures submitted by the user. Availability: The MetalionRNA web server is accessible at http://metalionrna.genesilico.pl/. Contact:iamb@genesilico.pl Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Anna Philips
- Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology, ul. Ks. Trojdena 4, 02-109 Warsaw, Poland
| | | | | | | | | | | |
Collapse
|
8
|
Hamelryck T, Borg M, Paluszewski M, Paulsen J, Frellsen J, Andreetta C, Boomsma W, Bottaro S, Ferkinghoff-Borg J. Potentials of mean force for protein structure prediction vindicated, formalized and generalized. PLoS One 2010; 5:e13714. [PMID: 21103041 PMCID: PMC2978081 DOI: 10.1371/journal.pone.0013714] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2010] [Accepted: 10/04/2010] [Indexed: 11/26/2022] Open
Abstract
Understanding protein structure is of crucial importance in science, medicine and biotechnology. For about two decades, knowledge-based potentials based on pairwise distances – so-called “potentials of mean force” (PMFs) – have been center stage in the prediction and design of protein structure and the simulation of protein folding. However, the validity, scope and limitations of these potentials are still vigorously debated and disputed, and the optimal choice of the reference state – a necessary component of these potentials – is an unsolved problem. PMFs are loosely justified by analogy to the reversible work theorem in statistical physics, or by a statistical argument based on a likelihood function. Both justifications are insightful but leave many questions unanswered. Here, we show for the first time that PMFs can be seen as approximations to quantities that do have a rigorous probabilistic justification: they naturally arise when probability distributions over different features of proteins need to be combined. We call these quantities “reference ratio distributions” deriving from the application of the “reference ratio method.” This new view is not only of theoretical relevance but leads to many insights that are of direct practical use: the reference state is uniquely defined and does not require external physical insights; the approach can be generalized beyond pairwise distances to arbitrary features of protein structure; and it becomes clear for which purposes the use of these quantities is justified. We illustrate these insights with two applications, involving the radius of gyration and hydrogen bonding. In the latter case, we also show how the reference ratio method can be iteratively applied to sculpt an energy funnel. Our results considerably increase the understanding and scope of energy functions derived from known biomolecular structures.
Collapse
Affiliation(s)
- Thomas Hamelryck
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
- * E-mail: (TH); (JFB)
| | - Mikael Borg
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Martin Paluszewski
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Jonas Paulsen
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Jes Frellsen
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Christian Andreetta
- Bioinformatics Center, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Wouter Boomsma
- Biomedical Engineering, Technical University of Denmark (DTU) Elektro, Technical University of Denmark, Lyngby, Denmark
- Department of Chemistry, University of Cambridge, Cambridge, United Kingdom
| | - Sandro Bottaro
- Biomedical Engineering, Technical University of Denmark (DTU) Elektro, Technical University of Denmark, Lyngby, Denmark
| | - Jesper Ferkinghoff-Borg
- Biomedical Engineering, Technical University of Denmark (DTU) Elektro, Technical University of Denmark, Lyngby, Denmark
- * E-mail: (TH); (JFB)
| |
Collapse
|
9
|
Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. Beyond rotamers: a generative, probabilistic model of side chains in proteins. BMC Bioinformatics 2010; 11:306. [PMID: 20525384 PMCID: PMC2902450 DOI: 10.1186/1471-2105-11-306] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2010] [Accepted: 06/05/2010] [Indexed: 11/21/2022] Open
Abstract
Background Accurately covering the conformational space of amino acid side chains is essential for important applications such as protein design, docking and high resolution structure prediction. Today, the most common way to capture this conformational space is through rotamer libraries - discrete collections of side chain conformations derived from experimentally determined protein structures. The discretization can be exploited to efficiently search the conformational space. However, discretizing this naturally continuous space comes at the cost of losing detailed information that is crucial for certain applications. For example, rigorously combining rotamers with physical force fields is associated with numerous problems. Results In this work we present BASILISK: a generative, probabilistic model of the conformational space of side chains that makes it possible to sample in continuous space. In addition, sampling can be conditional upon the protein's detailed backbone conformation, again in continuous space - without involving discretization. Conclusions A careful analysis of the model and a comparison with various rotamer libraries indicates that the model forms an excellent, fully continuous model of side chain conformational space. We also illustrate how the model can be used for rigorous, unbiased sampling with a physical force field, and how it improves side chain prediction when used as a pseudo-energy term. In conclusion, BASILISK is an important step forward on the way to a rigorous probabilistic description of protein structure in continuous space and in atomic detail.
Collapse
Affiliation(s)
- Tim Harder
- The Bioinformatics Section, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | | | | | | | | | | |
Collapse
|
10
|
Paluszewski M, Hamelryck T. Mocapy++--a toolkit for inference and learning in dynamic Bayesian networks. BMC Bioinformatics 2010; 11:126. [PMID: 20226024 PMCID: PMC2848649 DOI: 10.1186/1471-2105-11-126] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2009] [Accepted: 03/12/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Mocapy++ is a toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs). It supports a wide range of DBN architectures and probability distributions, including distributions from directional statistics (the statistics of angles, directions and orientations). RESULTS The program package is freely available under the GNU General Public Licence (GPL) from SourceForge http://sourceforge.net/projects/mocapy. The package contains the source for building the Mocapy++ library, several usage examples and the user manual. CONCLUSIONS Mocapy++ is especially suitable for constructing probabilistic models of biomolecular structure, due to its support for directional statistics. In particular, it supports the Kent distribution on the sphere and the bivariate von Mises distribution on the torus. These distributions have proven useful to formulate probabilistic models of protein and RNA structure in atomic detail.
Collapse
|