1. Mbogo I, Kawano C, Nakamura R, Tsuchiya Y, Villar-Briones A, Hirao Y, Yasuoka Y, Hayakawa E, Tomii K, Watanabe H. A transphyletic study of metazoan β-catenin protein complexes. ZOOLOGICAL LETTERS 2024; 10:20. [PMID: 39623505 PMCID: PMC11613877 DOI: 10.1186/s40851-024-00243-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2024] [Accepted: 10/22/2024] [Indexed: 12/06/2024]
Abstract
Beta-catenin is essential for diverse biological processes, such as body axis determination and cell differentiation, during metazoan embryonic development. Beta-catenin is thought to exert such functions through complexes formed with various proteins. Although β-catenin complex proteins have been identified in several bilaterians, little is known about the structural and functional properties of β-catenin complexes in early metazoan evolution. In the present study, we performed a comparative analysis of β-catenin sequences in nonbilaterian lineages that diverged early in metazoan evolution. We also carried out transphyletic function experiments with β-catenin from nonbilaterian metazoans using developing Xenopus embryos, including secondary axis induction in embryos and proteomic analysis of β-catenin protein complexes. Comparative functional analysis of nonbilaterian β-catenins demonstrated sequence characteristics important for β-catenin functions, and the deep origin and evolutionary conservation of the cadherin-catenin complex. Proteins that co-immunoprecipitated with β-catenin included several proteins conserved among metazoans. These data provide new insights into the conserved repertoire of β-catenin complexes.
Affiliation(s)
- Ivan Mbogo
  - Evolutionary Neurobiology Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
  - Sysmex Corporation, Ltd. 1-5-1, Chuo-ku, Kobe, 651-0073, Japan
- Chihiro Kawano
  - Evolutionary Neurobiology Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
- Ryotaro Nakamura
  - Evolutionary Neurobiology Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
- Yuko Tsuchiya
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
- Alejandro Villar-Briones
  - Instrumental Analysis Section, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
  - Project Planning and Implementation Section, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
- Yoshitoshi Hirao
  - Instrumental Analysis Section, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
- Yuuri Yasuoka
  - Marine Genomics Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
  - Laboratory for Comprehensive Genomic Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Eisuke Hayakawa
  - Evolutionary Neurobiology Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
  - Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4, Kawazu, Iizuka, 820-8502, Fukuoka, Japan
- Kentaro Tomii
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
- Hiroshi Watanabe
  - Evolutionary Neurobiology Unit, Okinawa Institute of Science and Technology Graduate University, Okinawa, Japan
2. Imanbayeva A, Duisenova N, Orazov A, Sagyndykova M, Belozerov I, Tuyakova A. Study of the Floristic, Morphological, and Genetic (atpF-atpH, Internal Transcribed Spacer (ITS), matK, psbK-psbI, rbcL, and trnH-psbA) Differences in Crataegus ambigua Populations in Mangistau (Kazakhstan). PLANTS (BASEL, SWITZERLAND) 2024; 13:1591. [PMID: 38931023 PMCID: PMC11207986 DOI: 10.3390/plants13121591] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2024] [Revised: 06/05/2024] [Accepted: 06/06/2024] [Indexed: 06/28/2024]
Abstract
This article studies the morphological parameters of vegetative and generative organs of different age groups of Crataegus ambigua from four populations in Western Karatau (Mangistau region, Kazakhstan). In this study, we examined four populations: Sultan Epe, Karakozaiym, Emdikorgan, and Samal, all located in various gorges of Western Karatau. Several phylogenetic inference methods were applied, using six genetic markers to reconstruct the evolutionary relationships between these populations: atpF-atpH, internal transcribed spacer (ITS), matK, psbK-psbI, rbcL, and trnH-psbA. We also used a statistical analysis of plants' vegetative and generative organs for three age groups (virgin, young, and adult generative). According to the age structure, Samal has a high concentration of young generative plants (42.3%) and adult generative plants (30.9%). Morphological analysis showed the significance of the parameters of the generative organs and separated the Samal population into a separate group according to the primary principal component analysis (PCoA) coordinates. The results of the floristic analysis showed that the Samal populations have a high concentration of species diversity. Comparative dendrograms using UPGMA (unweighted pair group method with arithmetic mean) showed that information gleaned from genetic markers and the psbK-psbI region can be used to determine the difference between the fourth Samal population and the other three.
Affiliation(s)
- Aidyn Orazov
  - Laboratory of Natural Flora and Dendrology, Mangyshlak Experimental Botanical Garden, Aktau 130000, Kazakhstan; (A.I.); (N.D.); (M.S.); (I.B.); (A.T.)
3. Islam S, Pantazes RJ. Developing similarity matrices for antibody-protein binding interactions. PLoS One 2023; 18:e0293606. [PMID: 37883504 PMCID: PMC10602319 DOI: 10.1371/journal.pone.0293606] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 10/17/2023] [Indexed: 10/28/2023]
Abstract
The inventions of AlphaFold and RoseTTAFold are revolutionizing computational protein science due to their abilities to reliably predict protein structures. Their unprecedented successes are due to the parallel consideration of several types of information, one of which is protein sequence similarity information. Sequence homology has been studied for many decades and depends on similarity matrices to define how similar or different protein sequences are to one another. A natural extension of predicting protein structures is predicting the interactions between proteins, but similarity matrices for protein-protein interactions do not exist. This study conducted a mutational analysis of 384 non-redundant antibody-protein antigen complexes to calculate antibody-protein interaction similarity matrices. Every important residue in each antibody and each antigen was mutated to each of the other 19 commonly occurring amino acids and the percentage changes in interaction energies were calculated using three force fields: CHARMM, Amber, and Rosetta. The data were used to construct six interaction similarity matrices, one for antibodies and another for antigens using each force field. The matrices exhibited both commonalities, such as mutations of aromatic and charged residues being the most detrimental, and differences, such as Rosetta predicting mutations of serines to be better tolerated than either Amber or CHARMM. A comparison to nine previously published similarity matrices for protein sequences revealed that the new interaction matrices are more similar to one another than they are to any of the previous matrices. The created similarity matrices can be used in force field specific applications to help guide decisions regarding mutations in protein-protein binding interfaces.
Affiliation(s)
- Sumaiya Islam
  - Department of Chemical Engineering, Auburn University, Auburn, Alabama, United States of America
- Robert J. Pantazes
  - Department of Chemical Engineering, Auburn University, Auburn, Alabama, United States of America
4. Jia K, Kilinc M, Jernigan RL. New alignment method for remote protein sequences by the direct use of pairwise sequence correlations and substitutions. FRONTIERS IN BIOINFORMATICS 2023; 3:1227193. [PMID: 37900964 PMCID: PMC10602800 DOI: 10.3389/fbinf.2023.1227193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2023] [Accepted: 08/14/2023] [Indexed: 10/31/2023]
Abstract
Understanding protein sequences and how they relate to the functions of proteins is extremely important. One of the most basic operations in bioinformatics is sequence alignment, and usually the first thing learned from an alignment is which positions are the most conserved; these are often critical parts of the structure, such as enzyme active site residues. In addition, the contact pairs in a protein usually correspond closely to the correlations between residue positions in the multiple sequence alignment, and these usually change in a systematic and coordinated way: if one position changes, then the other member of the pair also changes to compensate. In the present work, these correlated pairs are taken as anchor points for a new type of sequence alignment. The main advantage of the method is that it combines the remote homolog detection of our method PROST with the pairwise sequence substitutions of the rigorous method from Kleinjung et al. We show a few examples of the resulting sequence alignments and how they can lead to improvements in alignments for function, even for a disordered protein.
Affiliation(s)
- Kejue Jia
  - Roy J. Carver Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, United States
- Mesih Kilinc
  - Roy J. Carver Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, United States
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, United States
- Robert L. Jernigan
  - Roy J. Carver Department of Biochemistry, Biophysics, and Molecular Biology, Iowa State University, Ames, IA, United States
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA, United States
5. Caswell B, Summers TJ, Licup GL, Cantu DC. Mutation Space of Spatially Conserved Amino Acid Sites in Proteins. ACS OMEGA 2023; 8:24302-24310. [PMID: 37457482 PMCID: PMC10339398 DOI: 10.1021/acsomega.3c01473] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2023] [Accepted: 06/14/2023] [Indexed: 07/18/2023]
Abstract
The mutation space of spatially conserved (MSSC) amino acid residues is a protein structural quantity developed and described in this work. The MSSC quantifies how many mutations and which different mutations, i.e., the mutation space, occur in each amino acid site in a protein. The MSSC calculates the mutation space of amino acids in a target protein from the spatially conserved residues in a group of multiple protein structures. Spatially conserved amino acid residues are identified based on their relative positions in the protein structure. The MSSC examines each residue in a target protein, compares it to the residues present in the same relative position in other protein structures, and uses physicochemical criteria of mutations found in each conserved spatial site to quantify the mutation space of each amino acid in the target protein. The MSSC is analogous to scoring each site in a multiple sequence alignment but in three-dimensional space considering the spatial location of residues instead of solely the order in which they appear in a protein sequence. MSSC analysis was performed on example cases, and it reproduces the well-known observation that, regardless of secondary structure, solvent-exposed residues are more likely to be mutated than internal ones. The MSSC code is available on GitHub: "https://github.com/Cantu-Research-Group/Mutation_Space".
6. Chang CH, Nelson WC, Jerger A, Wright AT, Egbert RG, McDermott JE. Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding. BIOINFORMATICS ADVANCES 2023; 3:vbad005. [PMID: 36789294 PMCID: PMC9913046 DOI: 10.1093/bioadv/vbad005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2022] [Revised: 12/16/2022] [Accepted: 02/01/2023] [Indexed: 02/04/2023]
Abstract
Motivation The vast expansion of sequence data generated from single organisms and microbiomes has precipitated the need for faster and more sensitive methods to assess evolutionary and functional relationships between proteins. Representing proteins as sets of short peptide sequences (kmers) has been used for rapid, accurate classification of proteins into functional categories; however, this approach employs an exact-match methodology and thus may be limited in terms of sensitivity and coverage. We have previously used similarity groupings, based on the chemical properties of amino acids, to form reduced character sets and recode proteins. This amino acid recoding (AAR) approach simplifies the construction of protein representations in the form of kmer vectors, which can link sequences with distant sequence similarity and provide accurate classification of problematic protein families. Results Here, we describe Snekmer, a software tool for recoding proteins into AAR kmer vectors and performing either (i) construction of supervised classification models trained on input protein families or (ii) clustering for de novo determination of protein families. We provide examples of the operation of the tool against a set of nitrogen cycling families originally collected using both standard hidden Markov models and a larger set of proteins from Uniprot and demonstrate that our method accurately differentiates these sequences in both operation modes. Availability and implementation Snekmer is written in Python using Snakemake. Code and data used in this article, along with tutorial notebooks, are available at http://github.com/PNNL-CompBio/Snekmer under an open-source BSD-3 license. Supplementary information Supplementary data are available at Bioinformatics Advances online.
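The recoding idea in this abstract can be sketched in a few lines of Python. This is a minimal illustration and not Snekmer's own code: the three-group reduced alphabet in `GROUPS`, the function names `recode` and `kmer_counts`, and the default `k` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical reduced alphabet (illustrative grouping only, not Snekmer's):
# each of the 20 standard amino acids is mapped to a one-letter group code.
GROUPS = {
    "H": "AVLIMFWC",  # broadly hydrophobic
    "P": "STNQYGP",   # broadly polar / small
    "C": "DEKRH",     # broadly charged
}
RECODE = {aa: code for code, members in GROUPS.items() for aa in members}


def recode(seq):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(RECODE[aa] for aa in seq.upper())


def kmer_counts(seq, k=3):
    """Count overlapping k-mers of the recoded sequence."""
    reduced = recode(seq)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


# Two sequences with no identical residues share the same recoded k-mers.
print(recode("MKVL"), recode("LRIA"))
print(kmer_counts("MKVL", k=2) == kmer_counts("LRIA", k=2))
```

Because exact-match kmers on the original 20-letter alphabet would treat `MKVL` and `LRIA` as unrelated, the example shows why a reduced alphabet can link sequences with only distant similarity.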
Affiliation(s)
- Christine H Chang
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA 99352, USA
- William C Nelson
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA 99352, USA
- Abby Jerger
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA 99352, USA
- Aaron T Wright
  - Department of Biology, Baylor University, Waco, TX 76798, USA
- Robert G Egbert
  - Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA 99352, USA
7. Aledo P, Aledo JC. Proteome-Wide Structural Computations Provide Insights into Empirical Amino Acid Substitution Matrices. Int J Mol Sci 2023; 24:ijms24010796. [PMID: 36614247 PMCID: PMC9821064 DOI: 10.3390/ijms24010796] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/07/2022] [Revised: 12/24/2022] [Accepted: 12/29/2022] [Indexed: 01/04/2023]
Abstract
The relative contribution of mutation and selection to the amino acid substitution rates observed in empirical matrices is unclear. Herein, we present a neutral continuous fitness-stability model, inspired by the Arrhenius law ($q_{ij} = a_{ij} e^{-\Delta\Delta G_{ij}}$). The model postulates that the rate of amino acid substitution ($i \to j$) is determined by the product of a pre-exponential factor, which is influenced by the genetic code structure, and an exponential term reflecting the relative fitness of the amino acid substitutions. To assess the validity of our model, we computed changes in stability of 14,094 proteins, for which 137,073,638 in silico mutants were analyzed. These site-specific data were summarized into a 20 × 20 matrix, whose entries, $\Delta\Delta G_{ij}$, were obtained after averaging through all the sites in all the proteins. We found a significant positive correlation between these energy values and the disease-causing potential of each substitution, suggesting that the exponential term accurately summarizes the fitness effect. A remarkable observation was that amino acids that were highly destabilizing when acting as the source tended to have little effect when acting as the destination, and vice versa (source → destination). The Arrhenius model accurately reproduced the pattern of substitution rates collected in the empirical matrices, suggesting a relevant role for the genetic code structure and a tuning role for purifying selection exerted via protein stability.
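The Arrhenius-like rate in this abstract is simple enough to evaluate directly. The sketch below is illustrative only: the function name `substitution_rate` and the numeric inputs are hypothetical, and the exponent is treated as already dimensionless, exactly as the formula is written in the abstract.

```python
import math


def substitution_rate(a_ij, ddg_ij):
    """Rate q_ij = a_ij * exp(-ddG_ij) for a substitution i -> j.

    a_ij   : pre-exponential factor (in the abstract, shaped by the genetic code)
    ddg_ij : average stability change of the substitution, assumed dimensionless here
    """
    return a_ij * math.exp(-ddg_ij)


# Hypothetical values: same pre-exponential factor, different stability cost.
mild = substitution_rate(a_ij=1.0, ddg_ij=0.5)
harsh = substitution_rate(a_ij=1.0, ddg_ij=2.0)
print(mild > harsh)  # the more destabilizing substitution gets the lower rate
```

With the pre-exponential factor held fixed, the rate falls monotonically as the stability penalty grows, which is the "tuning role" for purifying selection that the abstract describes.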
8. Sumanaweera D, Allison L, Konagurthu AS. Bridging the gaps in statistical models of protein alignment. Bioinformatics 2022; 38:i229-i237. [PMID: 35758809 PMCID: PMC9235498 DOI: 10.1093/bioinformatics/btac246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
Summary Sequences of proteins evolve by accumulating substitutions together with insertions and deletions (indels) of amino acids. However, it remains a common practice to disconnect substitutions and indels, and infer approximate models for each of them separately, to quantify sequence relationships. Although this approach brings with it computational convenience (which remains its primary motivation), there is a dearth of attempts to unify and model them systematically and together. To overcome this gap, this article demonstrates how a complete statistical model quantifying the evolution of pairs of aligned proteins can be constructed using a time-parameterized substitution matrix and a time-parameterized alignment state machine. Methods to derive all parameters of such a model from any benchmark collection of aligned protein sequences are described here. This has not only allowed us to generate a unified statistical model for each of the nine widely used substitution matrices (PAM, JTT, BLOSUM, JO, WAG, VTML, LG, MIQS and PFASUM), but also resulted in a new unified model, MMLSUM. Our underlying methodology measures the Shannon information content using each model to explain losslessly any given collection of alignments, which has allowed us to quantify the performance of all the above models on six comprehensive alignment benchmarks. Our results show that MMLSUM results in a new and clear overall best performance, followed by PFASUM, VTML, BLOSUM and MIQS, respectively, amongst the top five. We further analyze the statistical properties of MMLSUM model and contrast it with others. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dinithi Sumanaweera
  - Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia
- Lloyd Allison
  - Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia
- Arun S Konagurthu
  - Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia
9. Paiva VA, Mendonça MV, Silveira SA, Ascher DB, Pires DEV, Izidoro SC. GASS-Metal: identifying metal-binding sites on protein structures using genetic algorithms. Brief Bioinform 2022; 23:6590153. [PMID: 35595534 DOI: 10.1093/bib/bbac178] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/24/2022] [Revised: 04/18/2022] [Accepted: 04/20/2022] [Indexed: 12/12/2022]
Abstract
Metals are present in >30% of proteins found in nature and assist them to perform important biological functions, including storage, transport, signal transduction and enzymatic activity. Traditional and experimental techniques for metal-binding site prediction are usually costly and time-consuming, making computational tools that can assist in these predictions of significant importance. Here we present Genetic Active Site Search (GASS)-Metal, a new method for protein metal-binding site prediction. The method relies on a parallel genetic algorithm to find candidate metal-binding sites that are structurally similar to curated templates from M-CSA and MetalPDB. GASS-Metal was thoroughly validated using homologous proteins and conservative mutations of residues, showing a robust performance. The ability of GASS-Metal to identify metal-binding sites was also compared with state-of-the-art methods, outperforming similar methods and achieving an MCC of up to 0.57 and detecting up to 96.1% of the sites correctly. GASS-Metal is freely available at https://gassmetal.unifei.edu.br. The GASS-Metal source code is available at https://github.com/sandroizidoro/gassmetal-local.
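For readers unfamiliar with the MCC figure quoted in this abstract, the Matthews correlation coefficient is computed from the four confusion-matrix counts. The sketch below is a generic implementation, not code from GASS-Metal; the function name `mcc` and the convention of returning 0.0 when the denominator vanishes are assumptions of this example.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0.0 when a row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for a site predictor (not data from the paper).
print(mcc(tp=40, tn=45, fp=5, fn=10))
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right), and unlike plain accuracy it stays informative when binding sites are much rarer than non-sites.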
Affiliation(s)
- Vinícius A Paiva
  - Department of Computer Science, Universidade Federal de Viçosa, Viçosa, Brazil
- Murillo V Mendonça
  - Institute of Technological Sciences, Campus Theodomiro Carneiro Santiago, Universidade Federal de Itajubá, Itabira, Brazil
- Sabrina A Silveira
  - Department of Computer Science, Universidade Federal de Viçosa, Viçosa, Brazil
- David B Ascher
  - School of Chemistry and Molecular Biosciences, University of Queensland, St Lucia, Queensland, Australia
  - Systems and Computational Biology, Bio21 Institute, University of Melbourne, Melbourne, Victoria, Australia
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
  - Baker Department of Cardiometabolic Health, University of Melbourne, Melbourne, Victoria, Australia
- Douglas E V Pires
  - Systems and Computational Biology, Bio21 Institute, University of Melbourne, Melbourne, Victoria, Australia
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
  - School of Computing and Information Systems, University of Melbourne, Melbourne, Victoria, Australia
- Sandro C Izidoro
  - Institute of Technological Sciences, Campus Theodomiro Carneiro Santiago, Universidade Federal de Itajubá, Itabira, Brazil
10. Yamamori Y, Tomii K. Application of Homology Modeling by Enhanced Profile-Profile Alignment and Flexible-Fitting Simulation to Cryo-EM Based Structure Determination. Int J Mol Sci 2022; 23:1977. [PMID: 35216093 PMCID: PMC8879198 DOI: 10.3390/ijms23041977] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/17/2021] [Revised: 02/07/2022] [Accepted: 02/09/2022] [Indexed: 12/03/2022]
Abstract
Application of cryo-electron microscopy (cryo-EM) is crucially important for ascertaining the atomic structure of large biomolecules such as ribosomes and protein complexes in membranes. Advances in cryo-EM technology and software have made it possible to obtain data with near-atomic resolution, but the method is still often capable of producing only a density map with up to medium resolution, either partially or entirely. Therefore, bridging the gap separating the density map and the atomic model is necessary. Herein, we propose a methodology for constructing atomic structure models based on cryo-EM maps with low-to-medium resolution. The method is a combination of sensitive and accurate homology modeling using our profile-profile alignment method with a flexible-fitting method using molecular dynamics simulation. As described herein, this study used benchmark applications to evaluate the model constructions of human two-pore channel 2 (one target protein in CASP13 with its structure determined using cryo-EM data) and the overall structure of Enterococcus hirae V-ATPase complex.
Affiliation(s)
- Yu Yamamori
  - Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan
- Kentaro Tomii
  - Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan
  - AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan
11. Jia K, Jernigan RL. New amino acid substitution matrix brings sequence alignments into agreement with structure matches. Proteins 2021; 89:671-682. [PMID: 33469973 PMCID: PMC8641535 DOI: 10.1002/prot.26050] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/02/2020] [Revised: 01/08/2021] [Accepted: 01/12/2021] [Indexed: 12/27/2022]
Abstract
Protein sequence matching presently fails to identify many structures that are highly similar, even when they are known to have the same function. The high packing densities in globular proteins lead to interdependent substitutions, which have not previously been considered for amino acid similarities. At present, sequence matching compares sequences based only upon the similarities of single amino acids, ignoring the fact that in densely packed proteins there are additional conservative substitutions representing exchanges between two interacting amino acids, such as a small-large pair changing to a large-small pair: substitutions that are not individually so conservative. Here we show that including information for such pairs of substitutions yields improved sequence matches, and that these yield significant gains in the agreement between sequence alignments and structure matches of the same protein pair. The result shows sequence segments matched where structure segments are aligned. There are gains for all 2002 collected cases in which the sequence alignments were not previously congruent with the structure matches. Our results also demonstrate a significant gain in detecting homology for “twilight zone” protein sequences. The amino acid substitution metrics derived have many other potential applications, for annotations, protein design, mutagenesis design, and empirical potential derivation.
Affiliation(s)
- Kejue Jia
  - Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa, USA
- Robert L Jernigan
  - Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa, USA
12. Saito-Nakano Y, Wahyuni R, Nakada-Tsukui K, Tomii K, Nozaki T. Rab7D small GTPase is involved in phago-, trogocytosis and cytoskeletal reorganization in the enteric protozoan Entamoeba histolytica. Cell Microbiol 2020; 23:e13267. [PMID: 32975360 PMCID: PMC7757265 DOI: 10.1111/cmi.13267] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/16/2020] [Revised: 08/21/2020] [Accepted: 09/18/2020] [Indexed: 12/12/2022]
Abstract
Rab small GTPases regulate membrane traffic between distinct cellular compartments of all eukaryotes in a tempo-spatially specific fashion. Rab small GTPases are also involved in the regulation of the cytoskeleton and signalling. Membrane traffic and cytoskeletal regulation play a pivotal role in the pathogenesis of Entamoeba histolytica, which is a protozoan parasite responsible for human amebiasis. E. histolytica is unique in that its genome encodes over 100 Rab proteins, containing multiple isotypes of conserved members (e.g., Rab7) and Entamoeba-specific subgroups (e.g., RabA, B, and X). Among them, E. histolytica Rab7 is the most diversified group, consisting of nine isotypes. While it was previously demonstrated that EhRab7A and EhRab7B are involved in lysosome and phagosome biogenesis, the individual roles of other Rab7 members and their coordination remain elusive. In this study, we characterised the third member of Rab7, Rab7D, to better understand the significance of the multiplicity of Rab7 isotypes in E. histolytica. Overexpression of EhRab7D caused a reduction in phagocytosis of erythrocytes, trogocytosis (meaning nibbling or chewing of a portion) of live mammalian cells, and phagosome acidification and maturation. Conversely, transcriptional gene silencing of the EhRab7D gene caused opposite phenotypes in phago/trogocytosis and phagosome maturation. Furthermore, EhRab7D gene silencing caused a reduction in the attachment to and the motility on the collagen-coated surface. Image analysis showed that EhRab7D was occasionally associated with lysosomes and prephagosomal vacuoles, but not with mature phagosomes and trogosomes. Finally, in silico prediction of structural organisation of EhRab7 isotypes identified unique amino acid changes on the effector binding surface of EhRab7D.
Taken together, our data suggest that EhRab7D plays coordinated counteracting roles: an inhibitory role in phago/trogocytosis and lyso/phago/trogosome biogenesis, and a stimulatory role in adherence and motility, presumably via interaction with unique effectors. Finally, we propose a model in which three EhRab7 isotypes are sequentially involved in phago/trogocytosis.
Affiliation(s)
- Yumiko Saito-Nakano
  - Department of Parasitology, National Institute of Infectious Diseases, Tokyo, Japan
- Ratna Wahyuni
  - Department of Biomedical Chemistry, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
  - Institute of Tropical Disease, Universitas Airlangga, Surabaya, Indonesia
  - Department of Health, Faculty of Vocational Studies, Universitas Airlangga, Surabaya, Indonesia
- Kumiko Nakada-Tsukui
  - Department of Parasitology, National Institute of Infectious Diseases, Tokyo, Japan
- Kentaro Tomii
  - Artificial Intelligence Research Center (AIRC) and Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
- Tomoyoshi Nozaki
  - Department of Biomedical Chemistry, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
13. Polyanovsky V, Lifanov A, Esipova N, Tumanyan V. The ranging of amino acids substitution matrices of various types in accordance with the alignment accuracy criterion. BMC Bioinformatics 2020; 21:294. [PMID: 32921315 PMCID: PMC7489204 DOI: 10.1186/s12859-020-03616-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/11/2020] [Accepted: 06/18/2020] [Indexed: 11/15/2022]
Abstract
Background: The alignment of character sequences is important in bioinformatics. The quality of this procedure is determined by the substitution matrix and parameters of the insertion-deletion penalty function. These matrices are derived from sequence alignment and thus reflect the evolutionary process. Currently, in addition to evolutionary matrices, a large number of matrices of different backgrounds have been obtained. To make an optimal choice of the substitution matrix and the penalty parameters, we conducted a numerical experiment using a representative sample of existing matrices of various types and origins. Results: We tested the classical evolutionary matrix series (PAM, Blosum, VTML, Pfasum), structural-alignment-based matrices, a contact energy matrix, and a matrix based on the properties of the genetic code. This study presents results for two test set types: first, we simulated sequences that reflect divergent evolution; second, we performed tests on Balibase sequences. In both cases, we obtained the dependences of the alignment quality (Accuracy, Confidence) on the evolutionary distance between sequences and the evolutionary distance to which the substitution matrices correspond. Optimization of a combination of matrices and the penalty parameters was carried out for local and global alignment on the values of penalty function parameters. Consequently, we found that the best alignment quality is achieved with matrices corresponding to the largest evolutionary distance. These matrices prove to be universal, i.e. suitable for aligning sequences separated by both large and small evolutionary distances. We analysed the correspondence of the correlation coefficients of matrices to the alignment quality. It was found that matrices showing high quality alignment have an above average correlation value, but the converse is not true.
Conclusions: This study showed that the best alignment quality is achieved with evolutionary matrices designed for long distances: Gonnet, VTML250, PAM250, MIQS, and Pfasum050. The same property is inherent in matrices not only of evolutionary origin, but also of another background corresponding to a large evolutionary distance. Therefore, matrices based on structural data show alignment quality close enough to its value for evolutionary matrices. This agrees with the idea that the spatial structure is more conservative than the protein sequence.
|
14
|
Asnicar F, Thomas AM, Beghini F, Mengoni C, Manara S, Manghi P, Zhu Q, Bolzan M, Cumbo F, May U, Sanders JG, Zolfo M, Kopylova E, Pasolli E, Knight R, Mirarab S, Huttenhower C, Segata N. Precise phylogenetic analysis of microbial isolates and genomes from metagenomes using PhyloPhlAn 3.0. Nat Commun 2020; 11:2500. [PMID: 32427907 PMCID: PMC7237447 DOI: 10.1038/s41467-020-16366-7] [Citation(s) in RCA: 440] [Impact Index Per Article: 88.0] [Received: 08/17/2019] [Accepted: 04/27/2020] [Indexed: 01/10/2023]
Abstract
Microbial genomes are available at an ever-increasing pace, as cultivation and sequencing become cheaper and obtaining metagenome-assembled genomes (MAGs) becomes more effective. Phylogenetic placement methods to contextualize hundreds of thousands of genomes must thus be efficiently scalable and sensitive from closely related strains to divergent phyla. We present PhyloPhlAn 3.0, an accurate, rapid, and easy-to-use method for large-scale microbial genome characterization and phylogenetic analysis at multiple levels of resolution. PhyloPhlAn 3.0 can assign genomes from isolate sequencing or MAGs to species-level genome bins built from >230,000 publicly available sequences. For individual clades of interest, it reconstructs strain-level phylogenies from among the closest species using clade-specific maximally informative markers. At the other extreme of resolution, it scales to large phylogenies comprising >17,000 microbial species. Examples including Staphylococcus aureus isolates, gut metagenomes, and meta-analyses demonstrate the ability of PhyloPhlAn 3.0 to support genomic and metagenomic analyses.
Affiliation(s)
| | | | | | | | - Serena Manara
- Department CIBIO, University of Trento, Trento, Italy
| | - Paolo Manghi
- Department CIBIO, University of Trento, Trento, Italy
| | - Qiyun Zhu
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
| | - Mattia Bolzan
- Department CIBIO, University of Trento, Trento, Italy
- PreBiomics s.r.l, Trento, Italy
| | - Fabio Cumbo
- Department CIBIO, University of Trento, Trento, Italy
| | - Uyen May
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA, USA
| | - Jon G Sanders
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Cornell Institute for Host-Microbe Interaction and Disease, Cornell University, Ithaca, NY, USA
| | - Moreno Zolfo
- Department CIBIO, University of Trento, Trento, Italy
| | - Evguenia Kopylova
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Clarity Genomics BVBA, Sint-Michielskaai 34, 2000, Antwerpen, Belgium
| | - Edoardo Pasolli
- Department CIBIO, University of Trento, Trento, Italy
- Department of Agricultural Sciences, University of Naples Federico II, Portici, Italy
| | - Rob Knight
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
- Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA
- Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA, USA
| | - Curtis Huttenhower
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Nicola Segata
- Department CIBIO, University of Trento, Trento, Italy.
| |
|
15
|
Crim1 C140S mutant mice reveal the importance of cysteine 140 in the internal region 1 of CRIM1 for its physiological functions. Mamm Genome 2019; 30:329-338. [PMID: 31776724 DOI: 10.1007/s00335-019-09822-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/21/2019] [Accepted: 11/20/2019] [Indexed: 10/25/2022]
Abstract
Cysteine-rich transmembrane bone morphogenetic protein regulator 1 (CRIM1) is a type I transmembrane protein involved in the organogenesis of many tissues via its interactions with growth factors including BMP, TGF-β, and VEGF. In this study, we used whole-exome sequencing and linkage analysis to identify a novel Crim1 mutant allele generated by ENU mutagenesis in mice. This allele is a missense mutation that causes a cysteine-to-serine substitution at position 140, and is referred to as Crim1C140S. In addition to the previously reported phenotypes in Crim1 mutants, Crim1C140S homozygous mice exhibited several novel phenotypes, including dwarfism, enlarged seminal vesicles, and rectal prolapse. In vitro analyses showed that Crim1C140S mutation affected the formation of CRIM1 complexes and decreased the amount of the overexpressed CRIM1 proteins in the cell culture supernatants. Cys140 is located in the internal region 1 (IR1) of the N-terminal extracellular region of CRIM1 and resides outside any identified functional domains. Inference of the domain architecture suggested that the Crim1C140S mutation disturbs an intramolecular disulfide bond in IR1, leading to the protein instability and the functional defects of CRIM1. Crim1C140S highlights the functional importance of the IR1, and Crim1C140S mice should serve as a valuable model for investigating the functions of CRIM1 that are unidentified as yet.
|
16
|
Tomii K, Santos HJ, Nozaki T. Genome-Wide Analysis of Known and Potential Tetraspanins in Entamoeba histolytica. Genes (Basel) 2019; 10:885. [PMID: 31684194 PMCID: PMC6895871 DOI: 10.3390/genes10110885] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 08/29/2019] [Revised: 10/25/2019] [Accepted: 10/31/2019] [Indexed: 12/12/2022]
Abstract
Tetraspanins are membrane proteins involved in intra- and/or intercellular signaling, and membrane protein complex formation. In some organisms, their role is associated with virulence and pathogenesis. Here, we investigate known and potential tetraspanins in the human intestinal protozoan parasite Entamoeba histolytica. We conducted sequence similarity searches against the proteome data of E. histolytica and newly identified nine uncharacterized proteins as potential tetraspanins in E. histolytica. We found three subgroups within known and potential tetraspanins, as well as subgroup-associated features in both their amino acid and nucleotide sequences. We also examined the subcellular localization of a few representative tetraspanins that might be potentially related to pathogenicity. The results in this study could be useful resources for further understanding and downstream analyses of tetraspanins in Entamoeba.
Affiliation(s)
- Kentaro Tomii
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo 135-0064, Japan.
| | - Herbert J Santos
- Department of Biomedical Chemistry, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan.
| | - Tomoyoshi Nozaki
- Department of Biomedical Chemistry, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan.
| |
|
17
|
Actin Cytoskeletal Reorganization Function of JRAB/MICAL-L2 Is Fine-tuned by Intramolecular Interaction between First LIM Zinc Finger and C-terminal Coiled-coil Domains. Sci Rep 2019; 9:12794. [PMID: 31488862 PMCID: PMC6728388 DOI: 10.1038/s41598-019-49232-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/30/2019] [Accepted: 08/21/2019] [Indexed: 01/01/2023]
Abstract
JRAB/MICAL-L2 is an effector protein of Rab13, a member of the Rab family of small GTPase. JRAB/MICAL-L2 consists of a calponin homology domain, a LIM domain, and a coiled-coil domain. JRAB/MICAL-L2 engages in intramolecular interaction between the N-terminal LIM domain and the C-terminal coiled-coil domain, and changes its conformation from closed to open under the effect of Rab13. Open-form JRAB/MICAL-L2 induces the formation of peripheral ruffles via an interaction between its calponin homology domain and filamin. Here, we report that the LIM domain, independent of the C-terminus, is also necessary for the function of open-form JRAB/MICAL-L2. In mechanistic terms, two zinc finger domains within the LIM domain bind the first and second molecules of actin at the minus end, potentially inhibiting the depolymerization of actin filaments (F-actin). The first zinc finger domain also contributes to the intramolecular interaction of JRAB/MICAL-L2. Moreover, the residues of the first zinc finger domain that are responsible for the intramolecular interaction are also involved in the association with F-actin. Together, our findings show that the function of open-form JRAB/MICAL-L2 mediated by the LIM domain is fine-tuned by the intramolecular interaction between the first zinc finger domain and the C-terminal domain.
|
18
|
Yamada KD, Kinoshita K. De novo profile generation based on sequence context specificity with the long short-term memory network. BMC Bioinformatics 2018; 19:272. [PMID: 30021530 PMCID: PMC6052547 DOI: 10.1186/s12859-018-2284-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 05/17/2018] [Accepted: 07/11/2018] [Indexed: 11/24/2022]
Abstract
Background: Long short-term memory (LSTM) is one of the most attractive deep learning methods to learn time series or contexts of input data. An increasing number of studies, including biological sequence analyses in bioinformatics, utilize this architecture. Amino acid sequence profiles are widely used for bioinformatics studies, such as sequence similarity searches, multiple alignments, and evolutionary analyses. Currently, many biological sequences are becoming available, and the rapidly increasing amount of sequence data emphasizes the importance of scalable generators of amino acid sequence profiles. Results: We employed the LSTM network and developed a novel profile generator to construct profiles without any assumptions, except for input sequence context. Our method could generate better profiles than existing de novo profile generators, including CSBuild and RPS-BLAST, on the basis of profile-sequence similarity search performance with linear calculation costs against input sequence size. In addition, we analyzed the effects of the memory power of LSTM and found that LSTM had high potential power to detect long-range interactions between amino acids, as in the case of beta-strand formation, which has been a difficult problem in protein bioinformatics using sequence information. Conclusion: We demonstrated the importance of sequence context and the feasibility of LSTM on biological sequence analyses. Our results demonstrated the effectiveness of memories in LSTM and showed that our de novo profile generator, SPBuild, achieved higher performance than that of existing methods for profile prediction of beta-strands, where long-range interactions of amino acids are important and are known to be difficult for the existing window-based prediction methods. Our findings will be useful for the development of other prediction methods related to biological sequences by machine learning methods.
Electronic supplementary material: The online version of this article (10.1186/s12859-018-2284-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Kazunori D Yamada
- Graduate School of Information Sciences, Tohoku University, Sendai, Japan
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
| | - Kengo Kinoshita
- Graduate School of Information Sciences, Tohoku University, Sendai, Japan
- Tohoku Medical Megabank Organization, Tohoku University, Sendai, Japan
- Institute of Development, Aging, and Cancer, Tohoku University, Sendai, Japan
| |
|
19
|
Yamada KD. Derivative-free neural network for optimizing the scoring functions associated with dynamic programming of pairwise-profile alignment. Algorithms Mol Biol 2018; 13:5. [PMID: 29467815 PMCID: PMC5815186 DOI: 10.1186/s13015-018-0123-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 08/11/2017] [Accepted: 02/06/2018] [Indexed: 11/10/2022]
Abstract
BACKGROUND: A profile-comparison method with position-specific scoring matrix (PSSM) is among the most accurate alignment methods. Currently, cosine similarity and correlation coefficients are used as scoring functions of dynamic programming to calculate similarity between PSSMs. However, it is unclear whether these functions are optimal for profile alignment methods. By definition, these functions cannot capture nonlinear relationships between profiles. Therefore, we attempted to discover a novel scoring function, which was more suitable for the profile-comparison method than existing functions, using neural networks. RESULTS: Although neural networks required derivative-of-cost functions, the problem being addressed in this study lacked them. Therefore, we implemented a novel derivative-free neural network by combining a conventional neural network with an evolutionary strategy optimization method used as a solver. Using this novel neural network system, we optimized the scoring function to align remote sequence pairs. Our results showed that the pairwise-profile aligner using the novel scoring function significantly improved both alignment sensitivity and precision relative to aligners using existing functions. CONCLUSIONS: We developed and implemented a novel derivative-free neural network and aligner (Nepal) for optimizing sequence alignments. Nepal improved alignment quality by adapting to remote sequence alignments and increasing the expressiveness of similarity scores. Additionally, this novel scoring function can be realized using a simple matrix operation and easily incorporated into other aligners. Moreover, our scoring function could potentially improve the performance of homology detection and/or multiple-sequence alignment of remote homologous sequences. The goal of the study was to provide a novel scoring function for profile alignment methods and develop a novel learning system capable of addressing derivative-free problems.
Our system is capable of optimizing the performance of other sophisticated methods and solving problems without derivative-of-cost functions, which do not always exist in practical problems. Our results demonstrated the usefulness of this optimization method for derivative-free problems.
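The two baseline scoring functions named in this abstract, cosine similarity and the correlation coefficient, can be sketched for a single pair of profile columns as follows. This is a minimal illustration only: the function names are hypothetical, each column is assumed to be a plain list of per-residue scores, and nothing here is taken from the Nepal implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two profile columns (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an all-zero column is treated as carrying no signal.
        return 0.0
    return dot / (norm_u * norm_v)


def pearson_correlation(u, v):
    """Pearson correlation, computed as the cosine of the mean-centred columns."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centred_u = [a - mean_u for a in u]
    centred_v = [b - mean_v for b in v]
    return cosine_similarity(centred_u, centred_v)
```

Both functions are linear in the column entries up to normalization, which is consistent with the abstract's remark that they cannot capture nonlinear relationships between profiles.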
|
20
|
Nojoomi S, Koehl P. A weighted string kernel for protein fold recognition. BMC Bioinformatics 2017; 18:378. [PMID: 28841820 PMCID: PMC5574112 DOI: 10.1186/s12859-017-1795-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/07/2017] [Accepted: 08/15/2017] [Indexed: 11/10/2022]
Abstract
Background: Alignment-free methods for comparing protein sequences have proved to be viable alternatives to approaches that first rely on an alignment of the sequences to be compared. Much work, however, needs to be done before those methods provide reliable fold recognition for proteins whose sequences share little similarity. We have recently proposed an alignment-free method based on the concept of string kernels, SeqKernel (Nojoomi and Koehl, BMC Bioinformatics, 2017, 18:137). In this previous study, we have shown that while SeqKernel performs better than standard alignment-based methods, its applications are potentially limited, because of biases due mostly to sequence length effects. Methods: In this study, we propose improvements to SeqKernel that follow two directions. First, we developed a weighted version of the kernel, WSeqKernel. Second, we expand the concept of string kernels into a novel framework for deriving information on amino acids from protein sequences. Results: Using a dataset that only contains remote homologs, we have shown that WSeqKernel performs remarkably well in fold recognition experiments. We have shown that with the appropriate weighting scheme, we can remove the length effects on the kernel values. WSeqKernel, just like any alignment-based sequence comparison method, depends on a substitution matrix. We have shown that this matrix can be optimized so that sequence similarity scores correlate well with structure similarity scores. Starting from no information on amino acid similarity, we have shown that we can derive a scoring matrix that echoes the physico-chemical properties of amino acids. Conclusion: We have made progress in characterizing and parametrizing string kernels as alignment-based methods for comparing protein sequences, and we have shown that they provide a framework for extracting sequence information from structure.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1795-5) contains supplementary material, which is available to authorized users.
|
21
|
Barlowe S, Coan HB, Youker RT. SubVis: an interactive R package for exploring the effects of multiple substitution matrices on pairwise sequence alignment. PeerJ 2017; 5:e3492. [PMID: 28674656 PMCID: PMC5490468 DOI: 10.7717/peerj.3492] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/16/2017] [Accepted: 05/27/2017] [Indexed: 01/13/2023]
Abstract
Understanding how proteins mutate is critical to solving a host of biological problems. Mutations occur when an amino acid is substituted for another in a protein sequence. The set of likelihoods for amino acid substitutions is stored in a matrix and input to alignment algorithms. The quality of the resulting alignment is used to assess the similarity of two or more sequences and can vary according to assumptions modeled by the substitution matrix. Substitution strategies with minor parameter variations are often grouped together in families. For example, the BLOSUM and PAM matrix families are commonly used because they provide a standard, predefined way of modeling substitutions. However, researchers often do not know if a given matrix family or any individual matrix within a family is the most suitable. Furthermore, predefined matrix families may inaccurately reflect a particular hypothesis that a researcher wishes to model or otherwise result in unsatisfactory alignments. In these cases, the ability to compare the effects of one or more custom matrices may be needed. This laborious process is often performed manually because the ability to simultaneously load multiple matrices and then compare their effects on alignments is not readily available in current software tools. This paper presents SubVis, an interactive R package for loading and applying multiple substitution matrices to pairwise alignments. Users can simultaneously explore alignments resulting from multiple predefined and custom substitution matrices. SubVis utilizes several of the alignment functions found in R, a common language among protein scientists. Functions are tied together with the Shiny platform which allows the modification of input parameters. Information regarding alignment quality and individual amino acid substitutions is displayed with the JavaScript language which provides interactive visualizations for revealing both high-level and low-level alignment information.
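To make concrete how the choice of substitution matrix changes the score of one and the same pairwise alignment, the sketch below scores a fixed gapped alignment under two toy matrices. It is written in Python rather than R and does not use or reproduce SubVis; the function names, the toy match/mismatch matrices, and the linear gap penalty are all assumptions made for illustration.

```python
def toy_matrix(match, mismatch, alphabet):
    """Build a hypothetical match/mismatch substitution matrix as a dict."""
    return {(x, y): (match if x == y else mismatch)
            for x in alphabet for y in alphabet}


def score_alignment(row_a, row_b, matrix, gap_penalty):
    """Sum substitution scores column by column; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap_penalty  # assumption: simple linear gap cost
        else:
            score += matrix[(x, y)]
    return score


# Two toy matrices standing in for different members of a matrix family.
lenient = toy_matrix(match=2, mismatch=-1, alphabet="ACDE")
strict = toy_matrix(match=1, mismatch=-3, alphabet="ACDE")

for name, matrix in (("lenient", lenient), ("strict", strict)):
    print(name, score_alignment("ACDE", "ACEE", matrix, gap_penalty=-2))
```

Under the lenient matrix the single mismatch costs little, while under the strict one it cancels the three matches, which is the kind of difference a side-by-side comparison of matrices is meant to expose.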
Affiliation(s)
- Scott Barlowe
- Department of Mathematics and Computer Science, Western Carolina University, Cullowhee, NC, United States of America
| | - Heather B Coan
- Department of Biology, Western Carolina University, Cullowhee, NC, United States of America
| | - Robert T Youker
- Department of Biology, Western Carolina University, Cullowhee, NC, United States of America
| |
|
22
|
Oda T, Lim K, Tomii K. Simple adjustment of the sequence weight algorithm remarkably enhances PSI-BLAST performance. BMC Bioinformatics 2017; 18:288. [PMID: 28578660 PMCID: PMC5455086 DOI: 10.1186/s12859-017-1686-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 02/01/2017] [Accepted: 05/15/2017] [Indexed: 11/13/2022]
Abstract
Background: PSI-BLAST, an extremely popular tool for sequence similarity search, features the utilization of Position-Specific Scoring Matrix (PSSM) constructed from a multiple sequence alignment (MSA). PSSM allows the detection of more distant homologs than a general amino acid substitution matrix does. An accurate estimation of the weights for sequences in an MSA is crucially important for PSSM construction. PSI-BLAST divides a given MSA into multiple blocks, for which sequence weights are calculated. When the block width becomes very narrow, the sequence weight calculation can be odd. Results: We demonstrate that PSI-BLAST indeed generates a significant fraction of blocks having width less than 5, thereby degrading the PSI-BLAST performance. We revised the code of PSI-BLAST to prevent the blocks from being narrower than a given minimum block width (MBW). We designate the modified application of PSI-BLAST as PSI-BLASTexB. When MBW is 25, PSI-BLASTexB notably outperforms PSI-BLAST consistently for three independent benchmark sets. The performance boost is even more drastic when an MSA, instead of a sequence, is used as a query. Conclusions: Our results demonstrate that the generation of narrow-width blocks during the sequence weight calculation is a critically important factor that restricts the PSI-BLAST search performance. By preventing narrow blocks, PSI-BLASTexB upgrades the PSI-BLAST performance remarkably. Binaries and source codes of PSI-BLASTexB (MBW = 25) are available at https://github.com/kyungtaekLIM/PSI-BLASTexB. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1686-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Toshiyuki Oda
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan.
| | - Kyungtaek Lim
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
| | - Kentaro Tomii
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Biotechnology Research Institute for Drug Discovery, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
| |
|
23
|
Lim K, Yamada KD, Frith MC, Tomii K. Protein sequence-similarity search acceleration using a heuristic algorithm with a sensitive matrix. J Struct Funct Genomics 2017; 17:147-154. [PMID: 28083762 PMCID: PMC5274646 DOI: 10.1007/s10969-016-9210-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/31/2015] [Accepted: 12/05/2016] [Indexed: 12/28/2022]
Abstract
Protein database search for public databases is a fundamental step in the target selection of proteins in structural and functional genomics and also for inferring protein structure, function, and evolution. Most database search methods employ amino acid substitution matrices to score amino acid pairs. The choice of substitution matrix strongly affects homology detection performance. We earlier proposed a substitution matrix named MIQS that was optimized for distant protein homology search. Herein we further evaluate MIQS in combination with LAST, a heuristic and fast database search tool with a tunable sensitivity parameter m, where larger m denotes higher sensitivity. Results show that MIQS substantially improves the homology detection and alignment quality performance of LAST across diverse m parameters. Against a protein database consisting of approximately 15 million sequences, LAST with m = 10^5 achieves better homology detection performance than BLASTP, and completes the search 20 times faster. Compared to the most sensitive existing methods being used today, CS-BLAST and SSEARCH, LAST with MIQS and m = 10^6 shows comparable homology detection performance at 2.0 and 3.9 times greater speed, respectively. Results demonstrate that MIQS-powered LAST is a time-efficient method for sensitive and accurate homology search.
Affiliation(s)
- Kyungtaek Lim
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
| | - Kazunori D Yamada
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Graduate School of Information Sciences, Tohoku University, 6-3-9 Aramaki-Aza-Aoba, Aoba-ku, Sendai, 980-8579, Japan
| | - Martin C Frith
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan
- Department of Computational Biology and Medical Sciences, University of Tokyo, 5-1-5 Kashiwa-no-ha, Kashiwa, Chiba, 227-8561, Japan
| | - Kentaro Tomii
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan.
- Biotechnology Research Institute for Drug Discovery, National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7 Aomi, Koto-ku, Tokyo, 135-0064, Japan.
| |
|
24
|
Deorowicz S, Debudaj-Grabysz A, Gudyś A. FAMSA: Fast and accurate multiple sequence alignment of huge protein families. Sci Rep 2016; 6:33964. [PMID: 27670777 PMCID: PMC5037421 DOI: 10.1038/srep33964] [Citation(s) in RCA: 93] [Impact Index Per Article: 10.3] [Received: 04/05/2016] [Accepted: 08/31/2016] [Indexed: 11/10/2022]
Abstract
Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein families databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilization of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. What matters is that its implementation is highly optimized and parallelized to make the most of modern computer platforms. Thanks to the above, quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT for datasets exceeding a few thousand sequences. Quality does not compromise on time or memory requirements, which are an order of magnitude lower than those in the existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.
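The longest common subsequence (LCS) measure mentioned in this abstract can be illustrated with the textbook dynamic-programming recurrence. This is a plain sketch for clarity, not FAMSA's optimized and parallelized implementation, and the function name is an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b.

    Classic O(len(a) * len(b)) dynamic programme that keeps only the
    previous row of the table in memory.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common subsequence ending before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop one character from either sequence, keep the better result.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

A longer LCS relative to the sequence lengths indicates a more similar pair; how that length is turned into a pairwise similarity or distance is a design choice that this sketch leaves open.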
Affiliation(s)
- Sebastian Deorowicz
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
| | | | - Adam Gudyś
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
| |
|
25
|
Leelananda SP, Kloczkowski A, Jernigan RL. Fold-specific sequence scoring improves protein sequence matching. BMC Bioinformatics 2016; 17:328. [PMID: 27578239 PMCID: PMC5006591 DOI: 10.1186/s12859-016-1198-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/21/2016] [Accepted: 08/24/2016] [Indexed: 11/10/2022]
Abstract
Background: Sequence matching is extremely important for applications throughout biology, particularly for discovering information such as functional and evolutionary relationships, and also for discriminating between unimportant and disease mutants. At present the functions of a large fraction of genes are unknown; improvements in sequence matching will improve gene annotations. Universal amino acid substitution matrices such as Blosum62 are used to measure sequence similarities and to identify distant homologues, regardless of the structure class. However, such single matrices do not take into account important structural information evident within the different topologies of proteins and treat substitutions within all protein folds identically. Others have suggested that the use of structural information can lead to significant improvements in sequence matching, but this has not yet been very effective. Here we develop novel substitution matrices that include not only general sequence information but also a topology-specific component that is unique for each CATH topology. This novel feature of using a combination of sequence and structure information for each protein topology significantly improves the sequence matching scores for the sequence pairs tested. We have used a novel multi-structure alignment method for each homology level of CATH in order to extract topological information. Results: We obtain statistically significant improved sequence matching scores for 73 % of the alpha helical test cases. On average, 61 % of the test cases showed improvements in homology detection when structure information was incorporated into the substitution matrices. On average, z-scores for homology detection are improved by more than 54 % for all cases, and some individual cases have z-scores more than twice those obtained using generic matrices. Our topology-specific similarity matrices also outperform other traditional similarity matrices and single-matrix-based structure methods. When the default amino acid substitution matrix in the Psi-blast algorithm is replaced by our structure-based matrices, the structure matching is significantly improved over conventional Psi-blast. It also outperforms results obtained for the corresponding HMM profiles generated for each topology. Conclusions: We show that by incorporating topology-specific structure information in addition to sequence information into specific amino acid substitution matrices, the sequence matching scores and homology detection are significantly improved. Our topology-specific similarity matrices outperform other traditional similarity matrices and single-matrix-based structure methods, and also show improvement over conventional Psi-blast and HMM-profile-based methods in sequence matching. The results support the discriminatory ability of the new amino acid similarity matrices to distinguish between distant homologs and structurally dissimilar pairs. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1198-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Sumudu P Leelananda
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA; Laurence H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA; Present Address: 2120 Newman and Wolfrom Laboratory, The Ohio State University, 100 W 18th Ave, Columbus, OH, 43210, USA; Present Address: Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, OH, 43205, USA
| | - Andrzej Kloczkowski
- Present Address: Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, OH, 43205, USA; Present Address: Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, 43205, USA
| | - Robert L Jernigan
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA; Laurence H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA
| |
|
26
|
Systematic Exploration of an Efficient Amino Acid Substitution Matrix: MIQS. Methods Mol Biol 2016. [PMID: 27115635 DOI: 10.1007/978-1-4939-3572-7_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
Amino acid sequence comparisons to find similarities between proteins are fundamental sequence information analyses for inferring protein structure and function. In this study, we improve amino acid substitution matrices to identify distantly related proteins. We systematically sampled and benchmarked substitution matrices generated from the principal component analysis (PCA) subspace based on a set of typical existing matrices. Based on the benchmark results, we identified a region of highly sensitive matrices in the PCA subspace using kernel density estimation (KDE). Using the PCA subspace, we were able to deduce a novel sensitive matrix, called MIQS, which detects distantly related proteins better than existing matrices. This approach to deriving an efficient amino acid substitution matrix might influence many fields of protein sequence analysis. MIQS is available at http://csas.cbrc.jp/Ssearch/.
|
27
|
Katoh K, Standley DM. A simple method to control over-alignment in the MAFFT multiple sequence alignment program. Bioinformatics 2016; 32:1933-42. [PMID: 27153688 PMCID: PMC4920119 DOI: 10.1093/bioinformatics/btw108] [Citation(s) in RCA: 360] [Impact Index Per Article: 40.0] [Received: 10/05/2015] [Accepted: 02/19/2016] [Indexed: 12/17/2022]
Abstract
Motivation: We present a new feature of the MAFFT multiple alignment program for suppressing over-alignment (aligning unrelated segments). Conventional MAFFT is highly sensitive in aligning conserved regions in remote homologs, but the risk of over-alignment has recently been growing, as low-quality or noisy sequences are increasing in protein sequence databases, due, for example, to sequencing errors and difficulty in gene prediction. Results: The proposed method utilizes a variable scoring matrix for different pairs of sequences (or groups) in a single multiple sequence alignment, based on the global similarity of each pair. This method significantly increases the number of correctly gapped sites in real examples and in simulations under various conditions. Regarding sensitivity, the effect of the proposed method is slightly negative in real protein-based benchmarks, and mostly neutral in simulation-based benchmarks. This approach is based on natural biological reasoning and should be compatible with many methods based on dynamic programming for multiple sequence alignment. Availability and implementation: The new feature is available in MAFFT versions 7.263 and higher. http://mafft.cbrc.jp/alignment/software/ Contact: katoh@ifrec.osaka-u.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kazutaka Katoh
- Immunology Frontier Research Center, Osaka University, Suita 565-0871, Japan
| | - Daron M Standley
- Immunology Frontier Research Center, Osaka University, Suita 565-0871, Japan; Institute for Virus Research, Kyoto University, Kyoto 606-8507, Japan
| |
|
28
|
Oh Brother, Where Art Thou? Finding Orthologs in the Twilight and Midnight Zones of Sequence Similarity. Evol Biol 2016. [DOI: 10.1007/978-3-319-41324-2_22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
29
|
Sheetlin S, Park Y, Frith MC, Spouge JL. ALP & FALP: C++ libraries for pairwise local alignment E-values. Bioinformatics 2015; 32:304-5. [PMID: 26428291 DOI: 10.1093/bioinformatics/btv575] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 08/21/2015] [Accepted: 09/28/2015] [Indexed: 11/13/2022]
Abstract
Motivation: Pairwise local alignment is an indispensable tool for molecular biologists. In real time (i.e. in about 1 s), ALP (Ascending Ladder Program) calculates the E-values for protein-protein or DNA-DNA local alignments of random sequences, for an arbitrary substitution score matrix, gap costs and letter abundances; and FALP (Frameshift Ascending Ladder Program) performs a similar task, although more slowly, for frameshifting DNA-protein alignments. Availability and implementation: To permit other C++ programmers to implement the computational efficiencies in ALP and FALP directly within their own programs, C++ source code is available in the public domain at http://go.usa.gov/3GTSW under 'ALP' and 'FALP', along with the standalone programs ALP and FALP. Contact: spouge@nih.gov Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sergey Sheetlin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD 20894, USA
| | - Yonil Park
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD 20894, USA
| | - Martin C Frith
- Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan
| | - John L Spouge
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, MD 20894, USA
| |
|
30
|
Ndhlovu A, Hazelhurst S, Durand PM. Robust sequence alignment using evolutionary rates coupled with an amino acid substitution matrix. BMC Bioinformatics 2015; 16:255. [PMID: 26269100 PMCID: PMC4535666 DOI: 10.1186/s12859-015-0688-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2015] [Accepted: 07/29/2015] [Indexed: 11/27/2022]
Abstract
Background: Selective pressures at the DNA level shape genes into profiles consisting of patterns of rapidly evolving sites and sites withstanding change. These profiles remain detectable even when protein sequences become extensively diverged. A common task in molecular biology is to infer functional, structural or evolutionary relationships by querying a database using an algorithm. However, problems arise when sequence similarity is low. This study presents an algorithm that uses the evolutionary rate at codon sites, the dN/dS (ω) parameter, coupled to a substitution matrix as an alignment metric for detecting distantly related proteins. The algorithm, called BLOSUM-FIRE, couples a newer and improved version of the original FIRE (Functional Inference using Rates of Evolution) algorithm with an amino acid substitution matrix in a dynamic scoring function. The enigmatic hepatitis B virus X protein was used as a test case for BLOSUM-FIRE and its associated database EvoDB. Results: The evolutionary-rate-based approach was coupled with a conventional BLOSUM substitution matrix. The two approaches are combined in a dynamic scoring function, which uses the selective pressure to score aligned residues. The dynamic scoring function is based on a coupled additive approach that scores aligned sites based on the level of conservation inferred from the ω values. Evaluation of the accuracy of this new implementation, BLOSUM-FIRE, using MAFFT alignments as reference alignments has shown that it is more accurate than its predecessor FIRE. Comparison of the alignment quality with widely used algorithms (MUSCLE, T-COFFEE, and CLUSTAL Omega) revealed that the BLOSUM-FIRE algorithm performs as well as conventional algorithms. Its main strength is that it provides greater potential for aligning divergent sequences and addresses the problem of low specificity inherent in the original FIRE algorithm. The utility of this algorithm is demonstrated using the Hepatitis B virus X (HBx) protein, a protein of unknown function, as a test case. Conclusion: This study describes the utility of an evolutionary-rate-based approach coupled to the BLOSUM62 amino acid substitution matrix in inferring protein domain function. We demonstrate that such an approach is robust and performs as well as an array of conventional algorithms.
Affiliation(s)
- Andrew Ndhlovu
- Evolutionary Medicine Laboratory, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; Sydney Brenner Institute of Molecular Bioscience, University of the Witwatersrand, Johannesburg, South Africa
| | - Scott Hazelhurst
- School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, South Africa; Sydney Brenner Institute of Molecular Bioscience, University of the Witwatersrand, Johannesburg, South Africa
| | - Pierre M Durand
- Evolutionary Medicine Laboratory, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa; Sydney Brenner Institute of Molecular Bioscience, University of the Witwatersrand, Johannesburg, South Africa; Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ, 85721, USA; Department of Biodiversity and Conservation Biology, Faculty of Natural Sciences, University of the Western Cape, Private Bag X17, Belville, Cape Town, 7530, South Africa
| |
|
31
|
Izidoro SC, de Melo-Minardi RC, Pappa GL. GASS: identifying enzyme active sites with genetic algorithms. Bioinformatics 2014; 31:864-70. [PMID: 25388152 DOI: 10.1093/bioinformatics/btu746] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Indexed: 01/31/2023]
Abstract
Motivation: Currently, 25% of proteins annotated in Pfam have unknown function. One way of predicting protein function is by looking at the active site, which has two main parts: the catalytic site and the substrate binding site. The active site is more conserved than the other residues of the protein and can be a rich source of information for protein function prediction. This article presents a new heuristic method, named genetic active site search (GASS), which searches for given active site 3D templates in unknown proteins. The method can perform non-exact amino acid matches (conservative mutations), is able to find amino acids in different chains and does not impose any restrictions on the active site size. Results: GASS results were compared with those catalogued in the catalytic site atlas (CSA) in four different datasets and compared with two other methods: amino acid pattern search for substructures and motif and catalytic site identification. The results show GASS can correctly identify >90% of the templates searched. Experiments were also run using data from the substrate binding sites prediction competition CASP 10, and GASS is ranked fourth among the 18 methods considered.
Affiliation(s)
- Sandro C Izidoro
- Advanced Campus at Itabira, Universidade Federal de Itajubá, Itajubá, MG 35903-087, Brazil and Department of Computer Science and Department of Biochemistry and Immunology, Universidade Federal de Minas Gerais, Belo Horizonte, MG 31270-901, Brazil
| | - Raquel C de Melo-Minardi
- Advanced Campus at Itabira, Universidade Federal de Itajubá, Itajubá, MG 35903-087, Brazil and Department of Computer Science and Department of Biochemistry and Immunology, Universidade Federal de Minas Gerais, Belo Horizonte, MG 31270-901, Brazil
| | - Gisele L Pappa
- Advanced Campus at Itabira, Universidade Federal de Itajubá, Itajubá, MG 35903-087, Brazil and Department of Computer Science and Department of Biochemistry and Immunology, Universidade Federal de Minas Gerais, Belo Horizonte, MG 31270-901, Brazil
| |
|
32
|
Wong PS, Tanaka M, Sunaga Y, Tanaka M, Taniguchi T, Yoshino T, Tanaka T, Fujibuchi W, Aburatani S. Tracking difference in gene expression in a time-course experiment using gene set enrichment analysis. PLoS One 2014; 9:e107629. [PMID: 25268590 PMCID: PMC4182424 DOI: 10.1371/journal.pone.0107629] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 11/07/2013] [Accepted: 08/21/2014] [Indexed: 11/19/2022]
Abstract
Fistulifera sp. strain JPCC DA0580 is a newly sequenced pennate diatom that is capable of simultaneously growing and accumulating lipids. This is a unique trait, not found in other related microalgae so far. It is able to accumulate between 40 and 60% of its cell weight in lipids, making it a strong candidate for the production of biofuel. To investigate this characteristic, we used RNA-Seq data gathered at four different times while Fistulifera sp. strain JPCC DA0580 was grown under oil-accumulating and non-oil-accumulating conditions. We then adapted gene set enrichment analysis (GSEA) to investigate the relationship between the difference in gene expression of 7,822 genes and metabolic functions in our data. We utilized information in the KEGG pathway database to create the gene sets and changed GSEA to use re-sampling so that data from the different time points could be included in the analysis. Our GSEA method identified photosynthesis, lipid synthesis and amino acid synthesis related pathways as processes that play a significant role in oil production and growth in Fistulifera sp. strain JPCC DA0580. In addition to GSEA, we visualized the results by creating a network of compounds and reactions, and plotted the expression data on top of the network. This made existing graph algorithms available to us, which we then used to calculate a path that metabolizes glucose into triacylglycerol (TAG) in the smallest number of steps. By visualizing the data this way, we observed a separate up-regulation of genes at different times instead of a concerted response. We also identified two metabolic paths that used fewer reactions than the one shown in KEGG and showed that the reactions were up-regulated during the experiment. The combination of analysis and visualization methods successfully analyzed time-course data, identified important metabolic pathways and provided new hypotheses for further research.
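The abstract describes using graph algorithms on a compound-reaction network to find a path from glucose to TAG in the smallest number of steps. As a hedged sketch of that idea, the Python example below runs a breadth-first search on a small directed graph. The network, the intermediate node names (`c1` to `c5`) and the function name are hypothetical placeholders; they are not taken from KEGG or from the study, and the authors' actual algorithm is not stated in the abstract.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Return a path with the fewest edges from source to target, or None.

    graph maps each node to a list of successor nodes (directed edges,
    one edge standing in for one reaction step).
    """
    if source == target:
        return [source]
    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt in parent:
                continue
            parent[nxt] = node
            if nxt == target:
                # Walk the parent links back to the source.
                path = [nxt]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


# Hypothetical toy network: two routes from glucose to TAG of different length.
toy_network = {
    "glucose": ["c1", "c2"],
    "c1": ["c3"],
    "c2": ["c3", "c4"],
    "c3": ["TAG"],
    "c4": ["c5"],
    "c5": ["TAG"],
}

route = shortest_path(toy_network, "glucose", "TAG")
```

Breadth-first search suffices here because every step counts equally; if reactions carried different costs, a weighted method such as Dijkstra's algorithm would be the natural substitute.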
Affiliation(s)
- Pui Shan Wong
- CBRC, National Institute of AIST, Tokyo, Japan
| | - Michihiro Tanaka
- Center for iPS Research and Application, Kyoto University, Kyoto, Japan
| | - Yoshihiko Sunaga
- Institute of Engineering, Tokyo University of Agriculture and Technology, Tokyo, Japan
- JST, CREST, Sanbancho 5, Chiyoda-ku, Tokyo, Japan
| | - Masayoshi Tanaka
- Institute of Engineering, Tokyo University of Agriculture and Technology, Tokyo, Japan
| | | | - Tomoko Yoshino
- Institute of Engineering, Tokyo University of Agriculture and Technology, Tokyo, Japan
- JST, CREST, Sanbancho 5, Chiyoda-ku, Tokyo, Japan
| | - Tsuyoshi Tanaka
- Institute of Engineering, Tokyo University of Agriculture and Technology, Tokyo, Japan
- JST, CREST, Sanbancho 5, Chiyoda-ku, Tokyo, Japan
| | - Wataru Fujibuchi
- CBRC, National Institute of AIST, Tokyo, Japan
- Center for iPS Research and Application, Kyoto University, Kyoto, Japan
| | | |
|