1
Rahman ASMZ, Timmerman L, Gallardo F, Cardona ST. Identification of putative essential protein domains from high-density transposon insertion sequencing. Sci Rep 2022; 12:962. [PMID: 35046497 PMCID: PMC8770471 DOI: 10.1038/s41598-022-05028-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/03/2021] [Accepted: 12/29/2021] [Indexed: 12/24/2022] Open
Abstract
A first clue to gene function can be obtained by examining whether a gene is required for life in certain standard conditions, that is, whether a gene is essential. In bacteria, essential genes are usually identified by high-density transposon mutagenesis followed by sequencing of insertion sites (Tn-seq). These studies assign the term "essential" to whole genes rather than the protein domain sequences that encode the essential functions. However, genes can code for multiple protein domains that evolve their functions independently. Therefore, when essential genes code for more than one protein domain, only one of them could be essential. In this study, we defined this subset of genes as "essential domain-containing" (EDC) genes. Using a Tn-seq data set built in Burkholderia cenocepacia K56-2, we developed an in silico pipeline to identify EDC genes and the essential protein domains they encode. We found forty candidate EDC genes and demonstrated growth defect phenotypes using CRISPR interference (CRISPRi). This analysis included two knockdowns of genes encoding the protein domains of unknown function DUF2213 and DUF4148. These putative essential domains are conserved in more than two hundred bacterial species, including human and plant pathogens. Together, our study suggests that essentiality should be assigned to individual protein domains rather than genes, contributing to a first functional characterization of protein domains of unknown function.
Affiliation(s)
- Lukas Timmerman
- Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada
- Flyn Gallardo
- Department of Microbiology, University of Manitoba, Winnipeg, MB, Canada
- Silvia T Cardona
- Department of Microbiology, University of Manitoba, Winnipeg, MB, Canada
- Department of Medical Microbiology and Infectious Diseases, University of Manitoba, Winnipeg, Canada
2
Anand P, Pandey JP, Pandey DM. Study on cocoonase, sericin, and degumming of silk cocoon: computational and experimental. J Genet Eng Biotechnol 2021; 19:32. [PMID: 33594479 PMCID: PMC7886927 DOI: 10.1186/s43141-021-00125-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 09/16/2019] [Accepted: 01/25/2021] [Indexed: 02/07/2023]
Abstract
Background Cocoonase is a proteolytic enzyme that dissolves the silk cocoon shell and helps the silk moth exit. Chemical degumming of the silk cocoon shell with anhydrous Na2CO3, Marseille soap, soda, ethylene diamine, or tartaric acid has long been in practice. During this process, the solubility of sericin protein increases, resulting in the release of sericin from the fibroin protein of the silk. However, this process diminishes the natural color and softness of the silk. Cocoonase digests the sericin protein of silk at the anterior portion of the cocoon without disturbing the silk fibroin. However, no thorough characterization of cocoonase and sericin protein, nor imaging analysis of chemical- and enzyme-treated silk sheets, has been carried out so far. Therefore, the present study aimed at detailed characterization of cocoonase and sericin proteins, including phylogenetic analysis, secondary and tertiary structure prediction, computational validation, and their interaction with other proteins. Further, the identification of the tasar silkworm (Antheraea mylitta) pupa stage for cocoonase collection, cocoonase purification and its effect on silk sheet degumming, scanning electron microscope (SEM)-based comparison of chemical- and enzyme-treated cocoon sheets, and optical coherence tomography (OCT)-based imaging analysis were investigated. Various computational tools, including Molecular Evolutionary Genetics Analysis (MEGA) X, Figtree, Iterative Threading Assembly Refinement (I-TASSER), the self-optimized prediction method with alignment (SOPMA), PROCHECK, University of California, San Francisco (UCSF) Chimera, and Search Tool for the Retrieval of Interacting Genes/Proteins (STRING), were used for characterization of cocoonase and sericin proteins.
Sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE), protein purification using a Sephadex G-25 column, degumming of cocoon sheets using cocoonase enzyme and chemical Na2CO3, and SEM and OCT analysis of degummed cocoon sheets were performed. Results Predicted normalized B-factors of cocoonase and sericin with respect to α and β regions showed that these regions are structurally more stable in cocoonase and less stable in sericin. Conserved domain analysis revealed that B. mori cocoonase contains a trypsin-like serine protease domain, with the active site spanning query positions 45 to 180 and the substrate binding site spanning positions 175 to 200. SDS-PAGE analysis of cocoonase indicated a molecular weight of 25–26 kDa. Na2CO3 treatment showed a greater degumming effect (i.e., cocoon sheet weight loss) than degumming with cocoonase. However, the cocoonase-treated silk cocoon sheet retains the natural color of tasar silk, smoothness, and luster compared with the cocoon sheet treated with Na2CO3. SEM-based analysis showed noticeable variation on the surface of silk fiber treated with cocoonase and Na2CO3. OCT analysis also revealed variations in the cross-sectional view of the cocoonase- and Na2CO3-treated silk sheets. Conclusions The present study sheds light on the detailed characteristics of cocoonase and sericin proteins, comparative degumming activity, and image analysis of cocoonase- and Na2CO3-treated silk sheets. The findings support the use of cocoonase enzyme for degumming silk cocoons at larger scale, which could be a boon to the silk industry. Supplementary Information The online version contains supplementary material available at 10.1186/s43141-021-00125-2.
Affiliation(s)
- Preeti Anand
- Department of Bio-Engineering, Birla Institute of Technology, Mesra, Ranchi, Jharkhand, 835215, India
- Jay Prakash Pandey
- Central Tasar Research and Training Institute, Piska-Nagri, Jharkhand, Ranchi, India
- Dev Mani Pandey
- Department of Bio-Engineering, Birla Institute of Technology, Mesra, Ranchi, Jharkhand, 835215, India
3
Hu L, Yang S. A fast algorithm to identify coevolutionary patterns from protein sequences based on tree-based data structure. 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC) 2019:2273-2278. [DOI: 10.1109/smc.2019.8914527] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 01/03/2025]
4
Kotlyar M, Rossos AEM, Jurisica I. Prediction of Protein-Protein Interactions. Curr Protoc Bioinformatics 2017; 60:8.2.1-8.2.14. [PMID: 29220074 DOI: 10.1002/cpbi.38] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Indexed: 12/18/2022]
Abstract
The authors provide an overview of physical protein-protein interaction prediction, covering the main strategies for predicting interactions, approaches for assessing predictions, and online resources for accessing predictions. This unit focuses on the main advancements in each of these areas over the last decade. The methods and resources that are presented here are not an exhaustive set, but characterize the current state of the field, highlighting key challenges and achievements. © 2017 by John Wiley & Sons, Inc.
Affiliation(s)
- Max Kotlyar
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Andrea E M Rossos
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Igor Jurisica
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada; Departments of Medical Biophysics and Computer Science, University of Toronto, Ontario, Canada; Institute of Neuroimmunology, Slovak Academy of Sciences, Bratislava, Slovakia
5
Hu L, Chan KCC. Extracting Coevolutionary Features from Protein Sequences for Predicting Protein-Protein Interactions. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:155-166. [PMID: 26812730 DOI: 10.1109/tcbb.2016.2520923] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Indexed: 06/05/2023]
Abstract
Knowing the ways proteins interact with each other is crucial to our understanding of the functional mechanisms of proteins. It is for this reason that different approaches have been developed in attempts to predict protein-protein interactions (PPIs) computationally. Among them, the sequence-based approaches are preferred to the others as they do not require any information about protein properties to perform their tasks. Instead, most sequence-based approaches make use of feature extraction methods to extract features directly from protein sequences so that for each protein sequence, we can construct a feature vector. The feature vectors of every pair of proteins are then concatenated to form two classes of interacting and non-interacting proteins. The prediction of whether or not two proteins interact with each other is then formulated as a classification problem. How accurately PPI predictions can be made therefore depends on how well the features extracted from the protein sequences distinguish interacting from non-interacting pairs. To do so, instead of extracting such features from individual protein sequences independently of the other protein in the same pair, we propose to jointly consider features from both sequences in a protein pair during the feature extraction process by using a novel coevolutionary feature extraction approach called CoFex. Coevolutionary features extracted by CoFex refer to the covariations found at coevolving positions. Based on the presence and absence of these coevolutionary features in the sequences of two proteins, feature vectors can be composed for pairs of proteins rather than individual proteins. The experimental results show that CoFex is a promising feature extraction approach and can improve the performance of PPI prediction.
6
Hu L, Chan KCC. Discovering Variable-Length Patterns in Protein Sequences for Protein-Protein Interaction Prediction. IEEE Trans Nanobioscience 2015; 14:409-416. [DOI: 10.1109/tnb.2015.2429672] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Indexed: 01/03/2025]
7
Lu Y, Lu Y, Deng J, Peng H, Lu H, Lu LJ. A novel essential domain perspective for exploring gene essentiality. Bioinformatics 2015; 31:2921-9. [PMID: 26002906 DOI: 10.1093/bioinformatics/btv312] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 01/11/2015] [Accepted: 05/13/2015] [Indexed: 02/05/2023] Open
Abstract
MOTIVATION Genes with indispensable functions are identified as essential; however, the traditional gene-level studies of essentiality have several limitations. In this study, we characterized gene essentiality from a new perspective of protein domains, the independent structural or functional units of a polypeptide chain. RESULTS To identify such essential domains, we have developed an Expectation-Maximization (EM) algorithm-based Essential Domain Prediction (EDP) Model. With simulated datasets, the model provided convergent results given different initial values and offered accurate predictions even with noise. We then applied the EDP model to six microbial species and predicted 1879 domains to be essential in at least one species, ranging 10-23% in each species. The predicted essential domains were more conserved than either non-essential domains or essential genes. Comparing essential domains in prokaryotes and eukaryotes revealed an evolutionary distance consistent with that inferred from ribosomal RNA. When utilizing these essential domains to reproduce the annotation of essential genes, we received accurate results that suggest protein domains are more basic units for the essentiality of genes. Furthermore, we presented several examples to illustrate how the combination of essential and non-essential domains can lead to genes with divergent essentiality. In summary, we have described the first systematic analysis on gene essentiality on the level of domains. CONTACT huilu.bioinfo@gmail.com or Long.Lu@cchmc.org SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yao Lu
- Shanghai Institute of Medical Genetics, Shanghai Children's Hospital, Shanghai Jiao Tong University, 24/1400 Beijing (W) Road, Shanghai 200040, People's Republic of China
- Yulan Lu
- State Key Laboratory of Genetic Engineering Institute of Biostatistics, School of Life Science, Fudan University, Shanghai 200433, People's Republic of China
- Jingyuan Deng
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA
- Hai Peng
- Institute for Systems Biology, Jianghan University, Wuhan, Hubei, People's Republic of China
- Hui Lu
- Shanghai Institute of Medical Genetics, Shanghai Children's Hospital, Shanghai Jiao Tong University, 24/1400 Beijing (W) Road, Shanghai 200040, People's Republic of China, Department of Bioengineering (MC 063), University of Illinois at Chicago, Chicago, IL 60607-7052, USA and Collaborative Innovation Center for Biotherapy, West China Hospital, Sichuan University, Chengdu, China
- Long Jason Lu
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH 45229, USA, Institute for Systems Biology, Jianghan University, Wuhan, Hubei, People's Republic of China
8
Xu D, Jaroszewski L, Li Z, Godzik A. AIDA: ab initio domain assembly for automated multi-domain protein structure prediction and domain-domain interaction prediction. Bioinformatics 2015; 31:2098-105. [PMID: 25701568 DOI: 10.1093/bioinformatics/btv092] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Received: 09/22/2014] [Accepted: 02/10/2015] [Indexed: 11/12/2022]
Abstract
MOTIVATION Most proteins consist of multiple domains, independent structural and evolutionary units that are often reshuffled in genomic rearrangements to form new protein architectures. Template-based modeling methods can often detect homologous templates for individual domains, but templates that could be used to model the entire query protein are often not available. RESULTS We have developed a fast docking algorithm ab initio domain assembly (AIDA) for assembling multi-domain protein structures, guided by the ab initio folding potential. This approach can be extended to discontinuous domains (i.e. domains with 'inserted' domains). When tested on experimentally solved structures of multi-domain proteins, the relative domain positions were accurately found among top 5000 models in 86% of cases. AIDA server can use domain assignments provided by the user or predict them from the provided sequence. The latter approach is particularly useful for automated protein structure prediction servers. The blind test consisting of 95 CASP10 targets shows that domain boundaries could be successfully determined for 97% of targets. AVAILABILITY AND IMPLEMENTATION The AIDA package as well as the benchmark sets used here are available for download at http://ffas.burnham.org/AIDA/. CONTACT adam@sanfordburnham.org SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dong Xu
- Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037, USA, Center for Research in Biological Systems, University of California, San Diego, 9500 Gilman Dr. La Jolla, CA 92093-0446, USA and Center of Excellence in Genomic Medicine Research (CEGMR), King Fahad Medical Research Center, King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia
- Lukasz Jaroszewski
- Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037, USA, Center for Research in Biological Systems, University of California, San Diego, 9500 Gilman Dr. La Jolla, CA 92093-0446, USA and Center of Excellence in Genomic Medicine Research (CEGMR), King Fahad Medical Research Center, King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia
- Zhanwen Li
- Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037, USA, Center for Research in Biological Systems, University of California, San Diego, 9500 Gilman Dr. La Jolla, CA 92093-0446, USA and Center of Excellence in Genomic Medicine Research (CEGMR), King Fahad Medical Research Center, King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia
- Adam Godzik
- Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037, USA, Center for Research in Biological Systems, University of California, San Diego, 9500 Gilman Dr. La Jolla, CA 92093-0446, USA and Center of Excellence in Genomic Medicine Research (CEGMR), King Fahad Medical Research Center, King Abdulaziz University, Jeddah, Kingdom of Saudi Arabia
9
Hernandez-Prieto MA, Kalathur RK, Futschik ME. Molecular Networks – Representation and Analysis. Springer Handbook of Bio-/Neuroinformatics 2014:399-418. [DOI: 10.1007/978-3-642-30574-0_24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2025]
10
Abstract
Proteins do not function in isolation; it is their interactions with one another and also with other molecules (e.g. DNA, RNA) that mediate metabolic and signaling pathways, cellular processes, and organismal systems. Due to their central role in biological function, protein interactions also control the mechanisms leading to healthy and diseased states in organisms. Diseases are often caused by mutations affecting the binding interface or leading to biochemically dysfunctional allosteric changes in proteins. Therefore, protein interaction networks can elucidate the molecular basis of disease, which in turn can inform methods for prevention, diagnosis, and treatment. In this chapter, we will describe the computational approaches to predict and map networks of protein interactions and briefly review the experimental methods to detect protein interactions. We will describe the application of protein interaction networks as a translational approach to the study of human disease and evaluate the challenges faced by these approaches.
Affiliation(s)
- Mileidy W. Gonzalez
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Maricel G. Kann
- Biological Sciences, University of Maryland, Baltimore County, Baltimore, Maryland, United States of America
11
Konuma T, Lee YH, Goto Y, Sakurai K. Principal component analysis of chemical shift perturbation data of a multiple-ligand-binding system for elucidation of respective binding mechanism. Proteins 2012; 81:107-18. [PMID: 22927212 DOI: 10.1002/prot.24166] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 05/02/2012] [Revised: 07/24/2012] [Accepted: 08/17/2012] [Indexed: 11/12/2022]
Abstract
Chemical shift perturbations (CSPs) in NMR spectra provide useful information about the interaction of a protein with its ligands. However, in a multiple-ligand-binding system, determining quantitative parameters such as a dissociation constant (Kd) is difficult. Here, we used a method we named CS-PCA, a principal component analysis (PCA) of chemical shift (CS) data, to analyze the interaction between bovine β-lactoglobulin (βLG) and 1-anilinonaphthalene-8-sulfonate (ANS), which is a multiple-ligand-binding system. The CSP on the binding of ANS involved contributions from two distinct binding sites. PCA of the titration data successfully separated the CSP pattern into contributions from each site. Docking simulations based on the separated CSP patterns provided the structures of βLG-ANS complexes for each binding site. In addition, we determined the Kd values as 3.42 × 10⁻⁴ M² and 2.51 × 10⁻³ M for Sites 1 and 2, respectively. In contrast, it was difficult to obtain reliable Kd values for respective sites from the isothermal titration calorimetry experiments. Two ANS molecules were found to bind at Site 1 simultaneously, suggesting that the binding occurs cooperatively with a partial unfolding of the βLG structure. On the other hand, the binding of ANS to Site 2 was a simple attachment without a significant conformational change. From the present results, CS-PCA was confirmed to provide not only the positions and the Kd values of binding sites but also information about the binding mechanism. Thus, it is anticipated to be a general method to investigate protein-ligand interactions.
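The idea of separating a two-site CSP pattern by PCA can be illustrated with a toy calculation. The sketch below is not the authors' CS-PCA code. It assumes a synthetic titration in which both sites follow simple 1:1 binding curves (the abstract itself reports two-ligand cooperative binding at Site 1), and the per-residue shift patterns, ligand concentrations, and dissociation constants are hypothetical. It only shows that such a data matrix has two significant principal components, one per independent binding process.

```python
import numpy as np

def binding_fraction(ligand, kd):
    """Fraction of sites occupied under an assumed simple 1:1 binding model."""
    return ligand / (kd + ligand)

# Hypothetical titration: 12 ligand concentrations (M) and 30 residues.
ligand = np.linspace(0.0, 5e-3, 12)
rng = np.random.default_rng(0)
site1_pattern = rng.normal(size=30)  # per-residue shift change caused by site 1
site2_pattern = rng.normal(size=30)  # per-residue shift change caused by site 2

# Observed CSP matrix (titration points x residues) is the sum of both sites.
csp = (np.outer(binding_fraction(ligand, 3e-4), site1_pattern)
       + np.outer(binding_fraction(ligand, 2.5e-3), site2_pattern))

# PCA via SVD of the column-centred matrix.
centered = csp - csp.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Count components that are not numerically zero.
n_components = int(np.sum(singular_values > 1e-8 * singular_values[0]))
print(n_components, explained[:3])
```

Note that the principal components are orthogonal combinations of the two site patterns, so recovering each site's own pattern and Kd needs an additional fitting step that this sketch omits.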
Affiliation(s)
- Tsuyoshi Konuma
- Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
12
Stojmirović A, Yu YK. ppiTrim: constructing non-redundant and up-to-date interactomes. Database: The Journal of Biological Databases and Curation 2011; 2011:bar036. [PMID: 21873645 PMCID: PMC3162744 DOI: 10.1093/database/bar036] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 12/30/2022]
Abstract
Robust advances in interactome analysis demand comprehensive, non-redundant and consistently annotated data sets. By non-redundant, we mean that the accounting of evidence for every interaction should be faithful: each independent experimental support is counted exactly once, no more, no less. While many interactions are shared among public repositories, none of them contains the complete known interactome for any model organism. In addition, the annotations of the same experimental result by different repositories often disagree. This brings up the issue of which annotation to keep while consolidating evidences that are the same. The iRefIndex database, including interactions from most popular repositories with a standardized protein nomenclature, represents a significant advance in all aspects, especially in comprehensiveness. However, iRefIndex aims to maintain all information/annotation from original sources and requires users to perform additional processing to fully achieve the aforementioned goals. Another issue has to do with protein complexes. Some databases represent experimentally observed complexes as interactions with more than two participants, while others expand them into binary interactions using spoke or matrix model. To avoid untested interaction information buildup, it is preferable to replace the expanded protein complexes, either from spoke or matrix models, with a flat list of complex members. To address these issues and to achieve our goals, we have developed ppiTrim, a script that processes iRefIndex to produce non-redundant, consistently annotated data sets of physical interactions. Our script proceeds in three stages: mapping all interactants to gene identifiers and removing all undesired raw interactions, deflating potentially expanded complexes, and reconciling for each interaction the annotation labels among different source databases. 
As an illustration, we have processed the three largest organismal data sets: yeast, human and fruitfly. While ppiTrim can resolve most apparent conflicts between different labelings, we also discovered some unresolvable disagreements mostly resulting from different annotation policies among repositories. Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/downloads/ppiTrim.html
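The difference between the spoke model, the matrix model, and a flat list of complex members can be made concrete with a small example. This is a minimal sketch, not ppiTrim itself: the function names and the protein labels are hypothetical, and it only illustrates how the two expansions turn one observed complex into binary interactions that were never individually tested.

```python
from itertools import combinations

def spoke_expansion(bait, preys):
    """Spoke model: pair the bait with every prey it pulled down."""
    return [(bait, prey) for prey in preys]

def matrix_expansion(members):
    """Matrix model: pair every member of the complex with every other member."""
    return list(combinations(members, 2))

def flat_complex(members):
    """Flat representation: one record listing the members, with no implied pairs."""
    return sorted(set(members))

# Hypothetical complex: bait A purified together with preys B, C and D.
bait, preys = "A", ["B", "C", "D"]
spoke = spoke_expansion(bait, preys)
matrix = matrix_expansion([bait] + preys)
flat = flat_complex([bait] + preys)

print(len(spoke), len(matrix), flat)
```

For a complex with n members, the spoke model yields n - 1 binary records and the matrix model yields n(n - 1)/2, while the flat list keeps a single record for the one experimental observation.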
Affiliation(s)
- Aleksandar Stojmirović
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
13
Ochoa A, Llinás M, Singh M. Using context to improve protein domain identification. BMC Bioinformatics 2011; 12:90. [PMID: 21453511 PMCID: PMC3090354 DOI: 10.1186/1471-2105-12-90] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Received: 10/25/2010] [Accepted: 03/31/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Identifying domains in protein sequences is an important step in protein structural and functional annotation. Existing domain recognition methods typically evaluate each domain prediction independently of the rest. However, the majority of proteins are multidomain, and pairwise domain co-occurrences are highly specific and non-transitive. RESULTS Here, we demonstrate how to exploit domain co-occurrence to boost weak domain predictions that appear in previously observed combinations, while penalizing higher confidence domains if such combinations have never been observed. Our framework, Domain Prediction Using Context (dPUC), incorporates pairwise "context" scores between domains, along with traditional domain scores and thresholds, and improves domain prediction across a variety of organisms from bacteria to protozoa and metazoa. Among the genomes we tested, dPUC is most successful at improving predictions for the poorly-annotated malaria parasite Plasmodium falciparum, for which over 38% of the genome is currently unannotated. Our approach enables high-confidence annotations in this organism and the identification of orthologs to many core machinery proteins conserved in all eukaryotes, including those involved in ribosomal assembly and other RNA processing events, which surprisingly had not been previously known. CONCLUSIONS Overall, our results demonstrate that this new context-based approach will provide significant improvements in domain and function prediction, especially for poorly understood genomes for which the need for additional annotations is greatest. Source code for the algorithm is available under a GPL open source license at http://compbio.cs.princeton.edu/dpuc/. Pre-computed results for our test organisms and a web server are also available at that location.
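How pairwise context scores can rescue a weak prediction and penalize a stronger one can be shown with a toy scoring function. The sketch below is an illustration only and is not the dPUC implementation: it uses an exhaustive search over a handful of candidates, and the domain names, the thresholded domain scores (score minus threshold, so negative means below threshold), and the context scores are all hypothetical.

```python
from itertools import combinations

def best_domain_set(domain_scores, context_scores):
    """Pick the subset of candidate domains with the highest combined score.

    Combined score = sum of per-domain scores + sum of pairwise context scores.
    Exhaustive search, so only suitable for a few candidates.
    """
    names = list(domain_scores)
    best, best_score = (), 0.0  # the empty set scores 0
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            score = sum(domain_scores[d] for d in subset)
            score += sum(context_scores.get(frozenset(pair), 0.0)
                         for pair in combinations(subset, 2))
            if score > best_score:
                best, best_score = subset, score
    return set(best), best_score

# Hypothetical candidates on one protein.
domain_scores = {"A": 5.0, "B": -1.0, "C": 2.0}
context_scores = {
    frozenset({"A", "B"}): 3.0,   # A and B have been observed together
    frozenset({"A", "C"}): -4.0,  # A and C have never been observed together
}

chosen, total = best_domain_set(domain_scores, context_scores)
print(sorted(chosen), total)
```

In this toy case the weak candidate B is kept because it co-occurs with A, while C is dropped despite passing its own threshold.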
Affiliation(s)
- Alejandro Ochoa
- Department of Molecular Biology, Princeton University, Princeton, NJ, USA
14
Jain P, Hirst JD. Automatic structure classification of small proteins using random forest. BMC Bioinformatics 2010; 11:364. [PMID: 20594334 PMCID: PMC2916923 DOI: 10.1186/1471-2105-11-364] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 02/25/2010] [Accepted: 07/01/2010] [Indexed: 11/29/2022] Open
Abstract
Background Random forest, an ensemble-based supervised machine learning algorithm, is used to predict the SCOP structural classification for a target structure, based on the similarity of its structural descriptors to those of a template structure with an equal number of secondary structure elements (SSEs). An initial assessment of random forest is carried out for domains consisting of three SSEs. The usability of random forest in classifying larger domains is demonstrated by applying it to domains consisting of four, five and six SSEs. Results Random forest, trained on SCOP version 1.69, achieves a predictive accuracy of up to 94% on an independent and non-overlapping test set derived from SCOP version 1.73. For classification to the SCOP Class, Fold, Super-family or Family levels, the predictive quality of the model in terms of the Matthews correlation coefficient (MCC) ranged from 0.61 to 0.83. As the number of constituent SSEs increases, the MCC for classification to different structural levels decreases. Conclusions The utility of random forest in classifying domains from the place-holder classes of SCOP to the true Class, Fold, Super-family or Family levels is demonstrated. Issues such as introduction of a new structural level in SCOP and the merger of singleton levels can also be addressed using random forest. A real-world scenario is mimicked by predicting the classification for those protein structures from the PDB, which are yet to be assigned to the SCOP classification hierarchy.
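For readers unfamiliar with the MCC values quoted above, the binary form of the coefficient can be computed directly from confusion-matrix counts. This is a minimal sketch with hypothetical counts for a single class evaluated one-vs-rest; it does not reproduce the study's data, and the abstract does not say how the multi-level scores were aggregated.

```python
import math

def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts for one structural class treated as positive.
mcc = matthews_corrcoef(tp=45, tn=40, fp=10, fn=5)
print(round(mcc, 3))
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which is why it is often preferred over accuracy when class sizes are unbalanced.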
Affiliation(s)
- Pooja Jain
- School of Chemistry, The University of Nottingham, University Park, Nottingham, NG7 2RD, UK