1
Chen G, Hou L, Li Z, Xie B, Liu Y. A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding. Sci Rep 2025; 15:15236. [PMID: 40307455 PMCID: PMC12043993 DOI: 10.1038/s41598-025-99999-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2024] [Accepted: 04/24/2025] [Indexed: 05/02/2025] Open
Abstract
The CRISPR-Cas system, an adaptive immune mechanism found in bacteria and archaea, has evolved into a promising genomic editing tool, with various types of Cas proteins playing a crucial role. In this study, we developed a set of strategies for mining and identifying Cas1 proteins. Firstly, we analyzed the characteristic differences of 14 types of Cas proteins in the protein large language model embedding space in detail; then converted proteins into the Simplified Molecular Input Line Entry System (SMILES) format, thereby constructing graph data representing atom and bond features. Next, based on the characteristic differences of different Cas proteins, we designed and trained an ensemble model composed of two Directed Message Passing Neural Network (DMPNN) models for high-precision identification of Cas1 proteins. This ensemble model performed excellently on both training data and newly designed datasets. The comparison of this method with other methods, such as CRISPRCasFinder, has demonstrated its effectiveness. Finally, the ensemble model was successfully employed to identify potential Cas1 proteins in the Ensembl database, further highlighting its robustness and practicality. The strategies and models from this research may potentially be extended to other types of Cas proteins, though this would require further investigation and validation. Moreover, our work highlights SMILES encoding as a versatile tool for studying biological macromolecules, enabling efficient structural representation and advanced computational applications in protein research and beyond.
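The abstract describes turning a SMILES string into graph data with atom and bond features and feeding it to a directed message passing network. The sketch below is only an illustration of that general idea, not the authors' pipeline: the graph for glycine (SMILES `NCC(=O)O`) is written out by hand, and the names `directed_edges` and `incoming_messages` are hypothetical. It shows how each bond becomes two directed edges and which neighbouring edges feed a given directed edge, which is the usual bookkeeping in directed message passing.

```python
# Hand-written heavy-atom graph for glycine, SMILES "NCC(=O)O".
# Atom indices follow the order in which atoms appear in the SMILES string.
# This is an illustrative assumption, not data from the paper.
atoms = ["N", "C", "C", "O", "O"]

# Undirected bonds as (atom_i, atom_j, bond_order).
bonds = [(0, 1, 1), (1, 2, 1), (2, 3, 2), (2, 4, 1)]


def directed_edges(bonds):
    """Expand each undirected bond into two directed edges, i->j and j->i."""
    edges = []
    for i, j, order in bonds:
        edges.append((i, j, order))
        edges.append((j, i, order))
    return edges


def incoming_messages(edges, edge):
    """Return the edges k->i that feed the directed edge i->j.

    The reverse edge j->i is excluded, so a message is not passed
    straight back along the bond it just arrived on.
    """
    i, j, _ = edge
    return [e for e in edges if e[1] == i and e[0] != j]


edges = directed_edges(bonds)
print(len(atoms), "atoms,", len(edges), "directed edges")

# The edge C(1) -> C(2) receives a message only from N(0) -> C(1).
print(incoming_messages(edges, (1, 2, 1)))
```

In a real model each atom and each directed edge would carry a feature vector (element, charge, bond type and so on) rather than a bare symbol or bond order; those details are not given in the abstract and are left out here.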
Affiliation(s)
- Gaoxiang Chen
  - Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China
- Liya Hou
  - Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China
- Zhanwei Li
  - Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China
- Bin Xie
  - Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China
- Yongqiang Liu
  - Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China
2
Yan C, Zhang Z, Xu J, Meng Y, Yan S, Wei L, Zou Q, Zhang Q, Cui F. CasPro-ESM2: Accurate identification of Cas proteins integrating pre-trained protein language model and multi-scale convolutional neural network. Int J Biol Macromol 2025; 308:142309. [PMID: 40127793 DOI: 10.1016/j.ijbiomac.2025.142309] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2024] [Revised: 03/15/2025] [Accepted: 03/18/2025] [Indexed: 03/26/2025]
Abstract
Cas proteins (CRISPR-associated proteins) are the core components of the CRISPR-Cas system, playing critical roles in defending against foreign DNA and RNA invasions. Identifying Cas proteins can provide deeper insights into the immune mechanisms of the CRISPR-Cas system and help uncover the functional mechanisms of Cas proteins. In this study, we developed a computational tool named CasPro-ESM2, which combines the Pre-trained Protein Language Model ESM-2, multi-scale convolutional neural networks, and evolutionary information from protein sequences to identify Cas proteins. Experimental results demonstrate that CasPro-ESM2 outperforms existing models in Cas protein identification, achieving the highest values in metrics such as ACC, SP, SN, and MCC on two different datasets. Furthermore, we deployed this tool on a web server to enable direct access for users (http://www.bioai-lab.com/CasProESM-2).
Affiliation(s)
- Chaorui Yan
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Junlin Xu
  - School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430081, Hubei, China
- Yajie Meng
  - School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan 430200, Hubei, China
- Shankai Yan
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Leyi Wei
  - Centre for Artificial Intelligence driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macao; School of Informatics, Xiamen University, Xiamen, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Qingchen Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
3
Vercauteren S, Fiesack S, Maroc L, Verstraeten N, Dewachter L, Michiels J, Vonesch SC. The rise and future of CRISPR-based approaches for high-throughput genomics. FEMS Microbiol Rev 2024; 48:fuae020. [PMID: 39085047 PMCID: PMC11409895 DOI: 10.1093/femsre/fuae020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2024] [Revised: 07/19/2024] [Accepted: 07/30/2024] [Indexed: 08/02/2024] Open
Abstract
Clustered regularly interspaced short palindromic repeats (CRISPR) has revolutionized the field of genome editing. To circumvent the permanent modifications made by traditional CRISPR techniques and facilitate the study of both essential and nonessential genes, CRISPR interference (CRISPRi) was developed. This gene-silencing technique employs a deactivated Cas effector protein and a guide RNA to block transcription initiation or elongation. Continuous improvements and a better understanding of the mechanism of CRISPRi have expanded its scope, facilitating genome-wide high-throughput screens to investigate the genetic basis of phenotypes. Additionally, emerging CRISPR-based alternatives have further expanded the possibilities for genetic screening. This review delves into the mechanism of CRISPRi, compares it with other high-throughput gene-perturbation techniques, and highlights its superior capacities for studying complex microbial traits. We also explore the evolution of CRISPRi, emphasizing enhancements that have increased its capabilities, including multiplexing, inducibility, titratability, predictable knockdown efficacy, and adaptability to nonmodel microorganisms. Beyond CRISPRi, we discuss CRISPR activation, RNA-targeting CRISPR systems, and single-nucleotide resolution perturbation techniques for their potential in genome-wide high-throughput screens in microorganisms. Collectively, this review gives a comprehensive overview of the general workflow of a genome-wide CRISPRi screen, with an extensive discussion of strengths and weaknesses, future directions, and potential alternatives.
Affiliation(s)
- Silke Vercauteren
  - Center for Microbiology, VIB - KU Leuven, Gaston Geenslaan 1, 3001 Leuven, Belgium
  - Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, box 2460, 3001 Leuven, Belgium
- Simon Fiesack
  - Center for Microbiology, VIB - KU Leuven, Gaston Geenslaan 1, 3001 Leuven, Belgium
  - Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, box 2460, 3001 Leuven, Belgium
- Laetitia Maroc
  - Center for Microbiology, VIB - KU Leuven, Gaston Geenslaan 1, 3001 Leuven, Belgium
  - Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, box 2460, 3001 Leuven, Belgium
- Natalie Verstraeten
  - Center for Microbiology, VIB - KU Leuven, Gaston Geenslaan 1, 3001 Leuven, Belgium
  - Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, box 2460, 3001 Leuven, Belgium
- Liselot Dewachter
  - de Duve Institute, Université catholique de Louvain, Hippokrateslaan 75, 1200 Brussels, Belgium
- Jan Michiels
  - Center for Microbiology, VIB - KU Leuven, Gaston Geenslaan 1, 3001 Leuven, Belgium
  - Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, box 2460, 3001 Leuven, Belgium
- Sibylle C Vonesch
  - Center for Microbiology, VIB - KU Leuven, Gaston Geenslaan 1, 3001 Leuven, Belgium
  - Centre of Microbial and Plant Genetics, KU Leuven, Kasteelpark Arenberg 20, box 2460, 3001 Leuven, Belgium
4
Madugula SS, Pujar P, Nammi B, Wang S, Jayasinghe-Arachchige VM, Pham T, Mashburn D, Artiles M, Liu J. Identification of Family-Specific Features in Cas9 and Cas12 Proteins: A Machine Learning Approach Using Complete Protein Feature Spectrum. J Chem Inf Model 2024; 64:4897-4911. [PMID: 38838358 DOI: 10.1021/acs.jcim.4c00625] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2024]
Abstract
The recent development of CRISPR-Cas technology holds promise to correct gene-level defects for genetic diseases. The key element of the CRISPR-Cas system is the Cas protein, a nuclease that can edit the gene of interest assisted by guide RNA. However, these Cas proteins suffer from inherent limitations such as large size, low cleavage efficiency, and off-target effects, hindering their widespread application as a gene editing tool. Therefore, there is a need to identify novel Cas proteins with improved editing properties, for which it is necessary to understand the underlying features governing the Cas families. In this study, we aim to elucidate the unique protein features associated with Cas9 and Cas12 families and identify the features distinguishing each family from non-Cas proteins. Here, we built Random Forest (RF) binary classifiers to distinguish Cas12 and Cas9 proteins from non-Cas proteins, respectively, using the complete protein feature spectrum (13,494 features) encoding various physiochemical, topological, constitutional, and coevolutionary information on Cas proteins. Furthermore, we built multiclass RF classifiers differentiating Cas9, Cas12, and non-Cas proteins. All the models were evaluated rigorously on the test and independent data sets. The Cas12 and Cas9 binary models achieved a high overall accuracy of 92% and 95% on their respective independent data sets, while the multiclass classifier achieved an F1 score of close to 0.98. We observed that Quasi-Sequence-Order (QSO) descriptors like Schneider.lag and Composition descriptors like charge, volume, and polarizability are predominant in the Cas12 family. Conversely Amino Acid Composition descriptors, especially Tripeptide Composition (TPC), predominate the Cas9 family. 
Four of the top 10 descriptors identified in Cas9 classification are tripeptides PWN, PYY, HHA, and DHI, which are seen to be conserved across all Cas9 proteins and located within different catalytically important domains of the Streptococcus pyogenes Cas9 (SpCas9) structure. Among these, DHI and HHA are well-known to be involved in the DNA cleavage activity of the SpCas9 protein. Mutation studies have highlighted the significance of the PWN tripeptide in PAM recognition and DNA cleavage activity of SpCas9, while Y450 from the PYY tripeptide plays a crucial role in reducing off-target effects and improving the specificity in SpCas9. Leveraging our machine learning (ML) pipeline, we identified numerous Cas9 and Cas12 family-specific features. These features offer valuable insights for future experimental and computational studies aiming at designing Cas systems with enhanced gene-editing properties. These features suggest plausible structural modifications that can effectively guide the development of Cas proteins with improved editing capabilities.
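Tripeptide Composition (TPC), which the abstract singles out for the Cas9 family, is conventionally the frequency of each overlapping three-residue window in a sequence. The sketch below assumes that standard definition (count divided by sequence length minus two); it is not the authors' feature pipeline. The function name `tripeptide_composition` and the toy sequence are hypothetical, and the toy sequence is not a real Cas9 fragment.

```python
from collections import Counter


def tripeptide_composition(seq):
    """Return the frequency of each overlapping tripeptide in `seq`.

    Frequencies are counts divided by the number of windows, len(seq) - 2.
    Tripeptides that do not occur are simply absent from the result; a full
    descriptor vector would list all 20**3 = 8000 possible tripeptides.
    """
    seq = seq.upper()
    n_windows = len(seq) - 2
    if n_windows <= 0:
        return {}
    counts = Counter(seq[i:i + 3] for i in range(n_windows))
    return {tri: n / n_windows for tri, n in counts.items()}


# Hypothetical toy sequence that happens to contain the four tripeptides
# named in the abstract.
toy = "MDHIAPWNPYYHHA"
tpc = tripeptide_composition(toy)
for tri in ("PWN", "PYY", "HHA", "DHI"):
    print(tri, round(tpc.get(tri, 0.0), 4))
```

A classifier such as a Random Forest would then take these frequencies, together with the other descriptor families mentioned in the abstract, as input columns.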
Affiliation(s)
- Sita Sirisha Madugula
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
- Pranav Pujar
  - Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, 701 South Nedderman Drive, Arlington, Texas 76019, United States
- Bharani Nammi
  - Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, 701 South Nedderman Drive, Arlington, Texas 76019, United States
- Shouyi Wang
  - Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, 701 South Nedderman Drive, Arlington, Texas 76019, United States
- Vindi M Jayasinghe-Arachchige
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
- Tyler Pham
  - School of Biomedical Sciences, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
- Dominic Mashburn
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
- Maria Artiles
  - School of Biomedical Sciences, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
- Jin Liu
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
  - School of Biomedical Sciences, University of North Texas Health Science Center, 3500 Camp Bowie Blvd, Fort Worth, Texas 76107, United States
5
Madugula SS, Pujar P, Bharani N, Wang S, Jayasinghe-Arachchige VM, Pham T, Mashburn D, Artilis M, Liu J. Identification of Family-Specific Features in Cas9 and Cas12 Proteins: A Machine Learning Approach Using Complete Protein Feature Spectrum. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.22.576286. [PMID: 38328240 PMCID: PMC10849529 DOI: 10.1101/2024.01.22.576286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2024]
Abstract
The recent development of CRISPR-Cas technology holds promise to correct gene-level defects for genetic diseases. The key element of the CRISPR-Cas system is the Cas protein, a nuclease that can edit the gene of interest assisted by guide RNA. However, these Cas proteins suffer from inherent limitations like large size, low cleavage efficiency, and off-target effects, hindering their widespread application as a gene editing tool. Therefore, there is a need to identify novel Cas proteins with improved editing properties, for which it is necessary to understand the underlying features governing the Cas families. In the current study, we aim to elucidate the unique protein attributes associated with Cas9 and Cas12 families and identify the features that distinguish each family from the other. Here, we built Random Forest (RF) binary classifiers to distinguish Cas12 and Cas9 proteins from non-Cas proteins, respectively, using the complete protein feature spectrum (13,495 features) encoding various physiochemical, topological, constitutional, and coevolutionary information of Cas proteins. Furthermore, we built multiclass RF classifiers differentiating Cas9, Cas12, and Non-Cas proteins. All the models were evaluated rigorously on the test and independent datasets. The Cas12 and Cas9 binary models achieved a high overall accuracy of 95% and 97% on their respective independent datasets, while the multiclass classifier achieved a high F1 score of 0.97. We observed that Quasi-sequence-order descriptors like Schneider-lag descriptors and Composition descriptors like charge, volume, and polarizability are essential for the Cas12 family. More interestingly, we discovered that Amino Acid Composition descriptors, especially the Tripeptide Composition (TPC) descriptors, are important for the Cas9 family. 
Four of the identified important descriptors of Cas9 classification are tripeptides PWN, PYY, HHA, and DHI, which are seen to be conserved across all the Cas9 proteins and were located within different catalytically important domains of the Cas9 protein structure. Among these four tripeptides, DHI and HHA are well-known to be involved in the DNA cleavage activity of the Cas9 protein. We therefore propose that the other two tripeptides, PWN and PYY, may also be essential for the Cas9 family. Our identified important descriptors enhance the understanding of the catalytic mechanisms of Cas9 and Cas12 proteins and provide valuable insights into the design of novel Cas systems to achieve enhanced gene-editing properties.
Affiliation(s)
- Sita Sirisha Madugula
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States
- Pranav Pujar
  - Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, Arlington, Texas, United States
- Nammi Bharani
  - Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, Arlington, Texas, United States
- Shouyi Wang
  - Department of Industrial, Manufacturing and Systems Engineering, University of Texas at Arlington, Arlington, Texas, United States
- Vindi M. Jayasinghe-Arachchige
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States
- Tyler Pham
  - Graduate School of Biomedical Sciences, University of North Texas Health Science Center, Fort Worth, Texas
- Dominic Mashburn
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States
- Maria Artilis
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States
- Jin Liu
  - Department of Pharmaceutical Sciences, University of North Texas System College of Pharmacy, University of North Texas Health Science Center, Fort Worth, Texas, United States
  - Graduate School of Biomedical Sciences, University of North Texas Health Science Center, Fort Worth, Texas
6
Ullah N, Yang N, Guan Z, Xiang K, Wang Y, Diaby M, Chen C, Gao B, Song C. Comparative Analysis and Phylogenetic Insights of Cas14-Homology Proteins in Bacteria and Archaea. Genes (Basel) 2023; 14:1911. [PMID: 37895260 PMCID: PMC10606334 DOI: 10.3390/genes14101911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2023] [Revised: 09/29/2023] [Accepted: 10/03/2023] [Indexed: 10/29/2023] Open
Abstract
Type-V-F Cas12f proteins, also known as Cas14, have drawn significant interest within the diverse CRISPR-Cas nucleases due to their compact size. This study involves analyzing and comparing Cas14-homology proteins in prokaryotic genomes through mining, sequence comparisons, a phylogenetic analysis, and an array/repeat analysis. In our analysis, we identified and mined a total of 93 Cas14-homology proteins that ranged in size from 344 aa to 843 aa. The majority of the Cas14-homology proteins discovered in this analysis were found within the Firmicutes group, which contained 37 species, representing 40% of all the Cas14-homology proteins identified. In archaea, the DPANN group had the highest number of species containing Cas14-homology proteins, a total of three species. The phylogenetic analysis results demonstrate the division of Cas14-homology proteins into three clades: Cas14-A, Cas14-B, and Cas14-U. Extensive similarity was observed at the C-terminal end (CTD) through a domain comparison of the three clades, suggesting a potentially shared mechanism of action due to the presence of cutting domains in that region. Additionally, a sequence similarity analysis of all the identified Cas14 sequences indicated a low level of similarity (18%) between the protein variants. The analysis of repeats/arrays in the extended nucleotide sequences of the identified Cas14-homology proteins highlighted that 44 out of the total mined proteins possessed CRISPR-associated repeats, with 20 of them being specific to Cas14. Our study contributes to the increased understanding of Cas14 proteins across prokaryotic genomes. These homologous proteins have the potential for future applications in the mining and engineering of Cas14 proteins.
Affiliation(s)
- Chengyi Song
- College of Animal Science and Technology, Yangzhou University, Yangzhou 225009, China; (N.U.); (N.Y.); (Z.G.); (K.X.); (Y.W.); (M.D.); (C.C.); (B.G.)
7
Jeong E, Kim W, Son S, Yang S, Gwon D, Hong J, Cho Y, Jang CY, Steinegger M, Lim YW, Kang KB. Qualitative metabolomics-based characterization of a phenolic UDP-xylosyltransferase with a broad substrate spectrum from Lentinus brumalis. Proc Natl Acad Sci U S A 2023; 120:e2301007120. [PMID: 37399371 PMCID: PMC10334773 DOI: 10.1073/pnas.2301007120] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/18/2023] [Accepted: 06/06/2023] [Indexed: 07/05/2023] Open
Abstract
Wood-decaying fungi are the major decomposers of plant litter. Heavy sequencing efforts on genomes of wood-decaying fungi have recently been made due to the interest in their lignocellulolytic enzymes; however, most parts of their proteomes remain uncharted. We hypothesized that wood-decaying fungi would possess promiscuous enzymes for detoxifying antifungal phytochemicals remaining in the dead plant bodies, which can be useful biocatalysts. We designed a computational mass spectrometry-based untargeted metabolomics pipeline for the phenotyping of biotransformation and applied it to 264 fungal cultures supplemented with antifungal plant phenolics. The analysis identified the occurrence of diverse reactivities by the tested fungal species. Among those, we focused on O-xylosylation of multiple phenolics by one of the species tested, Lentinus brumalis. By integrating the metabolic phenotyping results with publicly available genome sequences and transcriptome analysis, a UDP-glycosyltransferase designated UGT66A1 was identified and validated as an enzyme catalyzing O-xylosylation with broad substrate specificity. We anticipate that our analytical workflow will accelerate the further characterization of fungal enzymes as promising biocatalysts.
Affiliation(s)
- Eunah Jeong
  - College of Pharmacy, Sookmyung Women’s University, Seoul 04310, Korea
  - Research Institute of Pharmaceutical Sciences and Muscle Physiome Research Center, Sookmyung Women’s University, Seoul 04310, Korea
- Wonyong Kim
  - Korean Lichen Research Institute, Sunchon National University, Suncheon 57922, Korea
- Seungju Son
  - College of Pharmacy, Sookmyung Women’s University, Seoul 04310, Korea
- Sungyeon Yang
  - College of Pharmacy, Sookmyung Women’s University, Seoul 04310, Korea
- Dasom Gwon
  - College of Pharmacy, Sookmyung Women’s University, Seoul 04310, Korea
  - Research Institute of Pharmaceutical Sciences and Muscle Physiome Research Center, Sookmyung Women’s University, Seoul 04310, Korea
- Jihee Hong
  - College of Pharmacy, Sookmyung Women’s University, Seoul 04310, Korea
  - Research Institute of Pharmaceutical Sciences and Muscle Physiome Research Center, Sookmyung Women’s University, Seoul 04310, Korea
- Yoonhee Cho
  - School of Biological Sciences, Seoul National University, Seoul 08826, Korea
- Chang-Young Jang
  - College of Pharmacy, Sookmyung Women’s University, Seoul 04310, Korea
  - Research Institute of Pharmaceutical Sciences and Muscle Physiome Research Center, Sookmyung Women’s University, Seoul 04310, Korea
- Martin Steinegger
  - School of Biological Sciences, Seoul National University, Seoul 08826, Korea
  - Artificial Intelligence Institute, Seoul National University, Seoul 08826, Korea
  - Institute of Molecular Biology and Genetics, Seoul National University, Seoul 08826, Korea
- Young Woon Lim
  - School of Biological Sciences, Seoul National University, Seoul 08826, Korea
  - Institute of Microbiology, Seoul National University, Seoul 08826, Korea
- Kyo Bin Kang
  - College of Pharmacy, Sookmyung Women’s University, Seoul 04310, Korea
  - Research Institute of Pharmaceutical Sciences and Muscle Physiome Research Center, Sookmyung Women’s University, Seoul 04310, Korea
8
Zhao X, Sun C, Jin M, Chen J, Xing L, Yan J, Wang H, Liu Z, Chen WH. Enrichment Culture but Not Metagenomic Sequencing Identified a Highly Prevalent Phage Infecting Lactiplantibacillus plantarum in Human Feces. Microbiol Spectr 2023; 11:e0434022. [PMID: 36995238 PMCID: PMC10269749 DOI: 10.1128/spectrum.04340-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Accepted: 03/07/2023] [Indexed: 03/31/2023] Open
Abstract
Lactiplantibacillus plantarum (previously known as Lactobacillus plantarum) is increasingly used as a probiotic to treat human diseases, but its phages in the human gut remain unexplored. Here, we report its first gut phage, Gut-P1, which we systematically screened using metagenomic sequencing, virus-like particle (VLP) sequencing, and enrichment culture from 35 fecal samples. Gut-P1 is virulent, belongs to the Douglaswolinvirus genus, and is highly prevalent in the gut (~11% prevalence); it has a genome of 79,928 bp consisting of 125 protein coding genes and displaying low sequence similarities to public L. plantarum phages. Physiochemical characterization shows that it has a short latent period and adapts to broad ranges of temperatures and pHs. Furthermore, Gut-P1 strongly inhibits the growth of L. plantarum strains at a multiplicity of infection (MOI) of 1e-6. Together, these results indicate that Gut-P1 can greatly impede the application of L. plantarum in humans. Strikingly, Gut-P1 was identified only in the enrichment culture, not in our metagenomic or VLP sequencing data nor in any public human phage databases, indicating the inefficiency of bulk sequencing in recovering low-abundance but highly prevalent phages and pointing to the unexplored hidden diversity of the human gut virome despite recent large-scale sequencing and bioinformatics efforts. IMPORTANCE As Lactiplantibacillus plantarum (previously known as Lactobacillus plantarum) is increasingly used as a probiotic to treat human gut-related diseases, its bacteriophages may pose a certain threat to their further application and should be identified and characterized more often from the human intestine. Here, we isolated and identified the first gut L. plantarum phage that is prevalent in a Chinese population. This phage, Gut-P1, is virulent and can strongly inhibit the growth of multiple L. plantarum strains at low MOIs. 
Our results also show that bulk sequencing is inefficient at recovering low-abundance but highly prevalent phages such as Gut-P1, suggesting that the hidden diversity of the human gut virome has not yet been explored. Our results call for innovative approaches to isolate and identify intestinal phages from the human gut and to rethink our current understanding of the gut virome, particularly its underestimated diversity and overestimated individual specificity.
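The multiplicity of infection (MOI) quoted in the abstract is the ratio of phage particles to host cells, so an MOI of 1e-6 corresponds to roughly one phage per million bacteria. The sketch below just works through that arithmetic; the function names and the culture size are illustrative assumptions, not values from the study.

```python
def moi(n_phage, n_cells):
    """Multiplicity of infection: phage particles per host cell."""
    return n_phage / n_cells


def phage_needed(target_moi, n_cells):
    """Number of phage particles required to reach `target_moi`."""
    return target_moi * n_cells


# Hypothetical culture of 1e8 bacterial cells (not a figure from the paper).
cells = 1e8
needed = phage_needed(1e-6, cells)
print(f"Phage particles for MOI 1e-6 in {cells:.0e} cells: about {needed:.0f}")
print(f"Round trip MOI: {moi(needed, cells):.1e}")
```

At such a low ratio almost every cell is initially uninfected, so strong growth inhibition implies that the phage amplifies through repeated rounds of infection.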
Affiliation(s)
- Xueyang Zhao
  - College of Life Science, Henan Normal University, Xinxiang, Henan, China
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Chuqing Sun
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Menglu Jin
  - College of Life Science, Henan Normal University, Xinxiang, Henan, China
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Jingchao Chen
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Lulu Xing
  - College of Life Science, Henan Normal University, Xinxiang, Henan, China
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Jin Yan
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Hailei Wang
  - College of Life Science, Henan Normal University, Xinxiang, Henan, China
- Zhi Liu
  - Department of Biotechnology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
- Wei-Hua Chen
  - Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center for Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
  - Institution of Medical Artificial Intelligence, Binzhou Medical University, Yantai, China
9
Zhang T, Jia Y, Li H, Xu D, Zhou J, Wang G. CRISPRCasStack: a stacking strategy-based ensemble learning framework for accurate identification of Cas proteins. Brief Bioinform 2022; 23:6674167. [PMID: 35998924 DOI: 10.1093/bib/bbac335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2022] [Revised: 07/13/2022] [Accepted: 07/23/2022] [Indexed: 11/12/2022] Open
Abstract
The CRISPR-Cas system is an adaptive immune system found in most bacteria and archaea that defends against invasion by exogenous genes. One of the most critical steps in exploring and classifying novel CRISPR-Cas systems and their functional diversity is the identification of Cas proteins in those systems. The discovery of novel Cas proteins has also laid the foundation for technologies such as CRISPR-Cas-based gene editing and gene therapy. Currently, accurate and efficient screening of Cas proteins from metagenomic and proteomic sequences remains a challenge. For Cas proteins with low sequence conservation, existing homology-based tools for Cas protein identification cannot guarantee accuracy and efficiency. In this paper, we have developed a novel stacking-based ensemble learning framework for Cas protein identification, called CRISPRCasStack. In particular, we applied the SHAP (SHapley Additive exPlanations) method to analyze the features used in CRISPRCasStack. Extensive experimental validation and independent testing have demonstrated that CRISPRCasStack can address the accuracy deficiencies and inefficiencies of the existing state-of-the-art tools. We also provide a toolkit to accurately identify and analyze potential Cas proteins, Cas operons, CRISPR arrays and CRISPR-Cas loci in prokaryotic sequences. The CRISPRCasStack toolkit is available at https://github.com/yrjia1015/CRISPRCasStack.
Affiliation(s)
- Tianjiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Yuran Jia
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Hongfei Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Dali Xu
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Jie Zhou
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, 150040, China
|
10
|
Call SN, Andrews LB. CRISPR-Based Approaches for Gene Regulation in Non-Model Bacteria. Front Genome Ed 2022; 4:892304. [PMID: 35813973 PMCID: PMC9260158 DOI: 10.3389/fgeed.2022.892304] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 03/08/2022] [Accepted: 04/11/2022] [Indexed: 01/08/2023]
Abstract
CRISPR interference (CRISPRi) and CRISPR activation (CRISPRa) have become ubiquitous approaches to control gene expression in bacteria due to their simple design and effectiveness. By regulating transcription of a target gene(s), CRISPRi/a can dynamically engineer cellular metabolism, implement transcriptional regulation circuitry, or elucidate genotype-phenotype relationships from smaller targeted libraries up to whole genome-wide libraries. While CRISPRi/a has been primarily established in the model bacteria Escherichia coli and Bacillus subtilis, a growing number of studies have demonstrated the extension of these tools to other species of bacteria (here broadly referred to as non-model bacteria). In this mini-review, we discuss the challenges that contribute to the slower creation of CRISPRi/a tools in diverse, non-model bacteria and summarize the current state of these approaches across bacterial phyla. We find that despite the potential difficulties in establishing novel CRISPRi/a in non-model microbes, over 190 recent examples across eight bacterial phyla have been reported in the literature. Most studies have focused on tool development or used these CRISPRi/a approaches to interrogate gene function, with fewer examples applying CRISPRi/a gene regulation for metabolic engineering or high-throughput screens and selections. To date, most CRISPRi/a reports have been developed for common strains of non-model bacterial species, suggesting barriers remain to establish these genetic tools in undomesticated bacteria. More efficient and generalizable methods will help realize the immense potential of programmable CRISPR-based transcriptional control in diverse bacteria.
Affiliation(s)
- Stephanie N. Call
- Department of Chemical Engineering, University of Massachusetts Amherst, Amherst, MA, United States
| | - Lauren B. Andrews
- Department of Chemical Engineering, University of Massachusetts Amherst, Amherst, MA, United States
- Biotechnology Training Program, University of Massachusetts Amherst, Amherst, MA, United States
- Molecular and Cellular Biology Graduate Program, University of Massachusetts Amherst, Amherst, MA, United States
|
11
|
Ambroa A, Blasco L, López M, Pacios O, Bleriot I, Fernández-García L, González de Aledo M, Ortiz-Cartagena C, Millard A, Tomás M. Genomic Analysis of Molecular Bacterial Mechanisms of Resistance to Phage Infection. Front Microbiol 2022; 12:784949. [PMID: 35250902 PMCID: PMC8891609 DOI: 10.3389/fmicb.2021.784949] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 09/28/2021] [Accepted: 12/27/2021] [Indexed: 12/27/2022]
Abstract
To optimize phage therapy, we need to understand how bacteria evolve against phage attacks. One of the main problems of phage therapy is the appearance of resistant bacterial variants. Genomics is increasingly developed and used in clinical laboratories to track antimicrobial resistance. For that reason, as phage therapy emerges, it is important to detect and avoid phage-resistant strains, a problem that can be addressed by analyzing the metadata provided by whole-genome sequencing. Here, we identified genes associated with phage resistance in 18 Acinetobacter baumannii clinical strains belonging to the ST-2 clonal complex over a decade (Ab2000 vs. 2010): 9 from 2000 and 9 from 2010. The presence of genes putatively associated with phage resistance was detected. The genes detected were associated with an abortive infection system, a restriction-modification system, defense systems of unknown function, and the CRISPR-Cas system. Between 118 and 171 genes were found in the 18 clinical strains. On average, 26% of these genes were detected inside genomic islands in the 2000 strains and 32% in the 2010 strains. Furthermore, 38 potential CRISPR arrays were found in 17 of the 18 strains, as well as 705 proteins associated with CRISPR-Cas systems. A moderately higher presence of these genes was found in the 2010 strains than in the 2000 strains, especially those related to the restriction-modification system and the CRISPR-Cas system. These genes were also found in genomic islands at a higher rate in the 2010 strains than in the 2000 strains. Whole-genome sequencing and bioinformatics could be powerful tools to avoid drawbacks when a personalized therapy is applied. In this study, they allowed us to characterize phage resistance in A. baumannii clinical strains, which could help prevent the failure of possible phage therapy.
Affiliation(s)
- Antón Ambroa
- Microbiology Department-Research Institute Biomedical A Coruña (INIBIC), Hospital A Coruña (CHUAC), University of A Coruña (UDC), A Coruña, Spain
- Study Group on Mechanisms of Action and Resistance to Antimicrobials (GEMARA) the Behalf of the Spanish Society of Infectious Diseases and Clinical Microbiology (SEIMC), Madrid, Spain
| | - Lucia Blasco
- Microbiology Department-Research Institute Biomedical A Coruña (INIBIC), Hospital A Coruña (CHUAC), University of A Coruña (UDC), A Coruña, Spain
- Study Group on Mechanisms of Action and Resistance to Antimicrobials (GEMARA) the Behalf of the Spanish Society of Infectious Diseases and Clinical Microbiology (SEIMC), Madrid, Spain
| | - María López
- Microbiology Department-Research Institute Biomedical A Coruña (INIBIC), Hospital A Coruña (CHUAC), University of A Coruña (UDC), A Coruña, Spain
- Study Group on Mechanisms of Action and Resistance to Antimicrobials (GEMARA) the Behalf of the Spanish Society of Infectious Diseases and Clinical Microbiology (SEIMC), Madrid, Spain
- Spanish Network for Research in Infectious Diseases (REIPI), Infectious Diseases Network Biomedical Research Center (CIBERINFEC), Carlos III Health Institute, Madrid, Spain
| | - Olga Pacios
- Microbiology Department-Research Institute Biomedical A Coruña (INIBIC), Hospital A Coruña (CHUAC), University of A Coruña (UDC), A Coruña, Spain
- Study Group on Mechanisms of Action and Resistance to Antimicrobials (GEMARA) the Behalf of the Spanish Society of Infectious Diseases and Clinical Microbiology (SEIMC), Madrid, Spain
| | - Inés Bleriot
- Microbiology Department-Research Institute Biomedical A Coruña (INIBIC), Hospital A Coruña (CHUAC), University of A Coruña (UDC), A Coruña, Spain
- Study Group on Mechanisms of Action and Resistance to Antimicrobials (GEMARA) the Behalf of the Spanish Society of Infectious Diseases and Clinical Microbiology (SEIMC), Madrid, Spain
| | - Laura Fernández-García
- Microbiology Department-Research Institute Biomedical A Coruña (INIBIC), Hospital A Coruña (CHUAC), University of A Coruña (UDC), A Coruña, Spain
- Study Group on Mechanisms of Action and Resistance to Antimicrobials (GEMARA) the Behalf of the Spanish Society of Infectious Diseases and Clinical Microbiology (SEIMC), Madrid, Spain
| | - Manuel González de Aledo
- Microbiology Department-Research Institute Biomedical A Coruña (INIBIC), Hospital A Coruña (CHUAC), University of A Coruña (UDC), A Coruña, Spain
| | - Concha Ortiz-Cartagena
- Microbiology Department-Research Institute Biomedical A Coruña (INIBIC), Hospital A Coruña (CHUAC), University of A Coruña (UDC), A Coruña, Spain
- Study Group on Mechanisms of Action and Resistance to Antimicrobials (GEMARA) the Behalf of the Spanish Society of Infectious Diseases and Clinical Microbiology (SEIMC), Madrid, Spain
| | - Andrew Millard
- Department of Genetics and Genome Biology, University of Leicester, Leicester, United Kingdom
| | - María Tomás
- Microbiology Department-Research Institute Biomedical A Coruña (INIBIC), Hospital A Coruña (CHUAC), University of A Coruña (UDC), A Coruña, Spain
- Study Group on Mechanisms of Action and Resistance to Antimicrobials (GEMARA) the Behalf of the Spanish Society of Infectious Diseases and Clinical Microbiology (SEIMC), Madrid, Spain
- Spanish Network for Research in Infectious Diseases (REIPI), Infectious Diseases Network Biomedical Research Center (CIBERINFEC), Carlos III Health Institute, Madrid, Spain
|
12
|
Yang B, Ding M, Chen Y, Han F, Yang C, Zhao J, Malard P, Stanton C, Ross RP, Zhang H, Chen W. Development of gut microbiota and bifidobacterial communities of neonates in the first 6 weeks and their inheritance from mother. Gut Microbes 2022; 13:1-13. [PMID: 33847206 PMCID: PMC8049200 DOI: 10.1080/19490976.2021.1908100] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 02/07/2023]
Abstract
Microbiota, especially Bifidobacterium, play an important role in adjusting and maintaining homeostatic balance within the infant intestine. The aim of this study was to elucidate the relationship between maternal and infant gut microbiota and identify the Bifidobacterium species that may transfer from mother to infant over the first 42 days of the infant's life. Fecal samples from nineteen mother-infant pairs were collected, and the diversity and composition of the total bacterial and Bifidobacterium communities were analyzed via 16S rDNA and bifidobacterial groEL gene high-throughput sequencing. The results revealed that the relative abundance of Bifidobacterium was significantly higher in the infant gut, while Parabacteroides, Blautia, Coprococcus, Lachnospira and Faecalibacterium were at lower relative abundance in 7-day and 42-day infant fecal samples compared to the maternal samples. The maternal gut had more B. pseudocatenulatum. In the infant group, B. breve and B. dentium relative abundance increased while B. animalis subsp. lactis decreased from days 7 to 42. Additionally, B. longum subsp. longum isolated from FGZ16 and FGZ35 may have transferred from mother to infant and colonized the infant gut. The results of the current study provide insight into the infant gut microbiota composition and structure during the first 42 days and may help guide Bifidobacterium supplementation strategies in mothers and infants.
Affiliation(s)
- Bo Yang
- State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi, China
- School of Food Science and Technology, Jiangnan University, Wuxi, China
- International Joint Research Laboratory for Pharmabiotics & Antibiotic Resistance, Jiangnan University, Wuxi, China
| | - Mengfan Ding
- State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi, China
- School of Food Science and Technology, Jiangnan University, Wuxi, China
| | - Yingqi Chen
- State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi, China
- School of Food Science and Technology, Jiangnan University, Wuxi, China
| | - Fengzhen Han
- Department of Gynaecology and Obstetrics, Guangdong Province People’s Hospital, Guangdong Academy of Medical Science, Guangzhou, China
| | - Chunyan Yang
- Department of Gynaecology and Obstetrics, Guangdong Province People’s Hospital, Guangdong Academy of Medical Science, Guangzhou, China
| | - Jianxin Zhao
- State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi, China
- School of Food Science and Technology, Jiangnan University, Wuxi, China
- National Engineering Research Center for Functional Food, Jiangnan University, Wuxi, China
| | - Patrice Malard
- Biostime (Guangzhou) Health Products Ltd., Guangzhou, China
| | - Catherine Stanton
- International Joint Research Laboratory for Pharmabiotics & Antibiotic Resistance, Jiangnan University, Wuxi, China
- Food Bioscience, Teagasc Food Research Centre, Fermoy, Ireland
- Contact: Catherine Stanton, Teagasc Food Research Centre, Fermoy, Ireland
| | - R. Paul Ross
- International Joint Research Laboratory for Pharmabiotics & Antibiotic Resistance, Jiangnan University, Wuxi, China
- APC Microbiome Ireland, University College Cork, Cork, Ireland
| | - Hao Zhang
- State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi, China
- School of Food Science and Technology, Jiangnan University, Wuxi, China
- National Engineering Research Center for Functional Food, Jiangnan University, Wuxi, China
- Wuxi Translational Medicine Research Center and Jiangsu Translational Medicine Research Institute Wuxi Branch, Wuxi, China
| | - Wei Chen
- State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi, China
- School of Food Science and Technology, Jiangnan University, Wuxi, China
- National Engineering Research Center for Functional Food, Jiangnan University, Wuxi, China
- Wei Chen, School of Food Science and Technology, Jiangnan University, Wuxi 214122, China
|
13
|
Yang S, Huang J, He B. CASPredict: a web service for identifying Cas proteins. PeerJ 2021; 9:e11887. [PMID: 34395100 PMCID: PMC8327967 DOI: 10.7717/peerj.11887] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 11/26/2020] [Accepted: 07/09/2021] [Indexed: 12/16/2022]
Abstract
Clustered regularly interspaced short palindromic repeats (CRISPR) and their associated (Cas) proteins constitute the CRISPR-Cas systems, which play a key role in the prokaryotic adaptive immune system against invasive foreign elements. In recent years, CRISPR-Cas systems have also been designed to facilitate target gene editing in eukaryotic genomes. As one of the important components of the CRISPR-Cas system, the Cas protein plays an irreplaceable role. The effector module composed of Cas proteins is used to distinguish the type of CRISPR-Cas system. Effective prediction and identification of Cas proteins can help biologists further infer the type of CRISPR-Cas system. Moreover, class 2 CRISPR-Cas systems are increasingly being applied in the field of genome editing. The discovery of Cas proteins will help provide more candidates for genome editing. In this paper, we describe a web service named CASPredict (http://i.uestc.edu.cn/caspredict/cgi-bin/CASPredict.pl) for identifying Cas proteins. CASPredict first predicts Cas proteins with a support vector machine (SVM) using the optimal dipeptide composition and then annotates the function of Cas proteins with the hmmscan search algorithm. The ten-fold cross-validation results showed that 84.84% of Cas proteins were correctly classified. CASPredict will be a useful tool for the identification of Cas proteins, or at least can play a complementary role to the existing methods in this area.
Affiliation(s)
- Shanshan Yang
- Medical College, Guizhou University, Guiyang, Guizhou Province, China
| | - Jian Huang
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan Province, China
| | - Bifang He
- Medical College, Guizhou University, Guiyang, Guizhou Province, China
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, Sichuan Province, China
|
14
|
Dvorkina T, Bankevich A, Sorokin A, Yang F, Adu-Oppong B, Williams R, Turner K, Pevzner PA. ORFograph: search for novel insecticidal protein genes in genomic and metagenomic assembly graphs. Microbiome 2021; 9:149. [PMID: 34183047 PMCID: PMC8240309 DOI: 10.1186/s40168-021-01092-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/26/2021] [Accepted: 05/11/2021] [Indexed: 05/07/2023]
Abstract
BACKGROUND Since the prolonged use of insecticidal proteins has led to toxin resistance, it is important to search for novel insecticidal protein genes (IPGs) that are effective in controlling resistant insect populations. IPGs are usually encoded in the genomes of entomopathogenic bacteria, especially in large plasmids in strains of the ubiquitous soil bacterium Bacillus thuringiensis (Bt). Since there are often multiple similar IPGs encoded by such plasmids, their assemblies are typically fragmented and many IPGs are scattered across multiple contigs. As a result, existing gene prediction tools (which analyze individual contigs) typically predict partial rather than complete IPGs, making it difficult to conduct downstream IPG engineering efforts in agricultural genomics. METHODS Although it is difficult to assemble IPGs in a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding a single IPG. RESULTS We describe ORFograph, a pipeline for predicting IPGs in assembly graphs, benchmark it on (meta)genomic datasets, and discover nearly a hundred novel IPGs. This work shows that graph-aware gene prediction tools enable the discovery of a greater diversity of IPGs from (meta)genomes. CONCLUSIONS We demonstrated that analysis of the assembly graphs reveals novel candidate IPGs. ORFograph identified both already known genes "hidden" in assembly graphs and potential novel IPGs that evaded existing tools for IPG identification. As ORFograph is fast, one could imagine a pipeline that processes many (meta)genomic assembly graphs to identify, for phenotypic testing, even more novel IPGs that were previously inaccessible to traditional gene-finding methods. While here we demonstrated the results of ORFograph only for IPGs, the proposed approach can be generalized to any class of genes.
Affiliation(s)
- Tatiana Dvorkina
- Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
| | - Anton Bankevich
- Department of Computer Science and Engineering, University of California San Diego, San Diego, CA USA
| | - Alexei Sorokin
- Université Paris-Saclay, INRAE, Micalis Institute, AgroParisTech, 78350 Jouy-en-Josas, France
| | - Fan Yang
- Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO USA
- Ascus Biosciences, San Diego, CA USA
| | - Boahemaa Adu-Oppong
- Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO USA
- Thermo Fisher Scientific, Carlsbad, CA USA
| | - Ryan Williams
- Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO USA
| | - Keith Turner
- Data Science & Analytics, Bayer U.S. - Crop Science, Chesterfield, MO USA
| | - Pavel A. Pevzner
- Department of Computer Science and Engineering, University of California San Diego, San Diego, CA USA
|
15
|
The Development of Bacteriophage Resistance in Vibrio alginolyticus Depends on a Complex Metabolic Adaptation Strategy. Viruses 2021; 13:v13040656. [PMID: 33920240 PMCID: PMC8069663 DOI: 10.3390/v13040656] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 12/08/2020] [Revised: 04/07/2021] [Accepted: 04/08/2021] [Indexed: 12/23/2022]
Abstract
Lytic bacteriophages have been well documented to play a pivotal role in microbial ecology due to their complex interactions with bacterial species, especially in aquatic habitats. Although the use of phages as antimicrobial agents, known as phage therapy, in the aquatic environment has been increasing, recent research has revealed drawbacks due to the development of phage-resistant strains among Gram-negative species. Acquired phage resistance in marine Vibrios has been proven to be a very complicated process utilizing biochemical, metabolic, and molecular adaptation strategies. The results of our multi-omics approach, incorporating transcriptome and metabolome analyses of Vibrio alginolyticus phage-resistant strains, corroborate this prospect. Our results provide insights into phage-tolerant strains diminishing the expression of phage receptors ompF, lamB, and btuB. The same pattern was observed for genes encoding natural nutrient channels, such as rbsA, ptsG, tryP, livH, lysE, and hisp, meaning that the cell needs to readjust its biochemistry to achieve phage resistance. The results showed reprogramming of bacterial metabolism by transcript regulations in key-metabolic pathways, such as the tricarboxylic acid cycle (TCA) and lysine biosynthesis, as well as the content of intracellular metabolites belonging to processes that could also significantly affect the cell physiology. Finally, SNP analysis in resistant strains revealed no evidence of amino acid alterations in the studied putative bacterial phage receptors, but several SNPs were detected in genes involved in transcriptional regulation. This phenomenon appears to be a phage-specific, fine-tuned metabolic engineering, imposed by the different phage genera the bacteria have interacted with, updating the role of lytic phages in microbial marine ecology.
|
16
|
Wang Y, Kang J, Li N, Zhou Y, Tang Z, He B, Huang J. NeuroCS: A Tool to Predict Cleavage Sites of Neuropeptide Precursors. Protein Pept Lett 2020; 27:337-345. [PMID: 31721688 DOI: 10.2174/0929866526666191112150636] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/01/2019] [Revised: 07/16/2019] [Accepted: 09/24/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Neuropeptides are a class of bioactive peptides produced from neuropeptide precursors through a series of extremely complex processes, mediating neuronal regulation in many aspects. Accurate identification of cleavage sites of neuropeptide precursors is of great significance for the development of neuroscience and brain science. OBJECTIVE With the explosive growth of neuropeptide precursor data, there is a pressing need for bioinformatics methods that predict neuropeptide precursors' cleavage sites quickly and efficiently. METHODS We started by processing the neuropeptide precursor data from SwissProt and NeuroPedia into two datasets, a training dataset and a testing dataset. Subsequently, six feature extraction schemes were applied to generate different feature sets, and feature selection methods were then used to find the optimal feature subset of each. Thereafter, a support vector machine was used to build models for the different feature types. Finally, the performance of the models was evaluated with the independent testing dataset. RESULTS Six models were built with the support vector machine. Among them, the enhanced amino acid composition-based model reached the highest accuracy of 91.60% in 5-fold cross-validation. When evaluated with the independent testing dataset, it also performed well, with an accuracy of 90.37% and an area under the receiver operating characteristic curve of 0.9576. CONCLUSION The performance of the developed model was decent. Moreover, for users' convenience, an online web server called NeuroCS was built, which is freely available at http://i.uestc.edu.cn/NeuroCS/dist/index.html#/. NeuroCS can be used to predict neuropeptide precursors' cleavage sites effectively.
Affiliation(s)
- Ying Wang
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Juanjuan Kang
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Ning Li
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Yuwei Zhou
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Zhongjie Tang
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Bifang He
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Medical College, Guizhou University, Guiyang, China
| | - Jian Huang
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
|
17
|
Padilha VA, Alkhnbashi OS, Shah SA, de Carvalho ACPLF, Backofen R. CRISPRcasIdentifier: Machine learning for accurate identification and classification of CRISPR-Cas systems. Gigascience 2020; 9:giaa062. [PMID: 32556168 PMCID: PMC7298778 DOI: 10.1093/gigascience/giaa062] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 03/17/2020] [Revised: 04/27/2020] [Accepted: 05/15/2020] [Indexed: 12/26/2022]
Abstract
BACKGROUND CRISPR-Cas genes are extraordinarily diverse and evolve rapidly when compared to other prokaryotic genes. With the rapid increase in newly sequenced archaeal and bacterial genomes, manual identification of CRISPR-Cas systems is no longer viable. Thus, an automated approach is required for advancing our understanding of the evolution and diversity of these systems and for finding new candidates for genome engineering in eukaryotic models. RESULTS We introduce CRISPRcasIdentifier, a new machine learning-based tool that combines regression and classification models for the prediction of potentially missing proteins in instances of CRISPR-Cas systems and the prediction of their respective subtypes. In contrast to other available tools, CRISPRcasIdentifier can both detect cas genes and extract potential association rules that reveal functional modules for CRISPR-Cas systems. In our experimental benchmark on the most recently published and comprehensive CRISPR-Cas system dataset, CRISPRcasIdentifier was compared with recent and state-of-the-art tools. According to the experimental results, CRISPRcasIdentifier presented the best Cas protein identification and subtype classification performance. CONCLUSIONS Overall, our tool greatly extends the classification of CRISPR cassettes and, for the first time, predicts missing Cas proteins and association rules between Cas proteins. Additionally, we investigated the properties of CRISPR subtypes. The proposed tool relies not only on the knowledge of manual CRISPR annotation but also on models trained using machine learning.
Affiliation(s)
- Victor A Padilha
- Institute of Mathematics and Computer Sciences, University of São Paulo, Av. Trabalhador São Carlense 400, São Carlos, SP, 13566-590, Brazil
| | - Omer S Alkhnbashi
- Bioinformatics Group, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany
| | - Shiraz A Shah
- COPSAC, Copenhagen University Hospitals Herlev and Gentofte, Ledreborg Alle 34, DK-2820 Gentofte, Denmark
| | - André C P L F de Carvalho
- Institute of Mathematics and Computer Sciences, University of São Paulo, Av. Trabalhador São Carlense 400, São Carlos, SP, 13566-590, Brazil
| | - Rolf Backofen
- Bioinformatics Group, University of Freiburg, Georges-Köhler-Allee 106, 79110 Freiburg, Germany
- Signalling Research Centres BIOSS and CIBSS, University of Freiburg, Schaenzlestr. 18, 79104 Freiburg, Germany
|
18
|
Pourcel C, Touchon M, Villeriot N, Vernadet JP, Couvin D, Toffano-Nioche C, Vergnaud G. CRISPRCasdb a successor of CRISPRdb containing CRISPR arrays and cas genes from complete genome sequences, and tools to download and query lists of repeats and spacers. Nucleic Acids Res 2020; 48:D535-D544. [PMID: 31624845 PMCID: PMC7145573 DOI: 10.1093/nar/gkz915] [Citation(s) in RCA: 72] [Impact Index Per Article: 14.4] [Received: 08/03/2019] [Revised: 09/20/2019] [Accepted: 10/04/2019] [Indexed: 12/28/2022]
Abstract
In Archaea and Bacteria, the arrays called CRISPRs for 'clustered regularly interspaced short palindromic repeats' and the CRISPR associated genes or cas provide adaptive immunity against viruses, plasmids and transposable elements. Short sequences called spacers, corresponding to fragments of invading DNA, are stored in-between repeated sequences. The CRISPR-Cas systems target sequences homologous to spacers, leading to their degradation. To facilitate investigations of CRISPRs, we developed a website holding the CRISPRdb 12 years ago. We now propose CRISPRCasdb, a completely new version giving access to both CRISPRs and cas genes. We used CRISPRCasFinder, a program that identifies CRISPR arrays and cas genes and determines the system's type and subtype, to process public whole genome assemblies. Strains are displayed either in an alphabetic list or in taxonomic order. The database is part of the CRISPR-Cas++ website, which also offers the possibility to analyse submitted sequences and to download programs. A BLAST search against lists of repeats and spacers extracted from the database is proposed. To date, 16 990 complete prokaryote genomes (16 650 bacteria from 2973 species and 340 archaea from 300 species) are included. CRISPR-Cas systems were found in 36% of Bacteria and 75% of Archaea strains. CRISPRCasdb is freely accessible at https://crisprcas.i2bc.paris-saclay.fr/.
Affiliation(s)
- Christine Pourcel
  - Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Univ. Paris-Sud, Université Paris-Saclay, 91198 Gif-sur-Yvette, France
- Marie Touchon
  - Microbial Evolutionary Genomics, Institut Pasteur, 25-28 rue du Docteur Roux, 75015 Paris, France
  - CNRS, UMR3525, 25-28 rue du Docteur Roux, 75015 Paris, France
- Nicolas Villeriot
  - Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Univ. Paris-Sud, Université Paris-Saclay, 91198 Gif-sur-Yvette, France
- Jean-Philippe Vernadet
  - Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Univ. Paris-Sud, Université Paris-Saclay, 91198 Gif-sur-Yvette, France
- David Couvin
  - Unité Transmission, Réservoir et Diversité des Pathogènes, Institut Pasteur de Guadeloupe, 97139 Les Abymes, France
- Claire Toffano-Nioche
  - Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Univ. Paris-Sud, Université Paris-Saclay, 91198 Gif-sur-Yvette, France
- Gilles Vergnaud
  - Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Univ. Paris-Sud, Université Paris-Saclay, 91198 Gif-sur-Yvette, France
|
20
|
Jiang L, Yu M, Zhou Y, Tang Z, Li N, Kang J, He B, Huang J. AGONOTES: A Robot Annotator for Argonaute Proteins. Interdiscip Sci 2019; 12:109-116. [PMID: 31741225 DOI: 10.1007/s12539-019-00349-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 06/25/2019] [Revised: 10/06/2019] [Accepted: 10/30/2019] [Indexed: 12/01/2022]
Abstract
The argonaute protein (Ago) exists in almost all organisms. In eukaryotes, it functions as a regulatory system for gene expression. In prokaryotes, it is a type of defense system against foreign invasive genomes. The Ago system has been engineered for gene silencing and genome editing and plays an important role in biological studies. With an increasing number of genomes and proteomes of various microbes becoming available, computational tools for identifying and annotating argonaute proteins are urgently needed. We introduce AGONOTES (Argonaute Notes), a web service specifically designed for identifying and annotating Ago. AGONOTES uses the BLASTP similarity search algorithm to categorize all submitted proteins into three groups: prokaryotic argonaute protein (pAgo), eukaryotic argonaute protein (eAgo), and non-argonaute protein (non-Ago). Argonaute proteins can then be aligned to the corresponding standard set of Ago sequences using the multiple sequence alignment program MUSCLE. All functional domains of Ago can further be curated from the alignment results and visualized easily through the Bio::Graphics modules in the BioPerl bundle. Compared with existing tools such as CD-Search and available databases such as UniProt, AGONOTES showed much better performance on domain annotation, which is fundamental to studying new Ago proteins. AGONOTES can be freely accessed at http://i.uestc.edu.cn/agonotes/. AGONOTES is a user-friendly tool for annotating Ago domains from a proteome or a series of protein sequences.
Affiliation(s)
- Lixu Jiang
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 637111, China
- Min Yu
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 637111, China
- Yuwei Zhou
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 637111, China
- Zhongjie Tang
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 637111, China
- Ning Li
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 637111, China
- Juanjuan Kang
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 637111, China
- Bifang He
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 637111, China
  - School of Medicine, Guizhou University, Guiyang, China
- Jian Huang
  - Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 637111, China
|
21
|
Li SH, Guan ZX, Zhang D, Zhang ZM, Huang J, Yang W, Lin H. Recent Advancement in Predicting Subcellular Localization of Mycobacterial Protein with Machine Learning Methods. Med Chem 2019; 16:605-619. [PMID: 31584379 DOI: 10.2174/1573406415666191004101913] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/01/2019] [Revised: 06/25/2019] [Accepted: 08/23/2019] [Indexed: 01/28/2023]
Abstract
Mycobacterium tuberculosis (MTB) causes tuberculosis (TB), which is reported as one of the most dreadful epidemics. Although many biochemical molecular drugs have been developed to cope with this disease, drug resistance, especially multidrug resistance (MDR) and extensive drug resistance (XDR), poses a huge threat to treatment. Moreover, traditional biochemical experimental methods for tackling TB are time-consuming and costly. With the availability of enormous genomic and proteomic sequence data, TB can be addressed through sequence-based computational approaches, that is, bioinformatics. Studies on predicting the subcellular localization of mycobacterial proteins (MBPs) with high precision and efficiency may help determine the biological function of these proteins and provide useful insights for protein function annotation as well as drug design. In this review, we report the progress that has been made in computational prediction of the subcellular localization of MBPs, covering the following aspects: 1) Construction of benchmark datasets. 2) Methods of feature extraction. 3) Techniques of feature selection. 4) Application of several published prediction algorithms. 5) The published results. 6) Further study on prediction of the subcellular localization of MBPs.
Affiliation(s)
- Shi-Hao Li
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Zheng-Xing Guan
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Dan Zhang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Zi-Mei Zhang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Jian Huang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
- Wuritu Yang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
  - Development and Planning Department, Inner Mongolia University, Hohhot, P.R. China
- Hao Lin
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
|
22
|
Tang Z, Chen S, Chen A, He B, Zhou Y, Chai G, Guo F, Huang J. CasPDB: an integrated and annotated database for Cas proteins from bacteria and archaea. Database (Oxford) 2019; 2019:5549733. [PMID: 31411686 PMCID: PMC6693189 DOI: 10.1093/database/baz093] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 11/26/2018] [Revised: 05/01/2019] [Accepted: 06/21/2019] [Indexed: 12/04/2022]
Abstract
Clustered regularly interspaced short palindromic repeats (CRISPR) and associated proteins (Cas) constitute CRISPR–Cas systems, which are antiphage immune systems present in numerous bacterial and most archaeal species. In recent years, CRISPR–Cas systems have been developed into reliable and powerful genome editing tools. Nevertheless, finding similar or better tools from bacteria or archaea remains crucial. This requires the exploration of different CRISPR systems and the identification and characterization of new Cas proteins. Archives tailored for Cas proteins are urgently needed and necessitate the prediction and grouping of Cas proteins into an information center with all available experimental evidence. Here, we constructed the Cas Protein Data Bank (CasPDB), an integrated and annotated online database for Cas proteins from bacteria and archaea. The CasPDB database contains 287 reviewed Cas proteins, 257 745 putative Cas proteins and 3593 Cas operons from 32 023 bacteria species and 1802 archaea species. The database can be freely browsed and searched. The CasPDB web interface also presents all 3593 putative Cas operons and their components. Among these operons, 328 are members of the type II CRISPR–Cas system.
Affiliation(s)
- Zhongjie Tang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 611731, China
- ShaoQi Chen
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Ang Chen
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Bifang He
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 611731, China
  - School of Medicine, Guizhou University, Guiyang 550025, China
- Yuwei Zhou
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Guoshi Chai
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 611731, China
- FengBiao Guo
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 611731, China
- Jian Huang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 611731, China
|
23
|
Couvin D, Bernheim A, Toffano-Nioche C, Touchon M, Michalik J, Néron B, Rocha EPC, Vergnaud G, Gautheret D, Pourcel C. CRISPRCasFinder, an update of CRISRFinder, includes a portable version, enhanced performance and integrates search for Cas proteins. Nucleic Acids Res 2018; 46:W246-W251. [PMID: 29790974 PMCID: PMC6030898 DOI: 10.1093/nar/gky425] [Citation(s) in RCA: 913] [Impact Index Per Article: 130.4] [Received: 01/31/2018] [Accepted: 05/09/2018] [Indexed: 12/25/2022] Open
Abstract
CRISPR (clustered regularly interspaced short palindromic repeats) arrays and their associated (Cas) proteins confer on bacteria and archaea adaptive immunity against exogenous mobile genetic elements, such as phages or plasmids. CRISPRCasFinder allows the identification of both CRISPR arrays and Cas proteins. The program includes: (i) an improved CRISPR array detection tool facilitating expert validation based on a rating system, (ii) prediction of CRISPR orientation and (iii) a Cas protein detection and typing tool updated to match the latest classification scheme of these systems. CRISPRCasFinder can be used either online or as a standalone tool compatible with the Linux operating system. All third-party software packages employed by the program are freely available. CRISPRCasFinder is available at https://crisprcas.i2bc.paris-saclay.fr.
Affiliation(s)
- David Couvin
  - Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Univ. Paris-Sud, Université Paris-Saclay, 91198 Gif-sur-Yvette, France
- Aude Bernheim
  - Microbial Evolutionary Genomics, Institut Pasteur, 25-28 rue du Docteur Roux, 75015, Paris, France
  - CNRS, UMR3525, 25-28 rue du Docteur Roux, 75015, Paris, France
- Claire Toffano-Nioche
  - Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Univ. Paris-Sud, Université Paris-Saclay, 91198 Gif-sur-Yvette, France
- Marie Touchon
  - Microbial Evolutionary Genomics, Institut Pasteur, 25-28 rue du Docteur Roux, 75015, Paris, France
  - CNRS, UMR3525, 25-28 rue du Docteur Roux, 75015, Paris, France
- Juraj Michalik
  - Université Lille 1, CRIStAL, équipe Bonsai, Cité Scientifique Bat M3, 59655 Villeneuve d'Ascq Cedex, France
- Bertrand Néron
  - Bioinformatics and Biostatistics Hub - C3BI, USR 3756 IP CNRS - Paris, Institut Pasteur, 25-28 rue du Docteur Roux, 75015, France
- Eduardo P C Rocha
  - Microbial Evolutionary Genomics, Institut Pasteur, 25-28 rue du Docteur Roux, 75015, Paris, France
  - CNRS, UMR3525, 25-28 rue du Docteur Roux, 75015, Paris, France
- Gilles Vergnaud
  - Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Univ. Paris-Sud, Université Paris-Saclay, 91198 Gif-sur-Yvette, France
- Daniel Gautheret
  - Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Univ. Paris-Sud, Université Paris-Saclay, 91198 Gif-sur-Yvette, France
- Christine Pourcel
  - Institute for Integrative Biology of the Cell (I2BC), CEA, CNRS, Univ. Paris-Sud, Université Paris-Saclay, 91198 Gif-sur-Yvette, France
|
24
|
You Q, Zhong Z, Ren Q, Hassan F, Zhang Y, Zhang T. CRISPRMatch: An Automatic Calculation and Visualization Tool for High-throughput CRISPR Genome-editing Data Analysis. Int J Biol Sci 2018; 14:858-862. [PMID: 29989077 PMCID: PMC6036748 DOI: 10.7150/ijbs.24581] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Received: 12/26/2017] [Accepted: 02/28/2018] [Indexed: 01/05/2023] Open
Abstract
Custom-designed nucleases, including CRISPR-Cas9 and CRISPR-Cpf1, are widely used for precise genome editing. Its high coverage, low cost and quantifiability make high-throughput sequencing (NGS) an effective method for assessing the efficiency of custom-designed nucleases. However, in contrast to standardized transcriptome protocols, NGS genome-editing data lack a user-friendly pipeline that connects different tools to automatically calculate mutations, evaluate editing efficiency and produce a more comprehensive dataset that can be visualized. Here, we have developed an automatic stand-alone toolkit based on Python scripts, named CRISPRMatch, to process high-throughput genome-editing data from CRISPR nuclease-transformed protoplasts by integrating analysis steps such as mapping reads and normalizing read counts, calculating mutation frequency (deletion and insertion), evaluating the efficiency and accuracy of genome editing, and visualizing the results (tables and figures). Both CRISPR-Cas9 and CRISPR-Cpf1 nucleases are supported by the CRISPRMatch toolkit, and the integrated code has been released on GitHub (https://github.com/zhangtaolab/CRISPRMatch).
Affiliation(s)
- Qi You
  - Jiangsu Key Laboratory of Crop Genetics and Physiology, Co-Innovation Centre for Modern Production Technology of Grain Crops, Key Laboratory of Plant Functional Genomics of the Ministry of Education, Yangzhou University, Yangzhou 225009, China
  - Joint International Research Laboratory of Agriculture and Agri-Product Safety, the Ministry of Education of China, Yangzhou University, Yangzhou 225009, China
- Zhaohui Zhong
  - Department of Biotechnology, School of Life Science and Technology, Centre for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Qiurong Ren
  - Department of Biotechnology, School of Life Science and Technology, Centre for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Fakhrul Hassan
  - Department of Biotechnology, School of Life Science and Technology, Centre for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yong Zhang
  - Department of Biotechnology, School of Life Science and Technology, Centre for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Tao Zhang
  - Jiangsu Key Laboratory of Crop Genetics and Physiology, Co-Innovation Centre for Modern Production Technology of Grain Crops, Key Laboratory of Plant Functional Genomics of the Ministry of Education, Yangzhou University, Yangzhou 225009, China
  - Joint International Research Laboratory of Agriculture and Agri-Product Safety, the Ministry of Education of China, Yangzhou University, Yangzhou 225009, China
|
25
|
Schmid M, Muri J, Melidis D, Varadarajan AR, Somerville V, Wicki A, Moser A, Bourqui M, Wenzel C, Eugster-Meier E, Frey JE, Irmler S, Ahrens CH. Comparative Genomics of Completely Sequenced Lactobacillus helveticus Genomes Provides Insights into Strain-Specific Genes and Resolves Metagenomics Data Down to the Strain Level. Front Microbiol 2018; 9:63. [PMID: 29441050 PMCID: PMC5797582 DOI: 10.3389/fmicb.2018.00063] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 10/31/2017] [Accepted: 01/10/2018] [Indexed: 11/20/2022] Open
Abstract
Although complete genome sequences hold particular value for an accurate description of core genomes, the identification of strain-specific genes, and as the optimal basis for functional genomics studies, they are still largely underrepresented in public repositories. Based on an assessment of the genome assembly complexity for all lactobacilli, we used Pacific Biosciences' long read technology to sequence and de novo assemble the genomes of three Lactobacillus helveticus starter strains, raising the number of completely sequenced strains to 12. The first comparative genomics study for L. helveticus, to our knowledge, identified a core genome of 988 genes and sets of unique, strain-specific genes ranging from about 30 to more than 200 genes. Importantly, the comparison of MiSeq- and PacBio-based assemblies uncovered that not only accessory but also core genes can be missed in incomplete genome assemblies based on short reads. Analysis of the three genomes revealed that a large number of pseudogenes were enriched for functional Gene Ontology categories such as amino acid transmembrane transport and carbohydrate metabolism, which is in line with a reductive genome evolution in the rich natural habitat of L. helveticus. Notably, the functional Clusters of Orthologous Groups of proteins categories “cell wall/membrane biogenesis” and “defense mechanisms” were found to be enriched among the strain-specific genes. A genome mining effort uncovered examples where an experimentally observed phenotype could be linked to the underlying genotype, such as for cell envelope proteinase PrtH3 of strain FAM8627. Another possible link identified for peptidoglycan hydrolases will require further experiments. Of note, strain FAM22155 did not harbor a CRISPR/Cas system; its loss was also observed in other L. helveticus strains and Lactobacillus species, thus questioning the value of the CRISPR/Cas system for diagnostic purposes. Importantly, the complete genome sequences proved to be very useful for the analysis of natural whey starter cultures with metagenomics, as a larger percentage of the sequenced reads of these complex mixtures could be unambiguously assigned down to the strain level.
Affiliation(s)
- Michael Schmid
  - Agroscope, Research Group Molecular Diagnostics, Genomics and Bioinformatics, Wädenswil, Switzerland
  - Swiss Institute of Bioinformatics, Wädenswil, Switzerland
- Jonathan Muri
  - Agroscope, Research Group Molecular Diagnostics, Genomics and Bioinformatics, Wädenswil, Switzerland
- Damianos Melidis
  - Agroscope, Research Group Molecular Diagnostics, Genomics and Bioinformatics, Wädenswil, Switzerland
- Adithi R Varadarajan
  - Agroscope, Research Group Molecular Diagnostics, Genomics and Bioinformatics, Wädenswil, Switzerland
  - Swiss Institute of Bioinformatics, Wädenswil, Switzerland
- Vincent Somerville
  - Agroscope, Research Group Molecular Diagnostics, Genomics and Bioinformatics, Wädenswil, Switzerland
  - Swiss Institute of Bioinformatics, Wädenswil, Switzerland
- Adrian Wicki
  - Agroscope, Research Group Molecular Diagnostics, Genomics and Bioinformatics, Wädenswil, Switzerland
- Aline Moser
  - Agroscope, Research Group Biochemistry of Milk and Microorganisms, Bern, Switzerland
- Marc Bourqui
  - Agroscope, Research Group Molecular Diagnostics, Genomics and Bioinformatics, Wädenswil, Switzerland
  - Swiss Institute of Bioinformatics, Wädenswil, Switzerland
- Claudia Wenzel
  - Agroscope, Research Group Biochemistry of Milk and Microorganisms, Bern, Switzerland
- Elisabeth Eugster-Meier
  - School of Agricultural, Forest and Food Sciences HAFL, Bern University of Applied Sciences, Zollikofen, Switzerland
- Juerg E Frey
  - Agroscope, Research Group Molecular Diagnostics, Genomics and Bioinformatics, Wädenswil, Switzerland
- Stefan Irmler
  - Agroscope, Research Group Biochemistry of Milk and Microorganisms, Bern, Switzerland
- Christian H Ahrens
  - Agroscope, Research Group Molecular Diagnostics, Genomics and Bioinformatics, Wädenswil, Switzerland
  - Swiss Institute of Bioinformatics, Wädenswil, Switzerland
|
26
|
Dao FY, Yang H, Su ZD, Yang W, Wu Y, Hui D, Chen W, Tang H, Lin H. Recent Advances in Conotoxin Classification by Using Machine Learning Methods. Molecules 2017; 22:1057. [PMID: 28672838 PMCID: PMC6152242 DOI: 10.3390/molecules22071057] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Received: 05/17/2017] [Revised: 06/12/2017] [Accepted: 06/19/2017] [Indexed: 11/16/2022] Open
Abstract
Conotoxins are disulfide-rich small peptides that are invaluable because they target ion channels and neuronal receptors. Conotoxins have been demonstrated to be potent pharmaceuticals in the treatment of a series of diseases, such as Alzheimer's disease, Parkinson's disease, and epilepsy. In addition, conotoxins are ideal molecular templates for the development of new drug lead compounds and play important roles in neurobiological research. Thus, the accurate identification of conotoxin types will provide key clues for biological research and clinical medicine. Generally, conotoxin types are confirmed when their sequence, structure, and function are experimentally validated. However, it is time-consuming and costly to acquire structural and functional information by using biochemical experiments. Therefore, it is important to develop computational tools for efficiently and effectively recognizing conotoxin types based on sequence information. In this work, we reviewed the current progress in computational identification of conotoxins in the following aspects: (i) construction of benchmark datasets; (ii) strategies for extracting sequence features; (iii) feature selection techniques; (iv) machine learning methods for classifying conotoxins; (v) the results obtained by these methods and the published tools; and (vi) future perspectives on conotoxin classification. The paper provides a basis for in-depth study of conotoxins and drug therapy research.
Affiliation(s)
- Fu-Ying Dao
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hui Yang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Zhen-Dong Su
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wuritu Yang
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Development and Planning Department, Inner Mongolia University, Hohhot 010021, China
- Yun Wu
  - College of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China
- Ding Hui
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Chen
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Department of Physics, School of Sciences, and Center for Genomics and Computational Biology, North China University of Science and Technology, Tangshan 063000, China
- Hua Tang
  - Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
- Hao Lin
  - Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
27
|
Pourcel C. [An history of the CRISPR-Cas systems discovery]. Biol Aujourdhui 2017; 211:247-254. [PMID: 29956651 DOI: 10.1051/jbio/2018001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/26/2018] [Indexed: 12/26/2022]
Abstract
From 1987 and over the following 20 years, a few research teams exploring bacterial and archaeal genome sequences uncovered the prokaryotic adaptive immune system made up of the CRISPR sequence and associated cas genes. First believed to be similar to the eukaryotic RNA interference system, CRISPR-Cas turned out to be unique and of amazing genetic complexity. Comparative studies of CRISPR arrays and of cas genes, and later of microbiota metagenomes, made it possible to propose an evolutionary scenario for these systems. The results demonstrate the importance of a naturalistic approach, without a priori assumptions, for the understanding of living organisms.
Affiliation(s)
- Christine Pourcel
  - Institut de Biologie Intégrative de la Cellule (I2BC), CEA, CNRS, Univ. Paris-Sud, Université Paris-Saclay, 91198 Gif-sur-Yvette cedex, France
|