1
Arıkan M, Atabay B. Construction of Protein Sequence Databases for Metaproteomics: A Review of the Current Tools and Databases. J Proteome Res 2024; 23:5250-5262. [PMID: 39449618 DOI: 10.1021/acs.jproteome.4c00665] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2024]
Abstract
In metaproteomics studies, constructing a reference protein sequence database that is both comprehensive and not overly large is critical for the peptide identification step. Therefore, the availability of well-curated reference databases and tools for custom database construction is essential to enhance the performance of metaproteomics analyses. In this review, we first provide an overview of metaproteomics by presenting a concise historical background, outlining a typical experimental and bioinformatics workflow, emphasizing the crucial step of constructing a protein sequence database for metaproteomics. We then delve into the current tools available for building such databases, highlighting their individual approaches, utility, and advantages and limitations. Next, we examine existing protein sequence databases, detailing their scope and relevance in metaproteomics research. Then, we provide practical recommendations for constructing protein sequence databases for metaproteomics, along with an overview of the current challenges in this area. We conclude with a discussion of anticipated advancements, emerging trends, and future directions in the construction of protein sequence databases for metaproteomics.
Affiliation(s)
- Muzaffer Arıkan
  - Biotechnology Division, Department of Biology, Faculty of Science, Istanbul University, Istanbul 34134, Türkiye
- Başak Atabay
  - Department of Biomedical Engineering, School of Engineering and Natural Sciences, Istanbul Medipol University, Istanbul 34810, Türkiye
2
Wu E, Mallawaarachchi V, Zhao J, Yang Y, Liu H, Wang X, Shen C, Lin Y, Qiao L. Contigs directed gene annotation (ConDiGA) for accurate protein sequence database construction in metaproteomics. Microbiome 2024; 12:58. [PMID: 38504332 PMCID: PMC10949615 DOI: 10.1186/s40168-024-01775-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Accepted: 02/05/2024] [Indexed: 03/21/2024]
Abstract
BACKGROUND: Microbiota are closely associated with human health and disease. Metaproteomics can provide a direct means to identify microbial proteins in microbiota for compositional and functional characterization. However, in-depth and accurate metaproteomics is still limited due to the extreme complexity and high diversity of microbiota samples. It is generally recommended to use metagenomic data from the same samples to construct the protein sequence database for metaproteomic data analysis. Although different metagenomics-based database construction strategies have been developed, an optimization of gene taxonomic annotation has not been reported, which, however, is extremely important for accurate metaproteomic analysis.
RESULTS: Herein, we proposed an accurate taxonomic annotation pipeline for genes from metagenomic data, namely contigs directed gene annotation (ConDiGA), and used the method to build a protein sequence database for metaproteomic analysis. We compared our pipeline (ConDiGA or MD3) with two other popular annotation pipelines (MD1 and MD2). In MD1, genes were directly annotated against the whole bacterial genome database; in MD2, contigs were annotated against the whole bacterial genome database and the taxonomic information of contigs was assigned to the genes; in MD3, the most confident species from the contigs annotation results were taken as reference to annotate genes. Annotation tools, including BLAST, Kaiju, and Kraken2, were compared. Based on a synthetic microbial community of 12 species, it was found that Kaiju with the MD3 pipeline outperformed the others in the construction of protein sequence database from metagenomic data. Similar performance was also observed with a fecal sample, as well as in silico mixed datasets of the simulated microbial community and the fecal sample.
CONCLUSIONS: Overall, we developed an optimized pipeline for gene taxonomic annotation to construct protein sequence databases. Our study can tackle the current taxonomic annotation reliability problem in metagenomics-derived protein sequence database and can promote the in-depth metaproteomic analysis of microbiome. The unique metagenomic and metaproteomic datasets of the 12 bacterial species are publicly available as a standard benchmarking sample for evaluating various analysis pipelines. The code of ConDiGA is open access at GitHub for the analysis of microbiota samples.
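The difference between the three annotation strategies named in the abstract (MD1, MD2, MD3) can be illustrated with a minimal Python sketch. This is not the ConDiGA code. The `classify` function is a hypothetical k-mer matcher standing in for a real classifier such as BLAST, Kaiju, or Kraken2; the `min_contigs` threshold is an assumed way of choosing the "most confident species"; and the toy sequences are contrived so that one gene resembles a decoy genome more than its true source.

```python
from collections import Counter


def classify(seq, reference, k=4):
    """Toy stand-in for a real classifier: return the species whose
    reference sequence shares the most k-mers with seq, or None."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    best, best_hits = None, 0
    for species, genome in reference.items():
        hits = sum(1 for km in kmers if km in genome)
        if hits > best_hits:
            best, best_hits = species, hits
    return best


def md1(contigs, reference):
    """MD1: annotate every gene directly against the whole reference."""
    return {gene_id: classify(gene_seq, reference)
            for contig in contigs
            for gene_id, gene_seq in contig["genes"].items()}


def md2(contigs, reference):
    """MD2: annotate each contig against the whole reference and pass
    the contig's label on to all genes it carries."""
    labels = {}
    for contig in contigs:
        label = classify(contig["seq"], reference)
        for gene_id in contig["genes"]:
            labels[gene_id] = label
    return labels


def md3(contigs, reference, min_contigs=1):
    """MD3-style: keep only species supported by contig annotation
    (assumed rule: at least min_contigs contigs), then annotate genes
    against that reduced reference."""
    counts = Counter(classify(c["seq"], reference) for c in contigs)
    confident = {sp for sp, n in counts.items()
                 if sp is not None and n >= min_contigs}
    reduced = {sp: g for sp, g in reference.items() if sp in confident}
    return md1(contigs, reduced)


# Hypothetical toy data; sp_decoy is absent from the sample.
reference = {
    "sp_A": "ATGGCCAAGTTGACCGATTACAGGCATCG",
    "sp_B": "TTTCCCGGGAAATTT",
    "sp_decoy": "GGCCAAGTCCCCCCC",
}
contigs = [
    {"seq": "ATGGCCAAGTCCACCGATTACAGGCATCG",  # sp_A strain with a variant
     "genes": {"g1": "GGCCAAGTCC", "g2": "GATTACAGGC"}},
    {"seq": "TTTCCCGGGAAATTT",
     "genes": {"g3": "CCCGGGAAA"}},
]

for name, pipeline in (("MD1", md1), ("MD2", md2), ("MD3", md3)):
    print(name, pipeline(contigs, reference))
```

On this contrived input, MD1 assigns gene `g1` to `sp_decoy`, while MD2 and the MD3-style function assign it to `sp_A`, because the longer contig context (MD2) or the reduced reference (MD3) removes the decoy. The example only shows the mechanics of the three strategies and says nothing about their real-world accuracy.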
Affiliation(s)
- Enhui Wu
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
- Vijini Mallawaarachchi
  - School of Computing, College of Engineering, Computing and Cybernetics, The Australian National University, Canberra, ACT, 2600, Australia
  - Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Bedford Park, SA, 5042, Australia
- Jinzhi Zhao
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
- Yi Yang
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
- Hebin Liu
  - Shanghai Omicsolution Co., Ltd, Shanghai, 200000, China
- Xiaoqing Wang
  - Shanghai Omicsolution Co., Ltd, Shanghai, 200000, China
- Chengpin Shen
  - Shanghai Omicsolution Co., Ltd, Shanghai, 200000, China
- Yu Lin
  - School of Computing, College of Engineering, Computing and Cybernetics, The Australian National University, Canberra, ACT, 2600, Australia
- Liang Qiao
  - Department of Chemistry, and Shanghai Stomatological Hospital, Fudan University, Shanghai, 200000, China
3
Porcheddu M, Abbondio M, De Diego L, Uzzau S, Tanca A. Meta4P: A User-Friendly Tool to Parse Label-Free Quantitative Metaproteomic Data and Taxonomic/Functional Annotations. J Proteome Res 2023. [PMID: 37116187 DOI: 10.1021/acs.jproteome.2c00803] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 04/30/2023]
Abstract
We present Meta4P (MetaProteins-Peptides-PSMs Parser), an easy-to-use bioinformatic application designed to integrate label-free quantitative metaproteomic data with taxonomic and functional annotations. Meta4P can retrieve, filter, and process identification and quantification data from three levels of inputs (proteins, peptides, PSMs) in different file formats. Abundance data can be combined with taxonomic and functional information and aggregated at different and customizable levels, including taxon-specific functions and pathways. Meta4P output tables, available in various formats, are ready to be used as inputs for downstream statistical analyses. This user-friendly tool is expected to provide a useful contribution to the field of metaproteomic data analysis, helping make it more manageable and straightforward.
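The aggregation step described in the abstract (combining abundance data with taxonomic and functional annotations at customizable levels) can be sketched in a few lines of Python. This is an illustration of the general idea only, not Meta4P's code, input formats, or interface. The function name `aggregate`, the annotation keys `genus` and `function`, and all values are hypothetical.

```python
from collections import defaultdict


def aggregate(abundances, annotations, keys):
    """Sum per-sample abundances over features that share the same
    annotation values for the chosen keys (e.g. genus, function, or both).
    Features without an annotation are skipped."""
    totals = defaultdict(lambda: defaultdict(float))
    for feature, per_sample in abundances.items():
        annotation = annotations.get(feature)
        if annotation is None:
            continue
        group = tuple(annotation[key] for key in keys)
        for sample, value in per_sample.items():
            totals[group][sample] += value
    return {group: dict(samples) for group, samples in totals.items()}


# Hypothetical label-free intensities per protein and sample.
abundances = {
    "P1": {"S1": 10.0, "S2": 4.0},
    "P2": {"S1": 5.0, "S2": 6.0},
    "P3": {"S1": 2.0, "S2": 0.0},
    "P4": {"S1": 7.0, "S2": 7.0},  # no annotation available
}
annotations = {
    "P1": {"genus": "Bacteroides", "function": "glycoside hydrolase"},
    "P2": {"genus": "Bacteroides", "function": "ABC transporter"},
    "P3": {"genus": "Faecalibacterium", "function": "glycoside hydrolase"},
}

by_genus = aggregate(abundances, annotations, ("genus",))
by_function = aggregate(abundances, annotations, ("function",))
by_taxon_function = aggregate(abundances, annotations, ("genus", "function"))
print(by_genus)
```

Passing both keys gives taxon-specific functions, the kind of combined level the abstract mentions. The same function would work for peptide-level or PSM-level rows, provided each row can be mapped to an annotation.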
Affiliation(s)
- Massimo Porcheddu
  - Department of Biomedical Sciences, University of Sassari, Viale San Pietro 43/B, 07100 Sassari, Italy
- Marcello Abbondio
  - Department of Biomedical Sciences, University of Sassari, Viale San Pietro 43/B, 07100 Sassari, Italy
- Laura De Diego
  - Department of Biomedical Sciences, University of Sassari, Viale San Pietro 43/B, 07100 Sassari, Italy
- Sergio Uzzau
  - Department of Biomedical Sciences, University of Sassari, Viale San Pietro 43/B, 07100 Sassari, Italy
- Alessandro Tanca
  - Department of Biomedical Sciences, University of Sassari, Viale San Pietro 43/B, 07100 Sassari, Italy