Reference Citation Analysis

Total Articles: 117 (from Reference Citation Analysis)
Article PDFs: 60
Cited by > 0: 106
Searched Name: Juan Antonio Vizcaíno

Number	Citation Analysis
1	Open-source large language models in action: A bioinformatics chatbot for PRIDE database. Proteomics 2024:e2400005. [PMID: 38556628 DOI: 10.1002/pmic.202400005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Revised: 03/08/2024] [Accepted: 03/20/2024] [Indexed: 04/02/2024]
Abstract: Here we present a chatbot assistant infrastructure (https://www.ebi.ac.uk/pride/chatbot/) that simplifies user interactions with the PRIDE database's documentation and dataset search functionality. The framework utilizes multiple Large Language Models (LLMs): llama2, chatglm, mixtral (mistral), and openhermes. It also includes a web service API (Application Programming Interface), web interface, and components for indexing and managing vector databases. An Elo-ranking system-based benchmark component is included in the framework as well, which allows for evaluating the performance of each LLM and for improving PRIDE documentation. The chatbot not only allows users to interact with PRIDE documentation but can also be used to search and find PRIDE datasets using an LLM-based recommendation system, enabling dataset discoverability. Importantly, while our infrastructure is exemplified through its application in the PRIDE database context, the modular and adaptable nature of our approach positions it as a valuable tool for improving user experiences across a spectrum of bioinformatics and proteomics tools and resources, among other domains. The integration of advanced LLMs, innovative vector-based construction, the benchmarking framework, and optimized documentation collectively form a robust and transferable chatbot assistant infrastructure. The framework is open-source (https://github.com/PRIDE-Archive/pride-chatbot).
Key Words: bioinformatics; dataset discoverability; documentation; large language models; proteomics; public data; software architectures; training
Grants: 223745/Z/21/Z Wellcome Trust; BB/S01781X/1 Biotechnology and Biological Sciences Research Council
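The abstract of entry 1 mentions an Elo-ranking-based benchmark for comparing LLMs. As a rough illustration only, the textbook Elo update for a single pairwise comparison can be sketched as follows. The K-factor of 32, the starting rating of 1000, the function names, and the example outcome are assumptions made for this sketch; none of them is taken from the PRIDE chatbot code.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (a value between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was, 0.5 for a tie.
    k is the K-factor; 32 is a common default and an assumption here.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical example: two models start level and one answer is preferred.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = update_elo(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
```

Because each update adds to one rating exactly what it subtracts from the other, the sum of the two ratings stays constant, which makes the resulting ranking depend only on relative performance.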
2	TopDownApp: An open and modular platform for analysis and visualisation of top-down proteomics data. Proteomics 2024;24:e2200403. [PMID: 37787899 DOI: 10.1002/pmic.202200403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Revised: 09/13/2023] [Accepted: 09/13/2023] [Indexed: 10/04/2023]
Abstract: Although Top-down (TD) proteomics techniques, aimed at the analysis of intact proteins and proteoforms, are becoming increasingly popular, efforts are needed at different levels to generalise their adoption. In this context, there are numerous improvements that are possible in the area of open science practices, including a greater application of the FAIR (Findable, Accessible, Interoperable, and Reusable) data principles. These include, for example, increased data sharing practices and readily available open data standards. Additionally, the field would benefit from the development of open data analysis workflows that can enable data reuse of public datasets, something that is increasingly common in other proteomics fields.
Key Words: open analysis pipeline; open science; top-down proteomics; visualisation
MeSH Headings: Proteomics/methods; Proteins/analysis; Workflow
Grants: 829157 Horizon 2020 Framework Programme; 823839 European Proteomics Infrastructure Consortium providing access
3	Expression Atlas update: insights from sequencing data at both bulk and single cell level. Nucleic Acids Res 2024;52:D107-D114. [PMID: 37992296 PMCID: PMC10767917 DOI: 10.1093/nar/gkad1021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Revised: 10/13/2023] [Accepted: 10/30/2023] [Indexed: 11/24/2023] [Open]
Abstract: Expression Atlas (www.ebi.ac.uk/gxa) and its newest counterpart the Single Cell Expression Atlas (www.ebi.ac.uk/gxa/sc) are EMBL-EBI's knowledgebases for gene and protein expression and localisation in bulk and at single cell level. These resources aim to allow users to investigate their expression in normal tissue (baseline) or in response to perturbations such as disease or changes to genotype (differential) across multiple species. Users are invited to search for genes or metadata terms across species or biological conditions in a standardised consistent interface. Alongside these data, new features in Single Cell Expression Atlas allow users to query metadata through our new cell type wheel search. At the experiment level data can be explored through two types of dimensionality reduction plots, t-distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and Projection (UMAP), overlaid with either clustering or metadata information to assist users' understanding. Data are also visualised as marker gene heatmaps identifying genes that help confer cluster identity. For some data, additional visualisations are available as interactive cell level anatomograms and cell type gene expression heatmaps.
MeSH Headings: Databases, Genetic; Genotype; Metadata; Proteomics; Single-Cell Analysis; Internet; Humans; Animals; Gene Expression Profiling
Grants: Wellcome Trust; 108437/Z/15/Z Wellcome Trust; European Molecular Biology Laboratory; BBSRC; Fly Cell Atlas; Gramene
4	WOMBAT-P: Benchmarking Label-Free Proteomics Data Analysis Workflows. J Proteome Res 2024;23:418-429. [PMID: 38038272 DOI: 10.1021/acs.jproteome.3c00636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2023]
Abstract: The inherent diversity of approaches in proteomics research has led to a wide range of software solutions for data analysis. These software solutions encompass multiple tools, each employing different algorithms for various tasks such as peptide-spectrum matching, protein inference, quantification, statistical analysis, and visualization. To enable an unbiased comparison of commonly used bottom-up label-free proteomics workflows, we introduce WOMBAT-P, a versatile platform designed for automated benchmarking and comparison. WOMBAT-P simplifies the processing of public data by utilizing the sample and data relationship format for proteomics (SDRF-Proteomics) as input. This feature streamlines the analysis of annotated local or public ProteomeXchange data sets, promoting efficient comparisons among diverse outputs. Through an evaluation using experimental ground truth data and a realistic biological data set, we uncover significant disparities and a limited overlap in the quantified proteins. WOMBAT-P not only enables rapid execution and seamless comparison of workflows but also provides valuable insights into the capabilities of different software solutions. These benchmarking metrics are a valuable resource for researchers in selecting the most suitable workflow for their specific data sets. The modular architecture of WOMBAT-P promotes extensibility and customization. The software is available at https://github.com/wombat-p/WOMBAT-Pipelines.
Key Words: benchmarking; data analysis; label-free proteomics; quality metrics; workflow
MeSH Headings: Workflow; Proteomics; Benchmarking; Software; Proteins; Data Analysis
5	Integrated meta-analysis of colorectal cancer public proteomic datasets for biomarker discovery and validation. PLoS Comput Biol 2024;20:e1011828. [PMID: 38252632 PMCID: PMC10833860 DOI: 10.1371/journal.pcbi.1011828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 02/01/2024] [Accepted: 01/15/2024] [Indexed: 01/24/2024] [Open]
Abstract: The cancer biomarker field has been an object of thorough investigation in the last decades. Despite this, colorectal cancer (CRC) heterogeneity makes it challenging to identify and validate effective prognostic biomarkers for patient classification according to outcome and treatment response. Although a massive amount of proteomics data has been deposited in public data repositories, this rich source of information is vastly underused. Here, we attempted to reuse public proteomics datasets with two main objectives: (i) to generate hypotheses (detection of biomarkers) for their posterior/downstream validation, and (ii) to validate, using an orthogonal approach, a previously described biomarker panel. Twelve CRC public proteomics datasets (mostly from the PRIDE database) were re-analysed and integrated to create a landscape of protein expression. Samples from both solid and liquid biopsies were included in the reanalysis. Integrating this data with survival annotation data, we have validated in silico a six-gene signature for CRC classification at the protein level, and identified five new blood-detectable biomarkers (CD14, PPIA, MRC2, PRDX1, and TXNDC5) associated with CRC prognosis. The prognostic value of these blood-derived proteins was confirmed using additional public datasets, supporting their potential clinical value. In conclusion, this proof-of-concept study demonstrates the value of re-using public proteomics datasets as the basis to create a useful resource for biomarker discovery and validation. The protein expression data has been made available in the public resource Expression Atlas.
MeSH Headings: Humans; Proteomics; Colorectal Neoplasms/diagnosis; Colorectal Neoplasms/genetics; Colorectal Neoplasms/metabolism; Biomarkers, Tumor/metabolism; Blood Proteins; Protein Disulfide-Isomerases
Grants: Ministerio de Ciencia e Innovación; BBSRC; EMBL; Comunidad de Madrid
6	A meta-analysis of rice phosphoproteomics data to understand variation in cell signalling across the rice pan-genome. bioRxiv: the preprint server for biology 2023:2023.11.17.567512. [PMID: 38014076 PMCID: PMC10680829 DOI: 10.1101/2023.11.17.567512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract: Phosphorylation is the most studied post-translational modification, and has multiple biological functions. In this study, we have re-analysed publicly available mass spectrometry proteomics datasets enriched for phosphopeptides from Asian rice (Oryza sativa). In total we identified 15,522 phosphosites on serine, threonine and tyrosine residues on rice proteins. We identified sequence motifs for phosphosites, and linked motifs to enrichment of different biological processes, indicating different downstream regulation likely caused by different kinase groups. We cross-referenced phosphosites against the rice 3,000 genomes, to identify single amino acid variations (SAAVs) within or proximal to phosphosites that could cause loss of a site in a given rice variety. The data was clustered to identify groups of sites with similar patterns across rice family groups, for example those highly conserved in Japonica, but mostly absent in Aus type rice varieties - known to have different responses to drought. These resources can assist rice researchers to discover alleles with significantly different functional effects across rice varieties. The data has been loaded into UniProt Knowledge-Base - enabling researchers to visualise sites alongside other data on rice proteins e.g. structural models from AlphaFold2, PeptideAtlas and the PRIDE database - enabling visualisation of source evidence, including scores and supporting mass spectra.
Grants: Wellcome Trust; R01 GM087221 NIGMS NIH HHS; R24 GM148372 NIGMS NIH HHS
7	Foresight in clinical proteomics: current status, ethical considerations, and future perspectives. Open Research Europe 2023;3:59. [PMID: 37645494 PMCID: PMC10446044 DOI: 10.12688/openreseurope.15810.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/26/2023] [Indexed: 08/31/2023]
Abstract: With the advent of robust and high-throughput mass spectrometric technologies and bioinformatics tools to analyze large data sets, proteomics has penetrated broadly into basic and translational life sciences research. More than 95% of FDA-approved drugs currently target proteins, and most diagnostic tests are protein-based. The introduction of proteomics to the clinic, for instance to guide patient stratification and treatment, is already ongoing. Importantly, ethical challenges come with this success, which must also be adequately addressed by the proteomics and medical communities. Consortium members of the H2020 European Union-funded proteomics initiative: European Proteomics Infrastructure Consortium-providing access (EPIC-XS) met at the Core Technologies for Life Sciences (CTLS) conference to discuss the emerging role and implementation of proteomics in the clinic. The discussion, involving leaders in the field, focused on the current status, related challenges, and future efforts required to make proteomics a more mainstream technology for translational and clinical research. Here we report on that discussion and provide an expert update concerning the feasibility of clinical proteomics, the ethical implications of generating and analyzing large-scale proteomics clinical data, and recommendations to ensure both ethical and effective implementation in real-world applications.
Key Words: Clinical proteomics; clinical research; ethical challenges
Grants: Wellcome Trust
8	lesSDRF is more: maximizing the value of proteomics data through streamlined metadata annotation. Nat Commun 2023;14:6743. [PMID: 37875519 PMCID: PMC10598006 DOI: 10.1038/s41467-023-42543-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Accepted: 10/13/2023] [Indexed: 10/26/2023] [Open]
Abstract: Public proteomics data often lack essential metadata, limiting their potential. To address this, we present lesSDRF, a tool to simplify the process of metadata annotation, thereby ensuring that data leave a lasting, impactful legacy well beyond their initial publication.
Key Words: research data; translational research; protein databases; proteomics
MeSH Headings: Proteomics; Metadata
Grants: Wellcome Trust; EC | Horizon 2020 Framework Programme (EU Framework Programme for Research and Innovation H2020); Fonds Wetenschappelijk Onderzoek (Research Foundation Flanders); Universiteit Gent (UGent); Wellcome Trust (Wellcome); RCUK | Biotechnology and Biological Sciences Research Council (BBSRC)
9	Toward an Integrated Machine Learning Model of a Proteomics Experiment. J Proteome Res 2023;22:681-696. [PMID: 36744821 PMCID: PMC9990124 DOI: 10.1021/acs.jproteome.2c00711] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Indexed: 02/07/2023]
Abstract: In recent years machine learning has made extensive progress in modeling many aspects of mass spectrometry data. We brought together proteomics data generators, repository managers, and machine learning experts in a workshop with the goals to evaluate and explore machine learning applications for realistic modeling of data from multidimensional mass spectrometry-based proteomics analysis of any sample or organism. Following this sample-to-data roadmap helped identify knowledge gaps and define needs. Being able to generate bespoke and realistic synthetic data has legitimate and important uses in system suitability, method development, and algorithm benchmarking, while also posing critical ethical questions. The interdisciplinary nature of the workshop informed discussions of what is currently possible and future opportunities and challenges. In the following perspective we summarize these discussions in the hope of conveying our excitement about the potential of machine learning in proteomics and to inspire future research.
Key Words: artificial intelligence; deep learning; enzymatic digestion; ion mobility; liquid chromatography; machine learning; research integrity; synthetic data; tandem mass spectrometry
10	Integrated View of Baseline Protein Expression in Human Tissues. J Proteome Res 2023;22:729-742. [PMID: 36577097 PMCID: PMC9990129 DOI: 10.1021/acs.jproteome.2c00406] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 12/29/2022]
Abstract: The availability of proteomics datasets in the public domain, and in the PRIDE database, in particular, has increased dramatically in recent years. This unprecedented large-scale availability of data provides an opportunity for combined analyses of datasets to get organism-wide protein abundance data in a consistent manner. We have reanalyzed 24 public proteomics datasets from healthy human individuals to assess baseline protein abundance in 31 organs. We defined tissue as a distinct functional or structural region within an organ. Overall, the aggregated dataset contains 67 healthy tissues, corresponding to 3,119 mass spectrometry runs covering 498 samples from 489 individuals. We compared protein abundances between different organs and studied the distribution of proteins across these organs. We also compared the results with data generated in analogous studies. Additionally, we performed gene ontology and pathway-enrichment analyses to identify organ-specific enriched biological processes and pathways. As a key point, we have integrated the protein abundance results into the resource Expression Atlas, where they can be accessed and visualized either individually or together with gene expression data coming from transcriptomics datasets. We believe this is a good mechanism to make proteomics data more accessible for life scientists.
Key Words: human proteome; mass spectrometry; public data re-use; quantitative proteomics
11	Proteomics Standards Initiative at Twenty Years: Current Activities and Future Work. J Proteome Res 2023;22:287-301. [PMID: 36626722 PMCID: PMC9903322 DOI: 10.1021/acs.jproteome.2c00637] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 10/07/2022] [Indexed: 01/11/2023]
Abstract: The Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) has been successfully developing guidelines, data formats, and controlled vocabularies (CVs) for the proteomics community and other fields supported by mass spectrometry since its inception 20 years ago. Here we describe the general operation of the PSI, including its leadership, working groups, yearly workshops, and the document process by which proposals are thoroughly and publicly reviewed in order to be ratified as PSI standards. We briefly describe the current state of the many existing PSI standards, some of which remain the same as when originally developed, some of which have undergone subsequent revisions, and some of which have become obsolete. Then the set of proposals currently being developed are described, with an open call to the community for participation in the forging of the next generation of standards. Finally, we describe some synergies and collaborations with other organizations and look to the future in how the PSI will continue to promote the open sharing of data and thus accelerate the progress of the field of proteomics.
Key Words: Human Proteome Organization; Proteomics Standards Initiative; mass spectrometry; molecular interactions; proteomics; standards
MeSH Headings: Humans; Proteomics; Proteome; Reference Standards; Vocabulary, Controlled; Mass Spectrometry; Databases, Protein
Grants: R24 GM127667 NIGMS NIH HHS; R01 GM087221 NIGMS NIH HHS; U24 HG007822 NHGRI NIH HHS; BB/T019557/1 Biotechnology and Biological Sciences Research Council; BB/L024225/1 Biotechnology and Biological Sciences Research Council; 223745/Z/21/Z Wellcome Trust; BB/L024128/1 Biotechnology and Biological Sciences Research Council; U19 AG023122 NIA NIH HHS; BB/V018779/1 Biotechnology and Biological Sciences Research Council; BB/N022440/1 Biotechnology and Biological Sciences Research Council; R01 LM013115 NLM NIH HHS; BB/T019670/1 Biotechnology and Biological Sciences Research Council; BB/K01997X/1 Biotechnology and Biological Sciences Research Council; BB/S01781X/1 Biotechnology and Biological Sciences Research Council; BB/R02216X/1 Biotechnology and Biological Sciences Research Council; Wellcome Trust; 208391/Z/17/Z Wellcome Trust; Wellcome; H2020 Marie Sklodowska-Curie Actions; Office of the Director; Division of Biological Infrastructure; Chinese National Infrastructure for Protein Science; Division of Integrative Organismal Systems; Ministry of Science and Technology of the People's Republic of China; Fonds Wetenschappelijk Onderzoek; European Molecular Biology Laboratory; U.S. National Library of Medicine; National Institute of Diabetes and Digestive and Kidney Diseases; Bundesministerium für Bildung und Forschung; National Institute of Allergy and Infectious Diseases; National Institute of General Medical Sciences; National Eye Institute; National Human Genome Research Institute; National Institute on Aging; Japan Society for the Promotion of Science; Japan Science and Technology Agency
12	ProteomicsML: An Online Platform for Community-Curated Data sets and Tutorials for Machine Learning in Proteomics. J Proteome Res 2023;22:632-636. [PMID: 36693629 PMCID: PMC9903315 DOI: 10.1021/acs.jproteome.2c00629] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 01/26/2023]
Abstract: Data set acquisition and curation are often the most difficult and time-consuming parts of a machine learning endeavor. This is especially true for proteomics-based liquid chromatography (LC) coupled to mass spectrometry (MS) data sets, due to the high levels of data reduction that occur between raw data and machine learning-ready data. Since predictive proteomics is an emerging field, when predicting peptide behavior in LC-MS setups, each lab often uses unique and complex data processing pipelines in order to maximize performance, at the cost of accessibility and reproducibility. For this reason we introduce ProteomicsML, an online resource for proteomics-based data sets and tutorials across most of the currently explored physicochemical peptide properties. This community-driven resource makes it simple to access data in easy-to-process formats, and contains easy-to-follow tutorials that allow new users to interact with even the most advanced algorithms in the field. ProteomicsML provides data sets that are useful for comparing state-of-the-art machine learning algorithms, as well as providing introductory material for teachers and newcomers to the field alike. The platform is freely available at https://www.proteomicsml.org/, and we welcome the entire proteomics community to contribute to the project at https://github.com/ProteomicsML/ProteomicsML.
Key Words: bioinformatics; community platform; deep learning; educational platform; machine learning; proteomics
13	Identifying individuals using proteomics: are we there yet? Front Mol Biosci 2022;9:1062031. [PMID: 36523653 PMCID: PMC9744771 DOI: 10.3389/fmolb.2022.1062031] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/05/2022] [Accepted: 11/16/2022] [Indexed: 08/31/2023] [Open]
Abstract: Multi-omics approaches including proteomics analyses are becoming an integral component of precision medicine. As clinical proteomics studies gain momentum and their sensitivity increases, research on identifying individuals based on their proteomics data is here examined for risks and ethics-related issues. A great deal of work has already been done on this topic for DNA/RNA sequencing data, but it has yet to be widely studied in other omics fields. The current state-of-the-art for the identification of individuals based solely on proteomics data is explained. Protein sequence variation analysis approaches are covered in more detail, including the available analysis workflows and their limitations. We also outline some previous forensic and omics proteomics studies that are relevant for the identification of individuals. Following that, we discuss the risks of patient reidentification using other proteomics data types such as protein expression abundance and post-translational modification (PTM) profiles. In light of the potential identification of individuals through proteomics data, possible legal and ethical implications are becoming increasingly important in the field.
Key Words: RNA editing; amino acid variants; genomics data; identifiability; omics data analysis; protein variants; proteogenomics; proteomics data
Grants: Wellcome
14	Is DIA proteomics data FAIR? Current data sharing practices, available bioinformatics infrastructure and recommendations for the future. Proteomics 2022;23:e2200014. [PMID: 36074795 PMCID: PMC10155627 DOI: 10.1002/pmic.202200014] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/30/2022] [Revised: 08/27/2022] [Accepted: 08/29/2022] [Indexed: 11/06/2022]
Abstract: Data independent acquisition (DIA) proteomics techniques have matured enormously in recent years, thanks to multiple technical developments in e.g. instrumentation and data analysis approaches. However, there are many improvements that are still possible for DIA data in the area of the FAIR (Findability, Accessibility, Interoperability and Reusability) data principles. These include more tailored data sharing practices and open data standards, since public databases and data standards for proteomics were mostly designed with DDA data in mind. Here we first describe the current state of the art in the context of FAIR data for proteomics in general, and for DIA approaches in particular. For improving the current situation for DIA data, we make the following recommendations for the future: (i) development of an open data standard for spectral libraries; (ii) make mandatory the availability of the spectral libraries used in DIA experiments in ProteomeXchange resources; (iii) improve the support for DIA data in the data standards developed by the Proteomics Standards Initiative; and (iv) improve the support for DIA datasets in ProteomeXchange resources, including more tailored metadata requirements.
Key Words: Data Independent Acquisition; data repositories; data standards; proteomics data; spectral libraries
15	Implementing the reuse of public DIA proteomics datasets: from the PRIDE database to Expression Atlas. Sci Data 2022;9:335. [PMID: 35701420 PMCID: PMC9197839 DOI: 10.1038/s41597-022-01380-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/15/2021] [Accepted: 05/12/2022] [Indexed: 11/14/2022] [Open]
Abstract: The number of mass spectrometry (MS)-based proteomics datasets in the public domain keeps increasing, particularly those generated by Data Independent Acquisition (DIA) approaches such as SWATH-MS. Unlike Data Dependent Acquisition datasets, the re-use of DIA datasets has been rather limited to date, despite its high potential, due to the technical challenges involved. We introduce a (re-)analysis pipeline for public SWATH-MS datasets which includes a combination of metadata annotation protocols, automated workflows for MS data analysis, statistical analysis, and the integration of the results into the Expression Atlas resource. Automation is orchestrated with Nextflow, using containerised open analysis software tools, rendering the pipeline readily available and reproducible. To demonstrate its utility, we reanalysed 10 public DIA datasets from the PRIDE database, comprising 1,278 SWATH-MS runs. The robustness of the analysis was evaluated, and the results compared to those obtained in the original publications. The final expression values were integrated into Expression Atlas, making SWATH-MS experiments more widely available and combining them with expression data originating from other proteomics and transcriptomics datasets.
Key Words: proteome informatics; data integration; data processing
MeSH Headings: Data Analysis; Databases, Protein; Datasets as Topic; Mass Spectrometry/methods; Proteomics/methods; Software
Grants: Wellcome Trust; RCUK | Biotechnology and Biological Sciences Research Council (BBSRC); Wellcome Trust (Wellcome)
16	Method for Independent Estimation of the False Localization Rate for Phosphoproteomics. J Proteome Res 2022;21:1603-1615. [PMID: 35640880 PMCID: PMC9251759 DOI: 10.1021/acs.jproteome.1c00827] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 12/11/2022]
Abstract: Phosphoproteomic methods are commonly employed to identify and quantify phosphorylation sites on proteins. In recent years, various tools have been developed, incorporating scores or statistics related to whether a given phosphosite has been correctly identified or to estimate the global false localization rate (FLR) within a given data set for all sites reported. These scores have generally been calibrated using synthetic datasets, and their statistical reliability on real datasets is largely unknown, potentially leading to studies reporting incorrectly localized phosphosites, due to inadequate statistical control. In this work, we develop the concept of scoring modifications on a decoy amino acid, that is, one that cannot be modified, to allow for independent estimation of global FLR. We test a variety of amino acids, on both synthetic and real data sets, demonstrating that the selection can make a substantial difference to the estimated global FLR. We conclude that while several different amino acids might be appropriate, the most reliable FLR results were achieved using alanine and leucine as decoys. We propose the use of a decoy amino acid to control false reporting in the literature and in public databases that re-distribute the data. Data are available via ProteomeXchange with identifier PXD028840.
Key Words: database searching; false localization rate; phosphoproteomics; software; statistics
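The decoy amino acid idea in entry 16 can be illustrated with a short sketch. This is a simplified reading of the abstract, not the estimator from the paper: it assumes sites are ranked by decreasing localization score, that every hit on the decoy residue is a false localization, and that decoy hits should be scaled by the ratio of target (S/T/Y) to decoy residue counts to stand in for false hits on real sites. The function name, parameters, and the scaling rule are all assumptions made here.

```python
def estimate_global_flr(site_residues, target_count, decoy_count, decoy="A"):
    """Running global FLR estimate down a ranked list of reported sites.

    site_residues: residue letter of each reported modified site, ordered by
        decreasing localization score (hypothetical input format).
    target_count / decoy_count: how often target (S/T/Y) and decoy residues
        occur in the identified peptides; their ratio rescales decoy hits
        (an assumption of this sketch).
    """
    if decoy_count <= 0:
        raise ValueError("decoy_count must be positive")
    ratio = target_count / decoy_count
    decoy_hits = 0
    flr = []
    for rank, residue in enumerate(site_residues, start=1):
        if residue == decoy:
            decoy_hits += 1
        # Estimated false localizations so far, divided by sites reported so far.
        flr.append(min(1.0, decoy_hits * ratio / rank))
    return flr


# Hypothetical ranked sites: the third-best site falls on the alanine decoy.
example = estimate_global_flr(["S", "T", "A", "S"], target_count=30, decoy_count=15)
```

A score threshold could then be chosen as the lowest-ranked site at which this running estimate stays below a chosen level, such as 0.05.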
17	Proteomics Standards Initiative's ProForma 2.0: Unifying the Encoding of Proteoforms and Peptidoforms. J Proteome Res 2022;21:1189-1195. [PMID: 35290070 PMCID: PMC7612572 DOI: 10.1021/acs.jproteome.1c00771] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022] Abstract It is important for the proteomics community to have a standardized manner to represent all possible variations of a protein or peptide primary sequence, including natural, chemically-induced and artifactual modifications. The Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) in collaboration with several members of the Consortium for Top-Down Proteomics (CTDP) has developed a standard notation called ProForma 2.0, which is a substantial extension of the original ProForma notation developed by the CTDP. ProForma 2.0 aims to unify the representation of proteoforms and peptidoforms. ProForma 2.0 supports use cases needed for bottom-up and middle-/top-down proteomics approaches and allows the encoding of highly modified proteins and peptides using a human- and machine-readable string. ProForma 2.0 can be used to represent protein modifications in a specified or ambiguous location, designated by mass shifts, chemical formulas, or controlled vocabulary terms, including cross-links (natural and chemical), and atomic isotopes. Notational conventions are based on public controlled vocabularies and ontologies. The most up-to-date full specification document and information about software implementations are available at http://psidev.info/proforma. Key Words: FAIR ProForma data standards file formats mass spectrometry peptidoform proteoform top-down proteomics
18	Expression Atlas update: gene and protein expression in multiple species. Nucleic Acids Res 2022;50:D129-D140. [PMID: 34850121 PMCID: PMC8728300 DOI: 10.1093/nar/gkab1030] [Citation(s) in RCA: 63] [Impact Index Per Article: 31.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2021] [Revised: 10/11/2021] [Accepted: 11/19/2021] [Indexed: 01/21/2023] Open Abstract The EMBL-EBI Expression Atlas is an added value knowledge base that enables researchers to answer the question of where (tissue, organism part, developmental stage, cell type) and under which conditions (disease, treatment, gender, etc) a gene or protein of interest is expressed. Expression Atlas brings together data from >4500 expression studies from >65 different species, across different conditions and tissues. It makes these data freely available in an easy to visualise form, after expert curation to accurately represent the intended experimental design, re-analysed via standardised pipelines that rely on open-source community developed tools. Each study's metadata are annotated using ontologies. The data are re-analyzed with the aim of reproducing the original conclusions of the underlying experiments. Expression Atlas is currently divided into Bulk Expression Atlas and Single Cell Expression Atlas. Expression Atlas contains data from differential studies (microarray and bulk RNA-Seq) and baseline studies (bulk RNA-Seq and proteomics), whereas Single Cell Expression Atlas is currently dedicated to Single Cell RNA-Sequencing (scRNA-Seq) studies. The resource has been in continuous development since 2009 and it is available at https://www.ebi.ac.uk/gxa. 
MESH Headings: Computational Biology Databases, Genetic Gene Expression Profiling Humans Proteins/chemistry Proteins/genetics Proteomics RNA-Seq Sequence Analysis, RNA Single-Cell Analysis Software. Grants: 108437/Z/15/Z Wellcome Trust Wellcome Trust BB/P024599/1 Biotechnology and Biological Sciences Research Council 221401/Z/20/Z Wellcome Trust BB/T019670/1 Biotechnology and Biological Sciences Research Council BB/T014563/1 Biotechnology and Biological Sciences Research Council
19	The growing need for controlled data access models in clinical proteomics and metabolomics. Nat Commun 2021;12:5787. [PMID: 34599180 PMCID: PMC8486822 DOI: 10.1038/s41467-021-26110-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/30/2021] [Accepted: 09/17/2021] [Indexed: 01/25/2023] Open Abstract More and more clinical studies include potentially sensitive human proteomics or metabolomics datasets, but bioinformatics resources for managing access to these data are not yet available. This commentary discusses current best practices and future perspectives for the responsible handling of clinical proteomics and metabolomics data.
20	Universal Spectrum Identifier for mass spectra. Nat Methods 2021;18:768-770. [PMID: 34183830 PMCID: PMC8405201 DOI: 10.1038/s41592-021-01184-6] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 12/11/2020] [Accepted: 05/10/2021] [Indexed: 02/03/2023] Abstract Mass spectra provide the ultimate evidence to support the findings of mass spectrometry proteomics studies in publications, and it is therefore crucial to be able to trace the conclusions back to the spectra. The Universal Spectrum Identifier (USI) provides a standardized mechanism for encoding a virtual path to any mass spectrum contained in datasets deposited to public proteomics repositories. USI enables greater transparency of spectral evidence, with more than 1 billion USI identifications from over 3 billion spectra already available through ProteomeXchange repositories. Key Words: proteomics standards initiative psi mass spectrometry proteomics universal spectrum identifier usi standards
21	BioContainers Registry: Searching Bioinformatics and Proteomics Tools, Packages, and Containers. J Proteome Res 2021;20:2056-2061. [PMID: 33625229 PMCID: PMC7611561 DOI: 10.1021/acs.jproteome.0c00904] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Indexed: 02/06/2023] Abstract BioContainers is an open-source project that aims to create, store, and distribute bioinformatics software containers and packages. The BioContainers community has developed a set of guidelines to standardize software containers including the metadata, versions, licenses, and software dependencies. BioContainers supports multiple packaging and container technologies such as Conda, Docker, and Singularity. BioContainers provides over 9000 bioinformatics tools, including more than 200 proteomics and mass spectrometry tools. Here we introduce the BioContainers Registry and RESTful API to make containerized bioinformatics tools more findable, accessible, interoperable, and reusable (FAIR). The BioContainers Registry provides a fast and convenient way to find and retrieve bioinformatics tool packages and containers. By doing so, it will increase the use of bioinformatics packages and containers while promoting replicability and reproducibility in research. Key Words: BioContainers cloud computational proteomics high-performance computing large-scale data analysis. MESH Headings: Computational Biology Proteomics Registries Reproducibility of Results Software. Grants: Wellcome Trust 208391 Wellcome Trust 208391/Z/17/Z Wellcome Trust
22	Data Management of Sensitive Human Proteomics Data: Current Practices, Recommendations, and Perspectives for the Future. Mol Cell Proteomics 2021;20:100071. [PMID: 33711481 PMCID: PMC8056256 DOI: 10.1016/j.mcpro.2021.100071] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2020] [Revised: 03/01/2021] [Accepted: 03/02/2021] [Indexed: 12/12/2022] Open Abstract Today it is the norm that all relevant proteomics data that support the conclusions in scientific publications are made available in public proteomics data repositories. However, given the increase in the number of clinical proteomics studies, an important emerging topic is the management and dissemination of clinical, and thus potentially sensitive, human proteomics data. Both in the United States and in the European Union, there are legal frameworks protecting the privacy of individuals. Implementing privacy standards for publicly released research data in genomics and transcriptomics has led to processes to control who may access the data, so-called "controlled access" data. In parallel with the technological developments in the field, it is clear that the privacy risks of sharing proteomics data need to be properly assessed and managed. In our view, the proteomics community must be proactive in addressing these issues. Yet a careful balance must be kept. On the one hand, neglecting to address the potential of identifiability in human proteomics data could lead to reputational damage of the field, while on the other hand, erecting barriers to open access to clinical proteomics data will inevitably reduce reuse of proteomics data and could substantially delay critical discoveries in biomedical research. 
In order to balance these apparently conflicting requirements for data privacy and efficient use and reuse of research efforts through the sharing of clinical proteomics data, development efforts will be needed at different levels including bioinformatics infrastructure, policymaking, and mechanisms of oversight. Key Words: controlled access data databases ethics mass spectrometry policy proteomics. MESH Headings: Confidentiality Data Management Humans Information Dissemination Proteomics. Grants: R24 GM127667 NIGMS NIH HHS U19 AG023122 NIA NIH HHS P41 GM103484 NIGMS NIH HHS R01 GM087221 NIGMS NIH HHS R01 LM013115 NLM NIH HHS BB/P024599/1 Biotechnology and Biological Sciences Research Council Wellcome Trust 208391/Z/17/Z Wellcome Trust
23	Using Deep Learning to Extrapolate Protein Expression Measurements. Proteomics 2020;20:e2000009. [PMID: 32937025 PMCID: PMC7757209 DOI: 10.1002/pmic.202000009] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/12/2020] [Revised: 08/27/2020] [Indexed: 01/23/2023] Abstract Mass spectrometry (MS)-based quantitative proteomics experiments typically assay a subset of up to 60% of the ≈20 000 human protein coding genes. Computational methods for imputing the missing values using RNA expression data usually allow only for imputations of proteins measured in at least some of the samples. In silico methods for comprehensively estimating abundances across all proteins are still missing. Here, a novel method is proposed using deep learning to extrapolate the observed protein expression values in label-free MS experiments to all proteins, leveraging gene functional annotations and RNA measurements as key predictive attributes. This method is tested on four datasets, including human cell lines and human and mouse tissues. This method predicts the protein expression values with average R² scores between 0.46 and 0.54, which is significantly better than predictions based on correlations using the RNA expression data alone. Moreover, it is demonstrated that the derived models can be "transferred" across experiments and species. For instance, the model derived from human tissues gave an R² of 0.51 when applied to mouse tissue data. It is concluded that protein abundances generated in label-free MS experiments can be computationally predicted using functional annotated attributes and can be used to highlight aberrant protein abundance values. 
Key Words: Gene Ontology UniProt keywords deep learning networks mass spectrometry protein abundance prediction. MESH Headings: Animals Deep Learning Mass Spectrometry Mice Molecular Sequence Annotation Proteins Proteomics. Grants: C309/A25144 Cancer Research UK U41 HG007234 NHGRI NIH HHS Wellcome Trust Cancer Research UK Grant number 208391/Z/17/Z Wellcome Trust Wellcome Trust Latvijas Zinātnes Padome European Regional Development Fund
24	The PRIDE database and related tools and resources in 2019: improving support for quantification data. Nucleic Acids Res 2020;47:D442-D450. [PMID: 30395289 PMCID: PMC6323896 DOI: 10.1093/nar/gky1106] [Citation(s) in RCA: 4975] [Impact Index Per Article: 1243.8] [Received: 09/22/2018] [Accepted: 10/22/2018] [Indexed: 02/06/2023] Open Abstract The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world’s largest data repository of mass spectrometry-based proteomics data, and is one of the founding members of the global ProteomeXchange (PX) consortium. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2016. In the last 3 years, public data sharing through PRIDE (as part of PX) has definitely become the norm in the field. In parallel, data re-use of public proteomics data has increased enormously, with multiple applications. We first describe the new architecture of PRIDE Archive, the archival component of PRIDE. PRIDE Archive and the related data submission framework have been further developed to support the increase in submitted data volumes and additional data types. A new scalable and fault tolerant storage backend, Application Programming Interface and web interface have been implemented, as a part of an ongoing process. Additionally, we emphasize the improved support for quantitative proteomics data through the mzTab format. Finally, we outline key statistics on the current data contents and volume of downloads, and how PRIDE data are starting to be disseminated to added-value resources including Ensembl, UniProt and Expression Atlas.
25	A five-level classification system for proteoform identifications. Nat Methods 2020;16:939-940. [PMID: 31451767 DOI: 10.1038/s41592-019-0573-x] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Indexed: 02/06/2023]
26	The Human Immunopeptidome Project: A Roadmap to Predict and Treat Immune Diseases. Mol Cell Proteomics 2020;19:31-49. [PMID: 31744855 PMCID: PMC6944237 DOI: 10.1074/mcp.r119.001743] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2019] [Revised: 11/18/2019] [Indexed: 12/11/2022] Open Abstract The science that investigates the ensembles of all peptides associated to human leukocyte antigen (HLA) molecules is termed "immunopeptidomics" and is typically driven by mass spectrometry (MS) technologies. Recent advances in MS technologies, neoantigen discovery and cancer immunotherapy have catalyzed the launch of the Human Immunopeptidome Project (HIPP) with the goal of providing a complete map of the human immunopeptidome and making the technology so robust that it will be available in every clinic. Here, we provide a long-term perspective of the field and we use this framework to explore how we think the completion of the HIPP will truly impact the society in the future. In this context, we introduce the concept of immunopeptidome-wide association studies (IWAS). We highlight the importance of large cohort studies for the future and how applying quantitative immunopeptidomics at population scale may provide a new look at individual predisposition to common immune diseases as well as responsiveness to vaccines and immunotherapies. Through this vision, we aim to provide a fresh view of the field to stimulate new discussions within the community, and present what we see as the key challenges for the future for unlocking the full potential of immunopeptidomics in this era of precision medicine. 
Key Words: HLA/MHC IWAS Immunology cancer therapeutics immunopeptidome infectious disease mass spectrometry peptides. MESH Headings: Alleles Autoimmune Diseases/diagnosis Autoimmune Diseases/therapy Cohort Studies HLA Antigens/immunology Histocompatibility Antigens Class I/immunology Histocompatibility Antigens Class II/immunology Humans Immunotherapy Infections/diagnosis Infections/therapy Mass Spectrometry Neoplasms/diagnosis Neoplasms/therapy Peptides/chemistry Peptides/immunology Precision Medicine Prognosis. Grants: Wellcome Trust 208391 Wellcome Trust BB/P024599/1 Biotechnology and Biological Sciences Research Council Wellcome Trust (Wellcome) UK Research and Innovation \| Biotechnology and Biological Sciences Research Council (BBSRC) Epic Foundation
27	Proteomics Standards Initiative Extended FASTA Format. J Proteome Res 2019;18:2686-2692. [PMID: 31081335 DOI: 10.1021/acs.jproteome.9b00064] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Abstract Mass-spectrometry-based proteomics enables the high-throughput identification and quantification of proteins, including sequence variants and post-translational modifications (PTMs) in biological samples. However, most workflows require that such variations be included in the search space used to analyze the data, and doing so remains challenging with most analysis tools. In order to facilitate the search for known sequence variants and PTMs, the Proteomics Standards Initiative (PSI) has designed and implemented the PSI extended FASTA format (PEFF). PEFF is based on the very popular FASTA format but adds a uniform mechanism for encoding substantially more metadata about the sequence collection as well as individual entries, including support for encoding known sequence variants, PTMs, and proteoforms. The format is very nearly backward compatible, and as such, existing FASTA parsers will require little or no changes to be able to read PEFF files as FASTA files, although without supporting any of the extra capabilities of PEFF. PEFF is defined by a full specification document, controlled vocabulary terms, a set of example files, software libraries, and a file validator. Popular software and resources are starting to support PEFF, including the sequence search engine Comet and the knowledge bases neXtProt and UniProtKB. Widespread implementation of PEFF is expected to further enable proteogenomics and top-down proteomics applications by providing a standardized mechanism for encoding protein sequences and their known variations. 
All the related documentation, including the detailed file format specification and example files, is available at http://www.psidev.info/peff. Key Words: FASTA PEFF PSI Proteomics Standards Initiative file formats mass spectrometry proteogenomics proteomics standards
28	Spectral Clustering Improves Label-Free Quantification of Low-Abundant Proteins. J Proteome Res 2019;18:1477-1485. [PMID: 30859831 PMCID: PMC6456873 DOI: 10.1021/acs.jproteome.8b00377] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 05/28/2018] [Indexed: 11/29/2022] Abstract Label-free quantification has become common practice in many mass spectrometry-based proteomics experiments. In recent years, we and others have shown that spectral clustering can considerably improve the analysis of (primarily large-scale) proteomics data sets. Here we show that spectral clustering can be used to infer additional peptide-spectrum matches and improve the quality of label-free quantitative proteomics data even in data sets containing only tens of MS runs. We analyzed four well-known public benchmark data sets that represent different experimental settings using spectral counting and peak intensity based label-free quantification. In both approaches, the additionally inferred peptide-spectrum matches through our spectra-cluster algorithm improved the detectability of low abundant proteins while increasing the accuracy of the derived quantitative data, without increasing the data sets' noise. Additionally, we developed a Proteome Discoverer node for our spectra-cluster algorithm which allows anyone to rebuild our proposed pipeline using the free version of Proteome Discoverer. 
Key Words: IMP free nodes Proteome Discoverer Proteome Discoverer node benchmarking study bioinformatics label-free quantification mass spectrometry proteomics spectral clustering spectral counting. MESH Headings: Algorithms Cluster Analysis Databases, Protein Humans Mass Spectrometry/methods Proteome/analysis Proteomics/methods. Grants: P 30325 Austrian Science Fund FWF WT101477MA Wellcome Trust I 3686 Austrian Science Fund FWF 208391/Z/17/Z Wellcome Trust Wellcome Trust
29	mzTab-M: A Data Standard for Sharing Quantitative Results in Mass Spectrometry Metabolomics. Anal Chem 2019;91:3302-3310. [PMID: 30688441 PMCID: PMC6660005 DOI: 10.1021/acs.analchem.8b04310] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2018] [Accepted: 01/28/2019] [Indexed: 12/29/2022] Abstract Mass spectrometry (MS) is one of the primary techniques used for large-scale analysis of small molecules in metabolomics studies. To date, there has been little data format standardization in this field, as different software packages export results in different formats represented in XML or plain text, making data sharing, database deposition, and reanalysis highly challenging. Working within the consortia of the Metabolomics Standards Initiative, Proteomics Standards Initiative, and the Metabolomics Society, we have created mzTab-M to act as a common output format from analytical approaches using MS on small molecules. The format has been developed over several years, with input from a wide range of stakeholders. mzTab-M is a simple tab-separated text format, but importantly, the structure is highly standardized through the design of a detailed specification document, tightly coupled to validation software, and a mandatory controlled vocabulary of terms to populate it. The format is able to represent final quantification values from analyses, as well as the evidence trail in terms of features measured directly from MS (e.g., LC-MS, GC-MS, DIMS, etc.) and different types of approaches used to identify molecules. mzTab-M allows for ambiguity in the identification of molecules to be communicated clearly to readers of the files (both people and software). There are several implementations of the format available, and we anticipate widespread adoption in the field. 
MESH Headings: Databases, Factual Mass Spectrometry Metabolomics/methods Software. Grants: R24 GM127667 NIGMS NIH HHS BB/L024128/1 Biotechnology and Biological Sciences Research Council BB/M020282/1 Biotechnology and Biological Sciences Research Council BB/K01997X/1 Biotechnology and Biological Sciences Research Council 001 World Health Organization BB/E025080/1 Biotechnology and Biological Sciences Research Council BB/I000771/1 Biotechnology and Biological Sciences Research Council
30	Quantitative Proteomics Data in the Public Domain: Challenges and Opportunities. Methods Mol Biol 2019;1977:217-235. [PMID: 30980331 DOI: 10.1007/978-1-4939-9232-4_14] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 06/09/2023] Abstract Mass spectrometry-based proteomics is no longer only a qualitative discipline, and can be successfully employed to obtain a truly multidimensional view of the proteome. In particular, systematic protein expression profiling is now a routine part of many studies in the field and beyond. The large growth in the number of quantitative studies is accompanied by a trend to share publicly the associated analysis results and the underlying raw data. This trend, established and strongly supported by public repositories such as the PRIDE database at the European Bioinformatics Institute, opens up enormous possibilities to explore the data beyond the original publications, for instance by reusing, reanalyzing, and performing different flavors of meta-analysis studies. To help researchers and scientists realize this potential, here we describe the mainstream public proteomics resources containing quantitative proteomics data, including the processed analysis results and/or the underlying raw data. We then present and discuss the most important points to consider when attempting to (re)use proteomics data in the public domain. We conclude by highlighting potential pitfalls of (re)using quantitative data and discuss some of our own experiences in this context. 
Key Words: Data (re)analysis Data repository Mass spectrometry PRIDE database Quantitative proteomics. MESH Headings: Computational Biology/methods Data Analysis Databases, Protein Humans Mass Spectrometry Proteomics/methods Proteomics/standards Reproducibility of Results Web Browser. Grants: WT101477MA Wellcome Trust 208391/Z/17/Z Wellcome Trust
31	Expanding the Use of Spectral Libraries in Proteomics. J Proteome Res 2018;17:4051-4060. [PMID: 30270626 PMCID: PMC6443480 DOI: 10.1021/acs.jproteome.8b00485] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Abstract The 2017 Dagstuhl Seminar on Computational Proteomics provided an opportunity for a broad discussion on the current state and future directions of the generation and use of peptide tandem mass spectrometry spectral libraries. Their use in proteomics is growing slowly, but there are multiple challenges in the field that must be addressed to further increase the adoption of spectral libraries and related techniques. The primary bottlenecks are the paucity of high quality and comprehensive libraries and the general difficulty of adopting spectral library searching into existing workflows. There are several existing spectral library formats, but none captures a satisfactory level of metadata; therefore, a logical next improvement is to design a more advanced, Proteomics Standards Initiative-approved spectral library format that can encode all of the desired metadata. The group discussed a series of metadata requirements organized into three designations of completeness or quality, tentatively dubbed bronze, silver, and gold. The metadata can be organized at four different levels of granularity: at the collection (library) level, at the individual entry (peptide ion) level, at the peak (fragment ion) level, and at the peak annotation level. Strategies for encoding mass modifications in a consistent manner and the requirement for encoding high-quality and commonly seen but as-yet-unidentified spectra were discussed. 
The group also discussed related topics, including strategies for comparing two spectra, techniques for generating representative spectra for a library, approaches for selection of optimal signature ions for targeted workflows, and issues surrounding the merging of two or more libraries into one. We present here a review of this field and the challenges that the community must address in order to accelerate the adoption of spectral libraries in routine analysis of proteomics datasets. Key Words: Dagstuhl Seminar Proteomics Standards Initiative formats mass spectrometry meeting report spectral libraries standards. MESH Headings: Animals Databases, Protein/standards Humans Peptide Library Proteomics/methods Tandem Mass Spectrometry/methods Workflow. Grants: R24 GM127667 NIGMS NIH HHS U54 EB020406 NIBIB NIH HHS P41 GM103484 NIGMS NIH HHS R01 GM087221 NIGMS NIH HHS BB/M024954 Biotechnology and Biological Sciences Research Council WT101477MA Wellcome Trust 001 World Health Organization MR/L011093/3 Medical Research Council MR/L011093/1 Medical Research Council BB/P024599/1 Biotechnology and Biological Sciences Research Council 208391/Z/17/Z Wellcome Trust Wellcome Trust MR/N028457/1 Medical Research Council MR/L011093/2 Medical Research Council
32	Future Prospects of Spectral Clustering Approaches in Proteomics. Proteomics 2018;18:e1700454. [PMID: 29882266 PMCID: PMC6099476 DOI: 10.1002/pmic.201700454] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 04/21/2018] [Revised: 05/23/2018] [Indexed: 12/14/2022] Abstract In this article, current and future applications of spectral clustering are discussed in the context of mass spectrometry-based proteomics approaches. First of all, the main algorithms and tools that can currently be used to perform spectral clustering are introduced. In addition, its main applications and their use in current computational proteomics workflows are explained, including the generation of spectral libraries and spectral archives. Finally, possible future directions for spectral clustering are outlined, including its potential use to achieve a deeper coverage of the proteome and the discovery of novel post-translational modifications and single amino acid variants. Key Words: algorithms computational proteomics mass spectrometry spectral clustering. MESH Headings: Algorithms Cluster Analysis Databases, Protein Humans Proteome/analysis Proteomics/methods Spectrum Analysis/methods. Grants: Wellcome Trust WT101477MA Wellcome Trust BB/P024599/1 Biotechnology and Biological Sciences Research Council
33	Minimal Information About an Immuno-Peptidomics Experiment (MIAIPE). Proteomics 2018;18:e1800110. [PMID: 29791771 PMCID: PMC6033177 DOI: 10.1002/pmic.201800110] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 03/30/2018] [Indexed: 12/19/2022] Abstract Minimal information about an immuno-peptidomics experiment (MIAIPE) is an initiative of the members of the Human Immuno-Peptidome Project (HIPP), an international program organized by the Human Proteome Organization (HUPO). The aim of the MIAIPE guidelines is to deliver technical guidelines representing the minimal information required to sufficiently support the evaluation and interpretation of immunopeptidomics experiments. The MIAIPE document has been designed to report essential information about sample preparation, mass spectrometric measurement, and associated mass spectrometry (MS)-related bioinformatics aspects that are unique to immunopeptidomics and may not be covered by the general proteomics MIAPE (minimal information about a proteomics experiment) guidelines. Key Words: antigen processing and presentation immunopeptidomics major histocompatibility complex. MESH Headings: Computational Biology/standards Databases, Protein Histocompatibility Antigens Class I/analysis Histocompatibility Antigens Class I/immunology Histocompatibility Antigens Class I/metabolism Histocompatibility Antigens Class II/analysis Histocompatibility Antigens Class II/immunology Histocompatibility Antigens Class II/metabolism Humans Peptide Fragments/analysis Peptide Fragments/immunology Peptide Fragments/metabolism Proteomics/standards Software Specimen Handling/standards. Grants: R24 GM127667 NIGMS NIH HHS U01 CA194389 NCI NIH HHS
34	Response to "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra". J Proteome Res 2018;17:1993-1996. [PMID: 29682973 DOI: 10.1021/acs.jproteome.7b00824] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 12/21/2022] Abstract: In the recent benchmarking article entitled "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra", Rieder et al. compared several different approaches to cluster MS/MS spectra. While we certainly recognize the value of the manuscript, here we report some shortcomings detected in the original analyses. First, for most analyses, the authors clustered only single MS/MS runs. In one of the reported analyses, three MS/MS runs were processed together, which already led to computational performance issues in many of the tested approaches. This fact highlights the difficulties of using many of the tested algorithms on the average proteomics data sets produced today. Second, the authors only processed identified spectra when merging MS runs. Thereby, all unidentified spectra that are of lower quality were already removed from the data set and could not influence the clustering results. Next, we found that the authors did not analyze the effect of chimeric spectra on the clustering results. In our analysis, we found that 3% of the spectra in the used data sets were chimeric, and this had marked effects on the behavior of the different clustering algorithms tested. Finally, the authors' choice to evaluate the MS-Cluster and spectra-cluster algorithms using a precursor tolerance of 5 Da for high-resolution Orbitrap data only was, in our opinion, not adequate to assess the performance of MS/MS clustering approaches.
35	The proBAM and proBed standard formats: enabling a seamless integration of genomics and proteomics data. Genome Biol 2018;19:12. [PMID: 29386051 PMCID: PMC5793360 DOI: 10.1186/s13059-017-1377-x] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 06/09/2017] [Accepted: 12/07/2017] [Indexed: 01/23/2023] [Open Access] Abstract: On behalf of The Human Proteome Organization (HUPO) Proteomics Standards Initiative, we introduce here two novel standard data formats, proBAM and proBed, that have been developed to address the current challenges of integrating mass spectrometry-based proteomics data with genomics and transcriptomics information in proteogenomics studies. proBAM and proBed are adaptations of the well-defined, widely used file formats SAM/BAM and BED, respectively, and both have been extended to meet the specific requirements entailed by proteomics data. Therefore, existing popular genomics tools such as SAMtools and Bedtools, and several widely used genome browsers, can already be used to manipulate and visualize these formats "out-of-the-box." We also highlight that a number of specific additional software tools, properly supporting the proteomics information available in these formats, are now available, providing functionalities such as file generation, file conversion, and data analysis. All the related documentation, including the detailed file format specifications and example files, is accessible at http://www.psidev.info/probam and at http://www.psidev.info/probed.
36	Enhanced Missing Proteins Detection in NCI60 Cell Lines Using an Integrative Search Engine Approach. J Proteome Res 2017;16:4374-4390. [PMID: 28960077 PMCID: PMC5737412 DOI: 10.1021/acs.jproteome.7b00388] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 12/17/2022] Abstract: The Human Proteome Project (HPP) aims to decipher the complete map of the human proteome. In the past few years, significant efforts of the HPP teams have been dedicated to the experimental detection of the missing proteins, which lack reliable mass spectrometry evidence of their existence. In this endeavor, an in-depth analysis of shotgun experiments might represent a valuable resource for selecting a biological matrix when designing validation experiments. In this work, we used all the proteomic experiments from the NCI60 cell lines and applied an integrative approach based on the results obtained from Comet, Mascot, OMSSA, and X!Tandem. This workflow benefits from the complementarity of these search engines to increase the proteome coverage. Five missing proteins compliant with the C-HPP guidelines were identified, although further validation is needed. Moreover, 165 missing proteins were detected with only one unique peptide, and their functional analysis supported their participation in cellular pathways, as was also proposed in other studies. Finally, we performed a combined analysis of the gene expression levels and the proteomic identifications from the cell lines common to the NCI60 and the CCLE project to suggest alternatives for further validation of missing protein observations. Key Words: C-HPP; CCLE; NCI60; integration of search engines; missing proteins; peptide detectability.
37	OLS Client and OLS Dialog: Open Source Tools to Annotate Public Omics Datasets. Proteomics 2017;17:1700244. [PMID: 28792687 PMCID: PMC5707441 DOI: 10.1002/pmic.201700244] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 06/25/2017] [Revised: 07/12/2017] [Indexed: 01/12/2023] Abstract: The availability of user-friendly software to annotate biological datasets and experimental details is becoming essential in data management practices, both in local storage systems and in public databases. The Ontology Lookup Service (OLS, http://www.ebi.ac.uk/ols) is a popular centralized service to query, browse and navigate biomedical ontologies and controlled vocabularies. Recently, the OLS framework has been completely redeveloped (version 3.0), including enhancements in the data model, like the added support for Web Ontology Language based ontologies, among many other improvements. However, the new OLS is not backwards compatible and new software tools are needed to enable access to this widely used framework now that the previous version is no longer available. We here present the OLS Client as a free, open-source Java library to retrieve information from the new version of the OLS. It enables rapid tool creation by providing a robust, pluggable programming interface and common data model to programmatically access the OLS. The library has already been integrated and is routinely used by several bioinformatics resources and related data annotation tools. Secondly, we also introduce an updated version of the OLS Dialog (version 2.0), a Java graphical user interface that can be easily plugged into Java desktop applications to access the OLS. The software and related documentation are freely available at https://github.com/PRIDE-Utilities/ols-client and https://github.com/PRIDE-Toolsuite/ols-dialog. Key Words: data annotation; omics datasets; ontologies; open source software. MeSH Headings: Biological Ontologies; Computational Biology/methods; Databases, Factual; Genomics; Humans; Information Storage and Retrieval; Metabolomics; Proteomics; Software; User-Computer Interface. Grants: Wellcome Trust.
38	Proteomics Standards Initiative: Fifteen Years of Progress and Future Work. J Proteome Res 2017;16:4288-4298. [PMID: 28849660 PMCID: PMC5715286 DOI: 10.1021/acs.jproteome.7b00370] [Citation(s) in RCA: 69] [Impact Index Per Article: 9.9] [Indexed: 12/21/2022] Abstract: The Proteomics Standards Initiative (PSI) of the Human Proteome Organization (HUPO) has now been developing and promoting open community standards and software tools in the field of proteomics for 15 years. Under the guidance of the chair, cochairs, and other leadership positions, the PSI working groups are tasked with the development and maintenance of community standards via special workshops and ongoing work. Among the existing ratified standards, the PSI working groups continue to update PSI-MI XML, MITAB, mzML, mzIdentML, mzQuantML, mzTab, and the MIAPE (Minimum Information About a Proteomics Experiment) guidelines with the advance of new technologies and techniques. Furthermore, new standards are currently either in the final stages of completion (proBed and proBAM for proteogenomics results as well as PEFF) or in early stages of design (a spectral library standard format, a universal spectrum identifier, the qcML quality control format, and the Protein Expression Interface (PROXI) web services Application Programming Interface). In this work we review the current status of all of these aspects of the PSI, describe synergies with other efforts such as the ProteomeXchange Consortium, the Human Proteome Project, and the metabolomics community, and provide a look at future directions of the PSI. Key Words: bioinformatics software; data standard; database; mass spectrometry; metabolomics; molecular interactions; protein identification; protein quantification; proteomics; quality control.
39	Using the PRIDE Database and ProteomeXchange for Submitting and Accessing Public Proteomics Datasets. Curr Protoc Bioinformatics 2017;59:13.31.1-13.31.12. [PMID: 28902400 DOI: 10.1002/cpbi.30] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Indexed: 12/15/2022] Abstract: The ProteomeXchange (PX) Consortium is the unifying framework for world-leading mass spectrometry (MS)-based proteomics repositories. Current members include the PRIDE database (U.K.), PeptideAtlas/PASSEL and MassIVE (U.S.A.), and jPOST (Japan). The Consortium standardizes submission and dissemination of public proteomics data worldwide. This is achieved through implementing common data submission guidelines and enforcing metadata requirements by each of the members. Furthermore, the members use a common identifier space. Each dataset receives a unique (PXD) accession number and is publicly accessible as soon as the associated scientific publications are released. The two basic protocols provide a step-by-step guide on how to submit data to the PRIDE database, and describe how to access the PX portal (called ProteomeCentral), which can be used to search datasets available in any of the PX members. © 2017 by John Wiley & Sons, Inc. Key Words: PRIDE database; data repository; mass spectrometry; proteomics.
40	The mzIdentML Data Standard Version 1.2, Supporting Advances in Proteome Informatics. Mol Cell Proteomics 2017;16:1275-1285. [PMID: 28515314 PMCID: PMC5500760 DOI: 10.1074/mcp.m117.068429] [Citation(s) in RCA: 46] [Impact Index Per Article: 6.6] [Received: 03/17/2017] [Revised: 05/15/2017] [Indexed: 12/31/2022] [Open Access] Abstract: The first stable version of the Proteomics Standards Initiative mzIdentML open data standard (version 1.1) was published in 2012, capturing the outputs of peptide and protein identification software. In the intervening years, the standard has become well-supported in both commercial and open software, as well as a submission and download format for public repositories. Here we report a new release of mzIdentML (version 1.2) that is required to keep pace with emerging practice in proteome informatics. New features have been added to support: (1) scores associated with localization of modifications on peptides; (2) statistics performed at the level of peptides; (3) identification of cross-linked peptides; and (4) support for proteogenomics approaches. In addition, there is now improved support for the encoding of de novo sequencing of peptides, spectral library searches, and protein inference. As a key point, the underlying XML schema has only undergone very minor modifications to simplify as much as possible the transition from version 1.1 to version 1.2 for implementers, but there have been several notable updates to the format specification, implementation guidelines, controlled vocabularies and validation software. mzIdentML 1.2 can be described as backwards compatible, in that reading software designed for mzIdentML 1.1 should function in most cases without adaptation. We anticipate that these developments will provide a continued stable base for software teams working to implement the standard. All the related documentation is accessible at http://www.psidev.info/mzidentml. MeSH Headings: Computational Biology/standards; Databases, Protein; Proteomics/standards; Software. Grants: WT101477MA Wellcome Trust; 108504 Wellcome Trust; P41 GM103481 NIGMS NIH HHS; U54 EB020406 NIBIB NIH HHS; BB/L024128/1 Biotechnology and Biological Sciences Research Council; R01 GM087221 NIGMS NIH HHS; B/L005239/1 Biotechnology and Biological Sciences Research Council; BB/K020145/1 Biotechnology and Biological Sciences Research Council; 103139/Z/13/Z Wellcome Trust; Wellcome Trust; 101477 Wellcome Trust; 103139 Wellcome Trust; BB/K01997X/1 Biotechnology and Biological Sciences Research Council; 092076 Wellcome Trust; BB/L024225/1 Biotechnology and Biological Sciences Research Council; BB/H024654/1 Biotechnology and Biological Sciences Research Council; Biotechnology and Biological Sciences Research Council; Wellcome Trust; Bundesministerium für Bildung und Forschung; National Institute of General Medical Sciences; National Institute of Biomedical Imaging and Bioengineering; Bergens Forskningsstiftelse; Norges Forskningsråd; Deutsche Forschungsgemeinschaft; Seventh Framework Programme.
41	A community proposal to integrate proteomics activities in ELIXIR. F1000Res 2017;6. [PMID: 28713550 PMCID: PMC5499783 DOI: 10.12688/f1000research.11751.1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Accepted: 06/06/2017] [Indexed: 11/20/2022] [Open Access] Abstract: Computational approaches have been major drivers behind the progress of proteomics in recent years. The aim of this white paper is to provide a framework for integrating computational proteomics into ELIXIR in the near future, and thus to broaden the portfolio of omics technologies supported by this European distributed infrastructure. This white paper is the direct result of a strategy meeting on ‘The Future of Proteomics in ELIXIR’ that took place in March 2017 in Tübingen (Germany), and involved representatives of eleven ELIXIR nodes. These discussions led to a list of priority areas in computational proteomics that would complement existing activities and close gaps in the portfolio of tools and services offered by ELIXIR so far. We provide some suggestions on how these activities could be integrated into ELIXIR’s existing platforms, and how it could lead to a new ELIXIR use case in proteomics. We also highlight connections to the related field of metabolomics, where similar activities are ongoing. This white paper could thus serve as a starting point for the integration of computational proteomics into ELIXIR. Over the next few months we will be working closely with all stakeholders involved, and in particular with other representatives of the proteomics community, to further refine this paper. Key Words: bioinformatics infrastructure; computational proteomics; data standards; databases; mass spectrometry; multi-omics approaches; proteomics training.
42	2016 update of the PRIDE database and its related tools. Nucleic Acids Res 2016;44:11033. [PMID: 27683222 PMCID: PMC5159556 DOI: 10.1093/nar/gkw880] [Citation(s) in RCA: 600] [Impact Index Per Article: 75.0] [Indexed: 11/14/2022] [Open Access] Grants: BB/I00095X/1 Biotechnology and Biological Sciences Research Council.
43	Erratum to: Making sense of big data in health research: towards an EU action plan. Genome Med 2016;8:118. [PMID: 27821178 PMCID: PMC5100330 DOI: 10.1186/s13073-016-0376-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 10/26/2016] [Accepted: 10/26/2016] [Indexed: 11/10/2022] [Open Access]
44	The ProteomeXchange consortium in 2017: supporting the cultural change in proteomics public data deposition. Nucleic Acids Res 2016;45:D1100-D1106. [PMID: 27924013 PMCID: PMC5210636 DOI: 10.1093/nar/gkw936] [Citation(s) in RCA: 648] [Impact Index Per Article: 81.0] [Received: 09/15/2016] [Accepted: 10/07/2016] [Indexed: 11/13/2022] [Open Access] Abstract: The ProteomeXchange (PX) Consortium of proteomics resources (http://www.proteomexchange.org) was formally started in 2011 to standardize data submission and dissemination of mass spectrometry proteomics data worldwide. We give an overview of the current consortium activities and describe the advances of the past few years. Augmenting the PX founding members (PRIDE and PeptideAtlas, including the PASSEL resource), two new members have joined the consortium: MassIVE and jPOST. ProteomeCentral remains as the common data access portal, providing the ability to search for data sets in all participating PX resources, now with enhanced data visualization components. We describe the updated submission guidelines, now expanded to include four members instead of two. As demonstrated by data submission statistics, PX is supporting a change in culture of the proteomics field: public data sharing is now an accepted standard, supported by requirements for journal submissions resulting in public data release becoming the norm. More than 4500 data sets have been submitted to the various PX resources since 2012. Human is the most represented species with approximately half of the data sets, followed by some of the main model organisms and a growing list of more than 900 diverse species. Data reprocessing activities are becoming more prominent, with both MassIVE and PeptideAtlas releasing the results of reprocessed data sets. Finally, we outline the upcoming advances for ProteomeXchange.
45	Detection of Missing Proteins Using the PRIDE Database as a Source of Mass Spectrometry Evidence. J Proteome Res 2016;15:4101-4115. [PMID: 27581094 PMCID: PMC5099979 DOI: 10.1021/acs.jproteome.6b00437] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Indexed: 12/11/2022] Abstract: The current catalogue of the human proteome is not yet complete, as experimental proteomics evidence is still elusive for a group of proteins known as the missing proteins. The Human Proteome Project (HPP) has been successfully using technology and bioinformatic resources to improve the characterization of such challenging proteins. In this manuscript, we propose a pipeline starting with the mining of the PRIDE database to select a group of data sets potentially enriched in missing proteins that are subsequently analyzed for protein identification with a method based on the statistical analysis of proteotypic peptides. Spermatozoa and the HEK293 cell line were found to be a promising source of missing proteins and clearly merit further attention in future studies. After the analysis of the selected samples, we found 342 PSMs, suggesting the presence of 97 missing proteins in human spermatozoa or the HEK293 cell line, while only 36 missing proteins were potentially detected in the retina, frontal cortex, aorta thoracica, or placenta. The functional analysis of the missing proteins detected confirmed their tissue specificity, and the validation of a selected set of peptides using targeted proteomics (SRM/MRM assays) further supports the utility of the proposed pipeline. As illustrative examples, DNAH3 and TEPP in spermatozoa, and UNCX and ATAD3C in HEK293 cells were some of the more robust and remarkable identifications in this study. We provide evidence indicating the relevance of carefully analyzing the ever-increasing MS/MS data available from PRIDE and other repositories as sources for missing protein detection in specific biological matrices, as revealed for HEK293 cells. Key Words: C-HPP; MS/MS proteomics; PRIDE database; missing proteins.
46	Ten Simple Rules for Taking Advantage of Git and GitHub. PLoS Comput Biol 2016;12:e1004947. [PMID: 27415786 PMCID: PMC4945047 DOI: 10.1371/journal.pcbi.1004947] [Citation(s) in RCA: 68] [Impact Index Per Article: 8.5] [Indexed: 11/19/2022] [Open Access] MeSH Headings: Computational Biology; Guidelines as Topic. Grants: Wellcome Trust; BB/I000909/1 Biotechnology and Biological Sciences Research Council; R01 EB017205 NIBIB NIH HHS; R01 GM094231 NIGMS NIH HHS.
47	Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat Methods 2016;13:651-656. [PMID: 27493588 PMCID: PMC4968634 DOI: 10.1038/nmeth.3902] [Citation(s) in RCA: 114] [Impact Index Per Article: 14.3] [Indexed: 12/13/2022] Abstract: Mass spectrometry (MS) is the main technology used in proteomics approaches. However, on average 75% of spectra analysed in an MS experiment remain unidentified. We propose to use spectrum clustering at a large scale to shed light on these unidentified spectra. PRoteomics IDEntifications database (PRIDE) Archive is one of the largest MS proteomics public data repositories worldwide. By clustering all tandem MS spectra publicly available in PRIDE Archive, coming from hundreds of datasets, we were able to consistently characterize three distinct groups of spectra: 1) incorrectly identified spectra, 2) spectra correctly identified but below the set scoring threshold, and 3) truly unidentified spectra. Using a multitude of complementary analysis approaches, we were able to identify less than 20% of the consistently unidentified spectra. The complete spectrum clustering results are available through the new version of the PRIDE Cluster resource (http://www.ebi.ac.uk/pride/cluster). This resource is intended, among other aims, to encourage and simplify further investigation into these unidentified spectra.
48	Making sense of big data in health research: Towards an EU action plan. Genome Med 2016;8:71. [PMID: 27338147 PMCID: PMC4919856 DOI: 10.1186/s13073-016-0323-y] [Citation(s) in RCA: 124] [Impact Index Per Article: 15.5] [Indexed: 01/09/2023] [Open Access] Abstract: Medicine and healthcare are undergoing profound changes. Whole-genome sequencing and high-resolution imaging technologies are key drivers of this rapid and crucial transformation. Technological innovation combined with automation and miniaturization has triggered an explosion in data production that will soon reach exabyte proportions. How are we going to deal with this exponential increase in data production? The potential of "big data" for improving health is enormous but, at the same time, we face a wide range of challenges to overcome urgently. Europe is very proud of its cultural diversity; however, exploitation of the data made available through advances in genomic medicine, imaging, and a wide range of mobile health applications or connected devices is hampered by numerous historical, technical, legal, and political barriers. European health systems and databases are diverse and fragmented. There is a lack of harmonization of data formats, processing, analysis, and data transfer, which leads to incompatibilities and lost opportunities. Legal frameworks for data sharing are evolving. Clinicians, researchers, and citizens need improved methods, tools, and training to generate, analyze, and query data effectively. Addressing these barriers will contribute to creating the European Single Market for health, which will improve health and healthcare for all Europeans. MeSH Headings: Biomedical Research/legislation & jurisprudence; Biomedical Research/standards; Databases, Factual/legislation & jurisprudence; Databases, Factual/standards; European Union/organization & administration; Health Plan Implementation; Humans; Information Dissemination/legislation & jurisprudence. Grants: Wellcome Trust.
49	Proteomics data visualisation. Proteomics 2016;15:1339-40. [PMID: 25854789 DOI: 10.1002/pmic.201570063] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/07/2022] MeSH Headings: Data Interpretation, Statistical; Proteomics; Software.
50	2016 update of the PRIDE database and its related tools. Nucleic Acids Res 2016;44:D447-56. [PMID: 26527722 PMCID: PMC4702828 DOI: 10.1093/nar/gkv1145] [Citation(s) in RCA: 2508] [Impact Index Per Article: 313.5] [Received: 09/08/2015] [Revised: 10/14/2015] [Accepted: 10/16/2015] [Indexed: 11/18/2022] [Open Access] Abstract: The PRoteomics IDEntifications (PRIDE) database is one of the world-leading data repositories of mass spectrometry (MS)-based proteomics data. Since the beginning of 2014, PRIDE Archive (http://www.ebi.ac.uk/pride/archive/) has been the new PRIDE archival system, replacing the original PRIDE database. Here we summarize the developments in PRIDE resources and related tools since the previous update manuscript in the Database Issue in 2013. PRIDE Archive constitutes a complete redevelopment of the original PRIDE, comprising a new storage backend, data submission system and web interface, among other components. PRIDE Archive supports the most widely used PSI (Proteomics Standards Initiative) data standard formats (mzML and mzIdentML) and implements the data requirements and guidelines of the ProteomeXchange Consortium. The wide adoption of ProteomeXchange within the community has triggered an unprecedented increase in the number of submitted data sets (around 150 data sets per month). We outline some statistics on the current PRIDE Archive data contents. We also report on the status of the PRIDE related stand-alone tools: PRIDE Inspector, PRIDE Converter 2 and the ProteomeXchange submission tool. Finally, we give a brief update on the resources under development, 'PRIDE Cluster' and 'PRIDE Proteomes', which provide a complementary view and quality-scored information of the peptide and protein identification data available in PRIDE Archive. MeSH Headings: Databases, Protein; Mass Spectrometry; Peptides/chemistry; Proteins/chemistry; Proteins/metabolism; Proteomics; Software; User-Computer Interface. Grants: BB/K01997X/1 Biotechnology and Biological Sciences Research Council; BB/L024225/1 Biotechnology and Biological Sciences Research Council; BB/I00095X/1 Biotechnology and Biological Sciences Research Council; WT085949MA Wellcome Trust; BB/I000909/1 Biotechnology and Biological Sciences Research Council; WT101477MA Wellcome Trust.