1
Kong F, Shen T, Li Y, Bashar A, Bird SS, Fiehn O. Denoising Search doubles the number of metabolite and exposome annotations in human plasma using an Orbitrap Astral mass spectrometer. Nat Methods 2025; 22:1008-1016. [PMID: 40155721 DOI: 10.1038/s41592-025-02646-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2024] [Accepted: 02/24/2025] [Indexed: 04/01/2025]
Abstract
Chemical exposures may affect human metabolism and contribute to the etiology of neurodegenerative disorders such as Alzheimer's disease. Identifying these small metabolites involves matching experimental spectra to reference spectra in databases. However, environmental chemicals or physiologically active metabolites are usually present at low concentrations in human specimens. The presence of noise ions can substantially degrade spectral quality, leading to false negatives and reduced identification rates. In response to this challenge, the Spectral Denoising algorithm removes both chemical and electronic noise. Spectral Denoising outperformed alternative methods in benchmarking studies on 240 tested metabolites. It improved high-confidence compound identifications at concentrations that were, on average, 35-fold lower than previously achievable. Spectral Denoising proved highly robust against varying levels of both chemical and electronic noise, even when noise ions were more than 150-fold more intense than true fragment ions. For human plasma samples from patients with Alzheimer's disease that were analyzed on the Orbitrap Astral mass spectrometer, Denoising Search detected 2.5-fold more annotated compounds than on the Exploris 240 Orbitrap instrument, including drug metabolites, household and industrial chemicals, and pesticides.
Affiliation(s)
- Fanzhou Kong
  - Chemistry Department, University of California Davis, Davis, CA, USA
  - West Coast Metabolomics Center, University of California Davis, Davis, CA, USA
- Tong Shen
  - West Coast Metabolomics Center, University of California Davis, Davis, CA, USA
- Yuanyue Li
  - West Coast Metabolomics Center, University of California Davis, Davis, CA, USA
- Oliver Fiehn
  - West Coast Metabolomics Center, University of California Davis, Davis, CA, USA.
2
Haseeb M, Saeed F. GPU-acceleration of the distributed-memory database peptide search of mass spectrometry data. Sci Rep 2023; 13:18713. [PMID: 37907498 PMCID: PMC10618243 DOI: 10.1038/s41598-023-43033-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/02/2023] [Accepted: 09/18/2023] [Indexed: 11/02/2023] Open
Abstract
Database peptide search is the primary computational technique for identifying peptides from mass spectrometry (MS) data. Graphics processing unit (GPU) computing is now ubiquitous in the current generation of high-performance computing (HPC) systems, yet its application in the database peptide search domain remains limited. Part of the reason is the use of sub-optimal algorithms in the existing GPU-accelerated methods, resulting in significantly inefficient hardware utilization. In this paper, we design and implement a new-age CPU-GPU HPC framework, called GiCOPS, for efficient and complete GPU-acceleration of the modern database peptide search algorithms on supercomputers. Our experimentation shows that GiCOPS exhibits a 1.2× to 5× speed improvement over its CPU-only predecessor, HiCOPS, and over 10× improvement over several existing GPU-based database search algorithms for sufficiently large experiment sizes. We further assess and optimize the performance of our framework using the Roofline Model and report near-optimal results for several metrics including computations per second, occupancy rate, memory workload, branch efficiency and shared memory performance. Finally, the CPU-GPU methods and optimizations proposed in our work for complex integer- and memory-bounded algorithmic pipelines can also be extended to accelerate the existing and future peptide identification algorithms. GiCOPS is now integrated with our umbrella HPC framework HiCOPS and is available at: https://github.com/pcdslab/gicops.
Affiliation(s)
- Muhammad Haseeb
  - Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL, USA
- Fahad Saeed
  - Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL, USA.
  - Biomolecular Sciences Institute (BSI), Miami, FL, USA.
  - Department of Human and Molecular Genetics, Herbert Wertheim School of Medicine, Florida International University, Miami, FL, USA.
3
Seneviratne AJ, Peters S, Clarke D, Dausmann M, Hecker M, Tully B, Hains PG, Zhong Q. Improved identification and quantification of peptides in mass spectrometry data via chemical and random additive noise elimination (CRANE). Bioinformatics 2021; 37:4719-4726. [PMID: 34323970 PMCID: PMC8711017 DOI: 10.1093/bioinformatics/btab563] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/11/2020] [Revised: 06/15/2021] [Accepted: 07/28/2021] [Indexed: 11/19/2022] Open
Abstract
Motivation: The output of electrospray ionization–liquid chromatography mass spectrometry (ESI-LC-MS) is influenced by multiple sources of noise, and major contributors can be broadly categorized as baseline, random and chemical noise. Noise has a negative impact on the identification and quantification of peptides, which influences the reliability and reproducibility of MS-based proteomics data. Most attempts at denoising have been made on either spectra or chromatograms independently; thus, important 2D information is lost because the mass-to-charge ratio and retention time dimensions are not considered jointly.
Results: This article presents a novel technique for denoising raw ESI-LC-MS data via 2D undecimated wavelet transform, which is applied to proteomics data acquired by data-independent acquisition MS (DIA-MS). We demonstrate that denoising DIA-MS data results in the improvement of peptide identification and quantification in complex biological samples.
Availability and implementation: The software is available on GitHub (https://github.com/CMRI-ProCan/CRANE). The datasets were obtained from ProteomeXchange (Identifiers—PXD002952 and PXD008651). Preliminary data and intermediate files are available via ProteomeXchange (Identifiers—PXD020529 and PXD025103).
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Akila J Seneviratne
  - ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Sean Peters
  - ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- David Clarke
  - ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Michael Dausmann
  - ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Michael Hecker
  - ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Brett Tully
  - ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Peter G Hains
  - ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Qing Zhong
  - ProCan®, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
4
Benchmarking mass spectrometry based proteomics algorithms using a simulated database. ACTA ACUST UNITED AC 2021; 10. [PMID: 34012763 DOI: 10.1007/s13721-021-00298-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
Abstract
Protein sequencing algorithms process data from a variety of instruments that have been generated under diverse experimental conditions. Currently there is no way to predict the accuracy of an algorithm for a given data set. Most of the published algorithms and associated software have been evaluated on a limited number of experimental data sets. However, these performance evaluations do not cover the complete search space the algorithm and the software might encounter in the real world. To this end, we present a database of simulated spectra that can be used to benchmark any spectra-to-peptide search engine. We demonstrate the usability of this database by benchmarking two popular peptide sequencing engines. We show wide variation in the accuracy of peptide deductions, and that a complete quality profile of a given algorithm can be useful for practitioners and algorithm developers. All benchmarking data is available at https://users.cs.fiu.edu/~fsaeed/Benchmark.html.
5
Tariq MU, Haseeb M, Aledhari M, Razzak R, Parizi RM, Saeed F. Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey. IEEE ACCESS : PRACTICAL INNOVATIONS, OPEN SOLUTIONS 2020; 9:5497-5516. [PMID: 33537181 PMCID: PMC7853650 DOI: 10.1109/access.2020.3047588] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 05/17/2023]
Abstract
Big Data Proteogenomics lies at the intersection of high-throughput Mass Spectrometry (MS) based proteomics and Next Generation Sequencing based genomics. The combined and integrated analysis of these two high-throughput technologies can help discover novel proteins using genomic, and transcriptomic data. Due to the biological significance of integrated analysis, the recent past has seen an influx of proteogenomic tools that perform various tasks, including mapping proteins to the genomic data, searching experimental MS spectra against a six-frame translation genome database, and automating the process of annotating genome sequences. To date, most of such tools have not focused on scalability issues that are inherent in proteogenomic data analysis where the size of the database is much larger than a typical protein database. These state-of-the-art tools can take more than half a month to process a small-scale dataset of one million spectra against a genome of 3 GB. In this article, we provide an up-to-date review of tools that can analyze proteogenomic datasets, providing a critical analysis of the techniques' relative merits and potential pitfalls. We also point out potential bottlenecks and recommendations that can be incorporated in the future design of these workflows to ensure scalability with the increasing size of proteogenomic data. Lastly, we make a case of how high-performance computing (HPC) solutions may be the best bet to ensure the scalability of future big data proteogenomic data analysis.
Affiliation(s)
- Muhammad Usman Tariq
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Muhammad Haseeb
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
- Mohammed Aledhari
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Rehma Razzak
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Reza M Parizi
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Fahad Saeed
  - School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
6
Kim HW, Choi SY, Jang HS, Ryu B, Sung SH, Yang H. Exploring novel secondary metabolites from natural products using pre-processed mass spectral data. Sci Rep 2019; 9:17430. [PMID: 31758082 PMCID: PMC6874550 DOI: 10.1038/s41598-019-54078-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/06/2018] [Accepted: 11/08/2019] [Indexed: 02/04/2023] Open
Abstract
Many natural product chemists are working to identify a wide variety of novel secondary metabolites from natural materials and are eager to avoid repeatedly discovering known compounds. Here, we developed liquid chromatography/mass spectrometry (LC/MS) data-processing protocols for assessing high-throughput spectral data from natural sources and scoring the novelty of unknown metabolites from natural products. This approach automatically produces representative MS spectra (RMSs) corresponding to single secondary metabolites in natural sources. In this study, we used the RMSs of Agrimonia pilosa roots and aerial parts as models to reveal the structural similarities of their secondary metabolites and to identify novel compounds, and we isolated nine new compounds of three types: three pilosanidin-type and four pilosanol-type molecules and two 3-hydroxy-3-methylglutaryl (HMG)-conjugated chromones. Furthermore, we devised a new scoring system, the Fresh Compound Index (FCI), which grades the novelty of single secondary metabolites from a natural material using an in-house database constructed from 466 representative medicinal plants from East Asian countries. We expect that the FCIs of RMSs in a sample will help natural product chemists to discover other compounds of interest with similar chemical scaffolds or novel compounds and will provide insights relevant to the structural diversity and novelty of secondary metabolites in natural products.
Affiliation(s)
- Hyun Woo Kim
  - College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul, 08826, Korea
- Seong Yeon Choi
  - Laboratory of Natural Products Chemistry, College of Pharmacy, Kangwon National University, Chuncheon, 24341, Korea
- Hyeon Seok Jang
  - Laboratory of Natural Products Chemistry, College of Pharmacy, Kangwon National University, Chuncheon, 24341, Korea
- Byeol Ryu
  - College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul, 08826, Korea
- Sang Hyun Sung
  - College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul, 08826, Korea
- Heejung Yang
  - Laboratory of Natural Products Chemistry, College of Pharmacy, Kangwon National University, Chuncheon, 24341, Korea.
7
Prakash A, Ahmad S, Majumder S, Jenkins C, Orsburn B. Bolt: a New Age Peptide Search Engine for Comprehensive MS/MS Sequencing Through Vast Protein Databases in Minutes. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2019; 30:2408-2418. [PMID: 31452088 DOI: 10.1007/s13361-019-02306-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 05/31/2019] [Revised: 07/24/2019] [Accepted: 07/25/2019] [Indexed: 06/10/2023]
Abstract
Recent increases in mass spectrometry speed, sensitivity, and resolution now permit comprehensive proteomics coverage. However, the results are often hindered by sub-optimal data processing pipelines. In almost all MS/MS peptide search engines, users must limit their search space to a canonical database due to time constraints and q value considerations, but this typically does not reflect the individual genetic variations of the organism being studied. In addition, engines will nearly always assume the presence of only fully tryptic peptides and limit PTMs to a handful. Even on high-performance servers, these search engines are computationally expensive, and most users decide to dial back their search parameters. We present Bolt, a new cloud-based search engine that can search more than 900,000 protein sequences (canonical, isoform, mutations, and contaminants) with 41 post-translational modifications and N-terminal and C-terminal partial tryptic search in minutes on a standard configuration laptop. Along with increases in speed, Bolt provides an additional benefit of improvement in high-confidence identifications. Sixty-one percent of peptides uniquely identified by Bolt may be validated by strong fragmentation patterns, compared with 13% of peptides uniquely identified by SEQUEST and 6% of peptides uniquely identified by Mascot. Furthermore, 30% of unique Bolt identifications were verified by all three software packages on the longer gradient analysis, compared with only 20% and 27% for SEQUEST and Mascot identifications, respectively. Bolt represents, to the best of our knowledge, the first fully scalable, cloud-based quantitative proteomic solution that can be operated within a user-friendly GUI. Data are available via ProteomeXchange with identifier PXD012700.
Affiliation(s)
- Amol Prakash
  - Optys Tech Corporation, Shrewsbury, MA, 01545, USA.
- Shadab Ahmad
  - Optys Tech Corporation, Shrewsbury, MA, 01545, USA
- Conor Jenkins
  - Department of Biology, Hood College, Frederick, MD, 21701, USA
- Ben Orsburn
  - Proteomic und Genomic Sciences, Baltimore, MD, 21214, USA
8
Deng Y, Ren Z, Pan Q, Qi D, Wen B, Ren Y, Yang H, Wu L, Chen F, Liu S. pClean: An Algorithm To Preprocess High-Resolution Tandem Mass Spectra for Database Searching. J Proteome Res 2019; 18:3235-3244. [PMID: 31364357 DOI: 10.1021/acs.jproteome.9b00141] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
Abstract
Database searches of MS/MS spectra are the main approach to peptide/protein identification in proteomics. Since most database search engines only utilize a small portion of the original MS/MS signals for peptide detection, how to improve the quality of MS/MS signals is a primary concern for enhancement of the peptide/protein identification rate. A fundamental issue is that some noise MS signals, informative or uninformative, have to be filtered out prior to database searching. Herein, an integrative preprocessing algorithm was designed, termed pClean, which incorporates three modules to preprocess MS/MS spectra, such as the removal of isobaric-labeling related ions, the reduction in isotopic peaks, the deconvolution of ions with higher charges, and the clearance of uninformative MS/MS signals. In contrast to the currently available approaches to MS/MS data preprocessing, pClean enables treatment of MS/MS spectra with high mass accuracy and favors filtering for the labeling or nonlabeling of peptides. Data sets at various scales gained from mass spectrometers with high resolution were used to assess the quality of peptides identified after pClean treatment and to compare the pClean improvement with those of other software programs. On the basis of the analysis of peptides identified and the Mascot ion score, pClean was proven to be effective in the removal of mass spectral noise and the reduction of random matching. Compared with other software programs, pClean appeared to be beneficial in terms of preprocessing performances for the enhancement of confidence scores and the increase in peptides identified. pClean is available at https://github.com/AimeeD90/pClean_release .
Affiliation(s)
- Yamei Deng
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
  - University of the Chinese Academy of Sciences, Beijing 100049, China
  - BGI-Shenzhen, Shenzhen 518083, China
- Zhe Ren
  - BGI-Shenzhen, Shenzhen 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
- Qingfei Pan
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
  - University of the Chinese Academy of Sciences, Beijing 100049, China
  - BGI-Shenzhen, Shenzhen 518083, China
- Da Qi
  - BGI-Shenzhen, Shenzhen 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
- Yan Ren
  - BGI-Shenzhen, Shenzhen 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
- Huanming Yang
  - BGI-Shenzhen, Shenzhen 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
  - James D. Watson Institute of Genome Sciences, Hangzhou 310058, China
- Lin Wu
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
- Fei Chen
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
- Siqi Liu
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
  - BGI-Shenzhen, Shenzhen 518083, China
  - China National GeneBank, BGI-Shenzhen, Shenzhen 518120, China
9
Awan MG, Saeed F. MaSS-Simulator: A Highly Configurable Simulator for Generating MS/MS Datasets for Benchmarking of Proteomics Algorithms. Proteomics 2018; 18:e1800206. [PMID: 30216669 PMCID: PMC6400488 DOI: 10.1002/pmic.201800206] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 05/14/2018] [Revised: 08/23/2018] [Indexed: 11/11/2022]
Abstract
Mass Spectrometry (MS)-based proteomics has become an essential tool in the study of proteins. With the advent of modern MS machines, huge amounts of data are being generated, which can only be processed by novel algorithmic tools. However, in the absence of data benchmarks and ground truth datasets, algorithmic integrity testing and reproducibility is a challenging problem. To this end, MaSS-Simulator has been presented, which is an easy-to-use simulator and can be configured to simulate MS/MS datasets for a wide variety of conditions with known ground truths. MaSS-Simulator offers many configuration options to allow the user a great degree of control over the test datasets, which can enable rigorous and large-scale testing of any proteomics algorithm. MaSS-Simulator is assessed by comparing its performance against experimentally generated spectra and spectra obtained from the NIST spectral library collections. The results show that MaSS-Simulator generated spectra match closely with real spectra and have a relative-error distribution centered around 25%. In contrast, the theoretical spectra for the same peptides have a relative-error distribution centered around 150%. MaSS-Simulator will enable developers to specifically highlight the capabilities of their algorithms and provide a strong proof of any pitfalls they might face. Source code, executables, and a user manual for MaSS-Simulator can be downloaded from https://github.com/pcdslab/MaSS-Simulator.
Affiliation(s)
- Muaaz Gul Awan
  - Department of Computer Science, Western Michigan University, MI, USA
- Fahad Saeed
  - School of Computing and Information Sciences, Florida International University, Miami, FL, USA
10
Awan MG, Eslami T, Saeed F. GPU-DAEMON: GPU algorithm design, data management & optimization template for array based big omics data. Comput Biol Med 2018; 101:163-173. [PMID: 30145436 DOI: 10.1016/j.compbiomed.2018.08.015] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 05/07/2018] [Revised: 08/10/2018] [Accepted: 08/12/2018] [Indexed: 11/29/2022]
Abstract
In the age of ever-increasing data, faster and more efficient data processing algorithms are needed. Graphics Processing Units (GPUs) are emerging as a cost-effective alternative architecture for high-end computing. The optimal design of GPU algorithms is a challenging task which requires thorough understanding of the high performance computing architecture as well as the algorithmic design. The steep learning curve needed for effective GPU-centric algorithm design and implementation requires considerable expertise, time, and resources. In this paper, we present GPU-DAEMON, a GPU Data Management, Algorithm Design and Optimization technique suitable for processing array based big omics data. Our proposed GPU algorithm design template outlines and provides generic methods to tackle critical bottlenecks, which can be followed to implement high performance, scalable GPU algorithms for a given big data problem. We study the capability of GPU-DAEMON by reviewing the implementation of GPU-DAEMON based algorithms for three different big data problems. Speedups as large as 386x (over the sequential version) and 50x (over naive GPU design methods) are observed using the proposed GPU-DAEMON. The GPU-DAEMON template is available at https://github.com/pcdslab/GPU-DAEMON and the source codes for GPU-ArraySort, G-MSR and GPU-PCC are available at https://github.com/pcdslab.
Affiliation(s)
- Muaaz Gul Awan
  - Department of Computer Science, Western Michigan University, Kalamazoo, MI, USA
- Taban Eslami
  - Department of Computer Science, Western Michigan University, Kalamazoo, MI, USA
- Fahad Saeed
  - School of Computing and Information Sciences, Florida International University, Miami, FL, USA.
11
Awan MG, Saeed F. An Out-of-Core GPU based dimensionality reduction algorithm for Big Mass Spectrometry Data and its application in bottom-up Proteomics. ACM-BCB ... ... : THE ... ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE. ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE 2017; 2017:550-555. [PMID: 28868521 DOI: 10.1145/3107411.3107466] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 10/19/2022]
Abstract
Modern high resolution Mass Spectrometry instruments can generate millions of spectra in a single systems biology experiment. Each spectrum consists of thousands of peaks, but only a small number of peaks actively contribute to deduction of peptides. Therefore, pre-processing of MS data to detect noisy and non-useful peaks is an active area of research. Most of the sequential noise-reducing algorithms are impractical to use as a pre-processing step due to high time-complexity. In this paper, we present a GPU based dimensionality-reduction algorithm, called G-MSR, for MS2 spectra. Our proposed algorithm uses novel data structures which optimize the memory and computational operations inside the GPU. These novel data structures include Binary Spectra and Quantized Indexed Spectra (QIS). The former helps in communicating essential information between CPU and GPU using a minimum amount of data, while the latter enables us to store and process a complex 3-D data structure as a 1-D array structure while maintaining the integrity of MS data. Our proposed algorithm also takes into account the limited memory of GPUs and switches between in-core and out-of-core modes based upon the size of input data. G-MSR achieves a peak speed-up of 386x over its sequential counterpart and is shown to process over a million spectra in just 32 seconds. The code for this algorithm is available as GPL open-source software on GitHub at the following link: https://github.com/pcdslab/G-MSR.
Affiliation(s)
- Muaaz Gul Awan
  - Department of Computer Science, Western Michigan University, 4601 Campus Drive, Kalamazoo, Michigan 49009,
- Fahad Saeed
  - Department of Computer Science, Western Michigan University, 4601 Campus Drive, Kalamazoo, Michigan 49009,