1. Efficient end-to-end simulation of time-dependent coherent X-ray scattering experiments. Journal of Synchrotron Radiation 2024; 31:517-526. [PMID: 38517755] [DOI: 10.1107/s1600577524001267]
Abstract
Physical optics simulations of beamlines and experiments allow users to test experiment feasibility and tune beamline settings ahead of beam time, making the best use of valuable beam time at synchrotron light sources such as NSLS-II. Such simulations also help to develop and test experimental data processing methods and software in advance. The Synchrotron Radiation Workshop (SRW) software package supports such complex simulations. We demonstrate how recent developments in SRW significantly improve the efficiency of physical optics simulations, such as end-to-end simulations of time-dependent X-ray photon correlation spectroscopy experiments with partially coherent undulator radiation (UR). The molecular dynamics simulation code LAMMPS was chosen to model the sample: a solution of silica nanoparticles in water at room temperature. Real-space distributions of nanoparticles produced by LAMMPS were imported into SRW and used to simulate scattering patterns of partially coherent hard X-ray UR from such a sample at the detector. The partially coherent UR illuminating the sample can be represented by a set of orthogonal coherent modes, obtained by simulating the emission and propagation of this radiation through the coherent hard X-ray (CHX) scattering beamline followed by a coherent-mode decomposition. GPU acceleration was added for several key functions of SRW used in propagation from sample to detector, further improving the speed of the calculations. The accuracy of the simulation is benchmarked against experimental data.
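Coherent-mode decomposition itself reduces to an eigendecomposition of the sampled mutual coherence function. The sketch below is not SRW code; it is a minimal numpy illustration using a Gaussian Schell-model source, with the grid size and widths chosen arbitrarily.

```python
import numpy as np

def coherent_modes(J):
    """Decompose a sampled mutual coherence matrix J(x1, x2) into
    orthogonal coherent modes: J = sum_k w_k * phi_k phi_k^T."""
    # J must be Hermitian and positive semi-definite for a physical field.
    w, phi = np.linalg.eigh(J)
    order = np.argsort(w)[::-1]          # strongest mode first
    return w[order], phi[:, order]

# Gaussian Schell-model source sampled on a 1D grid (illustrative numbers).
x = np.linspace(-1.0, 1.0, 200)
sigma_i, sigma_mu = 0.5, 0.2             # intensity width, coherence length
X1, X2 = np.meshgrid(x, x, indexing="ij")
J = np.exp(-(X1**2 + X2**2) / (4 * sigma_i**2)) \
    * np.exp(-(X1 - X2)**2 / (2 * sigma_mu**2))

w, phi = coherent_modes(J)
# A handful of modes usually carries most of the power of partially
# coherent radiation, which is what makes mode-by-mode propagation cheap.
frac = np.cumsum(w) / np.sum(w)
n_90 = int(np.searchsorted(frac, 0.9)) + 1   # modes holding 90% of power
```

Propagating each of the `n_90` dominant modes independently and summing intensities at the detector is what replaces the full (and far more expensive) partially coherent calculation.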
2. Extensive Angular Sampling Enables the Sensitive Localization of Macromolecules in Electron Tomograms. Int J Mol Sci 2023; 24:13375. [PMID: 37686180] [PMCID: PMC10487639] [DOI: 10.3390/ijms241713375]
Abstract
Cryo-electron tomography provides 3D images of macromolecules in their cellular context. To detect macromolecules in tomograms, template matching (TM) is often used, which relies on 3D models that are often reliable for substantial parts of the macromolecules. However, the extent of rotational searches in particle detection has not been investigated, owing to computational limitations. Here, we provide a GPU implementation of TM as part of the PyTOM software package, which drastically speeds up the orientational search and allows for sampling beyond the Crowther criterion within a feasible timeframe. We quantify the improvements in sensitivity and false-discovery rate for the example of ribosome detection. Sampling at the Crowther criterion, which was effectively impossible with CPU implementations due to the extensive computation times, allows for automated extraction with high sensitivity. Consequently, we also show that extensive angular sampling renders 3D TM sensitive to the local alignment of tilt series and to damage induced by focused ion beam milling. With this new release of PyTOM, we focused on integration with other software packages that support more refined subtomogram-averaging workflows. The automated classification of ribosomes by TM with appropriate angular sampling on locally corrected tomograms has a sufficiently low false-discovery rate to be directly used for high-resolution averaging, and adequate sensitivity to reveal polysome organization.
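As a toy illustration of the scoring that template matching performs at every position and orientation, here is a brute-force 2D sketch in numpy. The real problem is 3D with thousands of orientations and FFT-based correlation; the array sizes and the 90-degree rotation set here are illustrative only.

```python
import numpy as np

def ncc(patch, tmpl):
    """Zero-normalized cross-correlation between two equal-size arrays."""
    p = patch - patch.mean()
    t = tmpl - tmpl.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    return 0.0 if denom == 0 else float((p * t).sum() / denom)

def match(volume, template):
    """Brute-force search over positions and 90-degree in-plane rotations.
    Real TM codes sample thousands of 3D orientations (e.g. at the
    Crowther criterion); this sketch only illustrates the scoring."""
    th, tw = template.shape
    best = (-2.0, None, None)
    for k in range(4):                       # coarse 'angular sampling'
        t = np.rot90(template, k)
        for i in range(volume.shape[0] - th + 1):
            for j in range(volume.shape[1] - tw + 1):
                s = ncc(volume[i:i + th, j:j + tw], t)
                if s > best[0]:
                    best = (s, (i, j), k)
    return best

rng = np.random.default_rng(0)
image = rng.normal(0.0, 0.1, size=(32, 32))      # background noise
template = rng.normal(size=(8, 8))
image[10:18, 5:13] += np.rot90(template, 1)      # plant a rotated copy
score, pos, rot = match(image, template)
```

Too coarse an orientation grid means the planted rotation is never tried and the peak score drops, which is the sensitivity argument the abstract makes.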
3.
Abstract
Drug development is a broad scientific field that faces many challenges today. Among them are extremely high development costs, long development times, and the small number of new drugs approved each year. New and innovative technologies are needed to make small-molecule drug discovery more time and cost efficient and to allow previously undruggable receptor classes, such as protein-protein interactions, to be targeted. Structure-based virtual screenings (SBVSs) have become a leading contender in this context. In this review, we give an introduction to the foundations of SBVSs and survey their progress in the past few years, with a focus on ultralarge virtual screenings (ULVSs). We outline key principles of SBVSs, recent success stories, new screening techniques, available deep learning-based docking methods, and promising future research directions. ULVSs have enormous potential for the development of new small-molecule drugs and are already starting to transform early-stage drug discovery.
4. Accelerating genomic workflows using NVIDIA Parabricks. BMC Bioinformatics 2023; 24:221. [PMID: 37259021] [PMCID: PMC10230726] [DOI: 10.1186/s12859-023-05292-2]
Abstract
BACKGROUND As genome sequencing becomes better integrated into scientific research, government policy, and personalized medicine, the primary challenge for researchers is shifting from generating raw data to analyzing these vast datasets. Although much work has been done to reduce compute times using various configurations of traditional CPU computing infrastructures, Graphics Processing Units (GPUs) offer opportunities to accelerate genomic workflows by orders of magnitude. Here we benchmark one GPU-accelerated software suite called NVIDIA Parabricks on Amazon Web Services (AWS), Google Cloud Platform (GCP), and an NVIDIA DGX cluster. We benchmarked six variant calling pipelines, including two germline callers (HaplotypeCaller and DeepVariant) and four somatic callers (Mutect2, MuSE, LoFreq, SomaticSniper). RESULTS We achieved up to 65× acceleration with germline variant callers, bringing HaplotypeCaller runtimes down from 36 h to 33 min on AWS, 35 min on GCP, and 24 min on the NVIDIA DGX. Somatic callers exhibited more variation across GPU counts and computing platforms. On cloud platforms, GPU-accelerated germline callers resulted in cost savings compared with CPU runs, whereas some somatic callers were more expensive than CPU runs because their GPU acceleration was not sufficient to overcome the increased GPU cost. CONCLUSIONS Germline variant callers scaled well with the number of GPUs across platforms, whereas somatic variant callers varied in which GPU count gave the fastest runtimes, suggesting that, at least with the version of Parabricks used here, these workflows are less GPU optimized and require benchmarking on the platform of choice before being deployed at production scales. Our study demonstrates that GPUs can greatly accelerate genomic workflows, bringing urgent societal advances in biosurveillance and personalized medicine closer within reach.
5. An Accelerated Pipeline for Multi-label Renal Pathology Image Segmentation at the Whole Slide Image Level. Proceedings of SPIE - The International Society for Optical Engineering 2023; 12471:124710Q. [PMID: 38606193] [PMCID: PMC11008744] [DOI: 10.1117/12.2653651]
Abstract
Deep-learning techniques have been used widely to alleviate the labour-intensive and time-consuming manual annotation required for pixel-level tissue characterization. Our previous study introduced an efficient single dynamic network - Omni-Seg - that achieved multi-class multi-scale pathological segmentation with less computational complexity. However, the patch-wise segmentation paradigm still applies to Omni-Seg, and the pipeline is time-consuming when providing segmentation for Whole Slide Images (WSIs). In this paper, we propose an enhanced version of the Omni-Seg pipeline in order to reduce the repetitive computing processes and utilize a GPU to accelerate the model's prediction for both better model performance and faster speed. Our proposed method's innovative contribution is two-fold: (1) a Docker is released for an end-to-end slide-wise multi-tissue segmentation for WSIs; and (2) the pipeline is deployed on a GPU to accelerate the prediction, achieving better segmentation quality in less time. The proposed accelerated implementation reduced the average processing time (at the testing stage) on a standard needle biopsy WSI from 2.3 hours to 22 minutes, using 35 WSIs from the Kidney Tissue Atlas (KPMP) Datasets. The source code and the Docker have been made publicly available at https://github.com/ddrrnn123/Omni-Seg.
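The patch-wise paradigm such pipelines accelerate boils down to tiling the slide, batching patch predictions on the GPU, and stitching results back. A minimal numpy sketch of lossless tiling and stitching (this is not the Omni-Seg code; the patch size and padding mode are arbitrary):

```python
import numpy as np

def tile(image, patch):
    """Split a 2D array into non-overlapping patches (reflect-padding the
    border), yielding (row, col, patch_array). Slide-level pipelines batch
    these patches onto the GPU instead of predicting them one by one."""
    h, w = image.shape
    padded = np.pad(image, ((0, -h % patch), (0, -w % patch)),
                    mode="reflect")
    for i in range(0, padded.shape[0], patch):
        for j in range(0, padded.shape[1], patch):
            yield i, j, padded[i:i + patch, j:j + patch]

def stitch(patches, shape, patch):
    """Reassemble per-patch predictions into a slide-sized array."""
    h, w = shape
    out = np.zeros((h + (-h % patch), w + (-w % patch)))
    for i, j, p in patches:
        out[i:i + patch, j:j + patch] = p
    return out[:h, :w]

# Identity round-trip: tiling then stitching reproduces the input exactly.
img = np.arange(30 * 17, dtype=float).reshape(30, 17)
restored = stitch(tile(img, 8), img.shape, 8)
```

In a real pipeline the model's per-patch prediction would sit between `tile` and `stitch`; batching many patches per GPU call is where the speedup comes from.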
6. Development and evaluation of a GPU-based coupled three-dimensional hydrodynamic and water quality model. Marine Pollution Bulletin 2023; 187:114494. [PMID: 36581522] [DOI: 10.1016/j.marpolbul.2022.114494]
Abstract
In this study, a graphics processing unit (GPU)-based three-dimensional coupled hydrodynamic and water quality numerical model (GPUOM-WQ) was developed for the first time, introducing pollution sources from atmospheric deposition, aquaculture wastewater, and oil platform emissions to describe marine pollution comprehensively. A test case with analytical solutions and a real case with measured data were used to validate the accuracy of GPUOM-WQ. Simulation results indicate that the maximum error between the numerical and analytical solutions is 0.9%, and the average relative error between simulated and measured values of 5 water quality variables at 38 stations in spring, summer, fall, and winter is 14.63%. In the real case simulation, GPUOM-WQ accelerates the computation by a factor of 62.48, which is 3.23 times faster than the 64-core central processing unit (CPU) parallel mode. This study makes it possible to simulate marine water quality variation and its spatiotemporal distribution accurately, efficiently, and at high resolution.
7. Virtual particle Monte Carlo: A new concept to avoid simulating secondary particles in proton therapy dose calculation. Med Phys 2022; 49:6666-6683. [PMID: 35960865] [PMCID: PMC9588716] [DOI: 10.1002/mp.15913]
Abstract
BACKGROUND In proton therapy dose calculation, Monte Carlo (MC) simulations are superior in accuracy but more time consuming than analytical calculations. Graphics processing units (GPUs) are effective in accelerating MC simulations but may suffer from thread divergence and race conditions among GPU threads, which degrade computing performance when secondary particles are generated during nuclear reactions. PURPOSE A novel concept of virtual particle (VP) MC (VPMC) is proposed to avoid simulating secondary particles in GPU-accelerated proton MC dose calculation and to take full advantage of the computing power of GPUs. METHODS Neutrons and gamma rays were ignored as escaping from the human body; doses of electrons, heavy ions, and nuclear fragments were locally deposited; and the tracks of deuterons were converted into tracks of protons. These particles, together with primary and secondary protons, are considered the realistic particles. Histories of primary and secondary protons were replaced by histories of multiple VPs, each corresponding to one proton (either primary or secondary). A continuous-slowing-down-approximation model, an ionization model, and a large-angle scattering event model corresponding to nuclear interactions were developed for VPs by generating probability distribution functions (PDFs) based on simulation results of realistic particles using MCsquare. For efficient calculations, these PDFs were stored in Compute Unified Device Architecture (CUDA) textures. VPMC was benchmarked against TOPAS and MCsquare in phantoms and against MCsquare in 13 representative patient geometries. Comparisons between the VPMC calculated dose and the dose measured in water during patient-specific quality assurance (PSQA) of the selected 13 patients were also carried out. Gamma analysis was used to compare the doses derived from the different methods, and calculation efficiencies were also compared.
RESULTS Integrated depth dose and lateral dose profiles in both homogeneous and inhomogeneous phantoms all matched well among VPMC, TOPAS, and MCsquare calculations. The 3D-3D gamma passing rates with a criterion of 2%/2 mm and a threshold of 10% were 98.49% between MCsquare and TOPAS and 98.31% between VPMC and TOPAS in homogeneous phantoms, and 99.18% between MCsquare and TOPAS and 98.49% between VPMC and TOPAS in inhomogeneous phantoms, respectively. In patient geometries, the 3D-3D gamma passing rates with 2%/2 mm/10% between dose distributions from VPMC and MCsquare were 98.56 ± 1.09%. The 2D-3D gamma passing rate with 3%/2 mm/10% between the VPMC calculated dose distributions and the 2D measured planar dose distributions during PSQA was 98.91 ± 0.88%. VPMC calculation was highly efficient and took 2.84 ± 2.44 s to finish for the selected 13 patients running on four NVIDIA Ampere GPUs. CONCLUSION VPMC achieves high accuracy and efficiency in proton therapy dose calculation.
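The core trick of replacing secondary-particle transport with sampling from precomputed PDFs can be sketched in a few lines. This is not the VPMC code: the table here lives in a numpy array rather than a CUDA texture, and the Gaussian "scattering angle" PDF is purely illustrative.

```python
import numpy as np

def make_sampler(x, pdf):
    """Precompute the inverse CDF of a tabulated PDF. VPMC stores such
    tables in CUDA textures; here a plain array plus interpolation
    plays the same role."""
    cdf = np.cumsum(pdf)
    cdf = cdf / cdf[-1]
    def sample(u):
        # Invert the CDF: for uniform u in [0, 1) return x with F(x) ~ u.
        return np.interp(u, cdf, x)
    return sample

# Illustrative 'scattering angle' PDF (a narrow Gaussian around zero).
x = np.linspace(-0.1, 0.1, 2001)
pdf = np.exp(-x**2 / (2 * 0.02**2))
sample = make_sampler(x, pdf)

rng = np.random.default_rng(1)
angles = sample(rng.random(200_000))   # one table lookup per 'particle'
```

Because every thread executes the same table lookup regardless of what physics the table encodes, this style of sampling avoids the thread divergence that explicit secondary-particle branching causes.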
8. AreTomo: An integrated software package for automated marker-free, motion-corrected cryo-electron tomographic alignment and reconstruction. J Struct Biol X 2022; 6:100068. [PMID: 35601683] [PMCID: PMC9117686] [DOI: 10.1016/j.yjsbx.2022.100068]
Abstract
AreTomo, an abbreviation for Alignment and Reconstruction for Electron Tomography, is a GPU accelerated software package that fully automates motion-corrected marker-free tomographic alignment and reconstruction in a single package. By correcting in-plane rotation, translation, and, importantly, the local motion resulting from beam-induced motion from tilt to tilt, AreTomo can produce tomograms with sufficient accuracy to be directly used for subtomogram averaging. Another major application is the on-the-fly reconstruction of tomograms in parallel with tilt series collection, providing users with real-time feedback of sample quality and allowing them to make any necessary adjustments to collection parameters. Here, the multiple alignment algorithms implemented in AreTomo are described and the local motions measured on a typical tilt series are analyzed. The residual local motion after correction for global motion was found to be in the range of ±80 Å, indicating that accurate correction of local motion is critical for high-resolution cryo-electron tomography (cryoET).
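The translational part of tilt-series alignment is classically estimated by cross-correlation. A minimal numpy sketch for integer shifts (AreTomo's actual model additionally handles rotation and beam-induced local motion; the image size and planted shift below are arbitrary):

```python
import numpy as np

def shift_between(a, b):
    """Estimate the integer translation of image b relative to a via
    circular cross-correlation in Fourier space; tilt-series aligners use
    the same idea (plus rotation and local-motion models) per tilt."""
    xc = np.fft.ifft2(np.fft.fft2(b) * np.conj(np.fft.fft2(a))).real
    dy, dx = np.unravel_index(np.argmax(xc), xc.shape)
    # Map peak indices to signed shifts.
    if dy > a.shape[0] // 2:
        dy -= a.shape[0]
    if dx > a.shape[1] // 2:
        dx -= a.shape[1]
    return int(dy), int(dx)

rng = np.random.default_rng(2)
ref = rng.normal(size=(64, 64))
moved = np.roll(ref, shift=(5, -7), axis=(0, 1))   # plant a known shift
dy, dx = shift_between(ref, moved)
```

The FFT form costs O(N log N) per tilt, which is what makes on-the-fly alignment during collection feasible.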
9. GPU-accelerated multitiered iterative phasing algorithm for fluctuation X-ray scattering. J Appl Crystallogr 2021; 54:1179-1188. [PMID: 34429723] [PMCID: PMC8366419] [DOI: 10.1107/s1600576721005744]
Abstract
The multitiered iterative phasing (MTIP) algorithm is used to determine the structures of biological macromolecules from fluctuation scattering data. It is an iterative algorithm that reconstructs the electron density of the sample by matching computed fluctuation X-ray scattering data to the external observations while simultaneously enforcing constraints in real and Fourier space. This paper presents the first efforts to accelerate the MTIP algorithm on contemporary graphics processing units (GPUs). The Compute Unified Device Architecture (CUDA) programming model is used to accelerate the MTIP algorithm on NVIDIA GPUs, and the CUDA-based implementation outperforms the CPU-based version by an order of magnitude. Furthermore, the Heterogeneous-Compute Interface for Portability (HIP) runtime APIs are used to demonstrate portability by accelerating the MTIP algorithm across NVIDIA and AMD GPUs.
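MTIP's inner loop alternates projections between real-space and Fourier-space constraints. The following is not MTIP (which works from fluctuation correlations); it is a generic 1D error-reduction sketch of that projection structure, with the size, support, and iteration count chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(3)

# Ground-truth non-negative object confined to a known support.
n = 64
support = np.zeros(n, dtype=bool)
support[20:36] = True
truth = np.zeros(n)
truth[support] = rng.random(16)
target = np.abs(np.fft.fft(truth))              # "measured" magnitudes

def error_reduction(mag, support, iters=300, seed=4):
    """Alternate between Fourier space (impose measured magnitudes) and
    real space (impose support and non-negativity), tracking the
    Fourier-magnitude error at each iteration."""
    rng = np.random.default_rng(seed)
    x = rng.random(len(support)) * support
    errs = []
    for _ in range(iters):
        X = np.fft.fft(x)
        errs.append(float(np.linalg.norm(np.abs(X) - mag)))
        X = mag * np.exp(1j * np.angle(X))      # keep phases, fix magnitudes
        x = np.fft.ifft(X).real
        x = np.where(support & (x > 0), x, 0.0) # real-space constraints
    return x, errs

x_rec, errs = error_reduction(target, support)
```

Each iteration is dominated by FFTs and elementwise projections, which is exactly the kind of workload that maps well onto GPUs.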
10. Transverse-to-transverse diffuse ultrasonic double scattering. Ultrasonics 2021; 111:106301. [PMID: 33316642] [DOI: 10.1016/j.ultras.2020.106301]
Abstract
Previously, a transverse-to-transverse singly scattered response (T-T SSR) model was developed for a pulse-echo configuration, which may have limitations for strongly scattering materials. In this work, a transverse-to-transverse doubly scattered response (T-T DSR) model is presented to model the transverse ultrasonic backscatter more accurately. First, the Wigner distribution of the transducer beam pattern is extended to a transverse wave. Next, the multiple scattering framework is followed to derive the transverse and longitudinal components of the second-order scattering. Then, a quasi-Monte Carlo (QMC) method is used with graphics processing unit (GPU) acceleration to calculate numerical results of the final expression, which contains a five-dimensional integral. The correlation length, the focal length of the transducer, and the incident angle are used to investigate differences between the T-T DSR and T-T SSR models. Finally, a backscatter experiment is performed on two stainless steel specimens with different grain sizes to determine their respective correlation lengths. The results show that the T-T DSR model outperforms the T-T SSR model for evaluating the grain size of these relatively strongly scattering specimens.
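Quasi-Monte Carlo replaces pseudo-random points with a low-discrepancy sequence, which converges faster than plain Monte Carlo for smooth moderate-dimensional integrals like the five-dimensional one above. A self-contained sketch using a Halton sequence (the paper's actual integrand and point set are not reproduced; the product integrand below is a stand-in with a known answer):

```python
import numpy as np

def halton(n, dims):
    """First n points of a Halton low-discrepancy sequence (a simple
    quasi-random generator; production codes often use Sobol points)."""
    primes = [2, 3, 5, 7, 11, 13][:dims]
    out = np.empty((n, dims))
    for d, base in enumerate(primes):
        i = np.arange(1, n + 1)
        f = np.ones(n)
        r = np.zeros(n)
        while np.any(i > 0):        # radical-inverse digit expansion
            f = f / base
            r = r + f * (i % base)
            i = i // base
        out[:, d] = r
    return out

# 5D test integrand with a known value:
# integral over [0,1]^5 of prod(x_i) dx = (1/2)^5 = 0.03125
pts = halton(4096, 5)
estimate = float(np.mean(np.prod(pts, axis=1)))
```

The integrand evaluations are independent across points, so distributing them over GPU threads is straightforward, which is how such five-dimensional integrals become cheap in practice.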
11. NormiRazor: tool applying GPU-accelerated computing for determination of internal references in microRNA transcription studies. BMC Bioinformatics 2020; 21:425. [PMID: 32993488] [PMCID: PMC7523363] [DOI: 10.1186/s12859-020-03743-8]
Abstract
BACKGROUND Multi-gene expression assays are an attractive tool for revealing complex regulatory mechanisms in living organisms. Normalization is an indispensable step of data analysis in all such studies, since it removes unwanted, non-biological variability from the data. In targeted qPCR assays it is typically performed with respect to prespecified reference genes, but the literature reports a lack of robust strategies for their selection, especially in studies of circulating microRNAs (miRNAs). Unfortunately, this problem impedes the translation of scientific discoveries on miRNA biomarkers into widely available laboratory assays. Previous studies concluded that averaged expressions of multi-miRNA combinations are more stable references than single genes. However, the number of such combinations makes the computational load considerable, which may hinder objective reference selection in large datasets. Existing implementations of normalization algorithms (geNorm, NormFinder and BestKeeper) perform poorly and may require days to compute stability values for all potential references, as the evaluation is performed sequentially. RESULTS We designed NormiRazor - an integrative tool that implements those methods in a parallel manner on a graphics processing unit (GPU) using the CUDA platform. We tested our approach on publicly available miRNA expression datasets. As a result, execution times on 8 datasets containing from 50 to 400 miRNAs (subsets of GSE68314) decreased by factors of 18.7 ±0.6 (mean ±SD), 104.7 ±4.2 and 76.5 ±2.2 for geNorm, BestKeeper and NormFinder, respectively, with respect to the previous Python implementation. To give biomedical researchers easy access to the normalization pipeline, we implemented NormiRazor as an online platform where users can normalize their datasets based on automatically selected references. It is available at norm.btm.umed.pl, together with an instruction manual and exemplary datasets.
CONCLUSIONS NormiRazor allows for an easy, informed choice of reference genes for qPCR transcriptomic studies. As such, it can improve the comparability and repeatability of experiments and, in the longer perspective, help translate newly discovered biomarkers into readily available assays.
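The stability screening that such tools parallelize can be illustrated with the geNorm M measure: for each candidate reference, the average standard deviation of its log-ratio against every other candidate, with lower M meaning more stable. A minimal CPU-only numpy sketch (the synthetic data and noise levels are invented for illustration; the GPU version evaluates this over huge numbers of candidate combinations in parallel):

```python
import numpy as np

def genorm_m(log_expr):
    """geNorm-style stability measure. log_expr has shape
    (samples, genes); returns one M value per candidate gene."""
    n_genes = log_expr.shape[1]
    M = np.empty(n_genes)
    for j in range(n_genes):
        sds = [np.std(log_expr[:, j] - log_expr[:, k], ddof=1)
               for k in range(n_genes) if k != j]
        M[j] = np.mean(sds)   # average pairwise variation of gene j
    return M

rng = np.random.default_rng(5)
base = rng.normal(0.0, 1.0, size=(40, 1))            # shared signal
stable = base + rng.normal(0.0, 0.05, size=(40, 2))  # two good references
noisy = base + rng.normal(0.0, 1.0, size=(40, 1))    # one unstable gene
M = genorm_m(np.hstack([stable, noisy]))
```

Each pairwise standard deviation is independent of the others, which is why the computation parallelizes so well across GPU threads.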
12.
Abstract
Background In Overlap-Layout-Consensus (OLC) based de novo assembly, every read must be compared with every other read to find overlaps. This makes the process rather slow and limits the practicality of using de novo assembly methods at a large scale in the field. Darwin is a fast and accurate read overlapper that can be used for de novo assembly of state-of-the-art third-generation long DNA reads. Darwin is designed to be hardware-friendly and can be accelerated on specialized computer system hardware to achieve higher performance. Results This work accelerates Darwin on GPUs. Using real PacBio data, our GPU implementation on a Tesla K40 has shown a speedup of 109x vs 8 CPU threads of an Intel Xeon machine and 24x vs 64 threads of an IBM Power8 machine. The GPU implementation supports both linear and affine gap scoring models. The results show that the GPU implementation achieves the same high speedup for different scoring schemes. Conclusions The GPU implementation proposed in this work shows significant improvement in performance compared to the CPU version, thereby making it practical to use as a read overlapper in a DNA assembly pipeline. Furthermore, our GPU acceleration can also be used for performing fast Smith-Waterman alignment between long DNA reads. GPU hardware has become commonly available in the field today, making the proposed acceleration accessible to a larger public. The implementation is available at https://github.com/Tongdongq/darwin-gpu.
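The gap models mentioned above differ only in how insertions and deletions are charged: a linear model charges every gap base equally, while an affine model (Gotoh's three-matrix recurrence) charges opening a gap more than extending it. A small pure-Python local-alignment sketch (the scores are illustrative defaults, not Darwin's parameters):

```python
def smith_waterman_affine(a, b, match=2, mis=-1, gap_open=-3, gap_ext=-1):
    """Local alignment score with affine gap penalties (Gotoh's method);
    setting gap_open == gap_ext recovers the linear gap model."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # best score ending at (i, j)
    E = [[float('-inf')] * (m + 1) for _ in range(n + 1)]  # gap in a
    F = [[float('-inf')] * (m + 1) for _ in range(n + 1)]  # gap in b
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] + gap_ext, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_ext, H[i - 1][j] + gap_open)
            s = match if a[i - 1] == b[j - 1] else mis
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

Each anti-diagonal of the three matrices can be computed in parallel, which is the standard way such recurrences are mapped onto GPU threads.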
13. GPU accelerated adaptive banded event alignment for rapid comparative nanopore signal analysis. BMC Bioinformatics 2020; 21:343. [PMID: 32758139] [PMCID: PMC7430849] [DOI: 10.1186/s12859-020-03697-x]
Abstract
BACKGROUND Nanopore sequencing enables portable, real-time sequencing applications, including point-of-care diagnostics and in-the-field genotyping. Achieving these outcomes requires efficient bioinformatic algorithms for the analysis of raw nanopore signal data. However, comparing raw nanopore signals to a biological reference sequence is a computationally complex task. The dynamic programming algorithm called Adaptive Banded Event Alignment (ABEA) is a crucial step in polishing sequencing data and identifying non-standard nucleotides, such as measuring DNA methylation. Here, we parallelise and optimise an implementation of the ABEA algorithm (termed f5c) to efficiently run on heterogeneous CPU-GPU architectures. RESULTS By optimising memory, computation and load balancing between CPU and GPU, we demonstrate how f5c can perform ~3-5× faster than an optimised version of the original CPU-only implementation of ABEA in the Nanopolish software package. We also show that f5c enables DNA methylation detection on-the-fly using an embedded System on Chip (SoC) equipped with GPUs. CONCLUSIONS Our work not only demonstrates that complex genomics analyses can be performed on lightweight computing systems, but also benefits High-Performance Computing (HPC). The associated source code for f5c along with the GPU-optimised ABEA is available at https://github.com/hasindu2008/f5c.
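The banding idea at the heart of ABEA applies to any alignment DP: restrict the table to a diagonal band so only O(band · n) cells are computed instead of O(n · m). A minimal pure-Python sketch with a fixed band (ABEA's band moves adaptively and its cells hold event-to-k-mer scores rather than edit costs; the sequences below are arbitrary):

```python
def banded_edit_distance(a, b, band):
    """Edit distance restricted to cells with |i - j| <= band.
    Rows are stored as sparse dicts so out-of-band cells cost nothing;
    a wide enough band reproduces the full-table answer exactly."""
    INF = len(a) + len(b) + 1
    prev = {0: 0}
    for j in range(1, min(band, len(b)) + 1):
        prev[j] = j                       # first row inside the band
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - band), min(len(b), i + band) + 1):
            if j == 0:
                cur[j] = i
                continue
            sub = prev.get(j - 1, INF) + (a[i - 1] != b[j - 1])
            ins = cur.get(j - 1, INF) + 1
            dele = prev.get(j, INF) + 1
            cur[j] = min(sub, ins, dele)
        prev = cur
    return prev[len(b)]

d = banded_edit_distance("GATTACA", "GACTATA", band=3)
```

Within a band, all cells of one row depend only on the previous row and the cell to the left, which is what makes row-parallel GPU execution with modest memory traffic possible.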
14. Evaluation of quantitative, efficient image reconstruction for VersaPET, a compact PET system. Med Phys 2020; 47:2852-2868. [PMID: 32219853] [DOI: 10.1002/mp.14158]
Abstract
PURPOSE Previously we developed a high-resolution positron emission tomography (PET) system - VersaPET - characterized by a block geometry with relatively large axial and transaxial inter-block gaps and a compact geometry susceptible to parallax blurring effects. In this work, we report the qualitative and quantitative evaluation of a graphics processing unit (GPU)-accelerated maximum-likelihood expectation-maximization (MLEM) image reconstruction framework for VersaPET, which features accurate system geometry and projection-space point-spread-function (PSF) modeling. METHODS We combined the ray-tracing module from Software for Tomographic Image Reconstruction (STIR), an open-source PET image reconstruction package, with VersaPET's exact block geometry for the geometric system matrix. PSF modeling of crystal penetration and scattering was achieved by a custom Monte Carlo simulation for projection-space blurring in all dimensions. We also parallelized the reconstruction on the GPU, taking advantage of the system's symmetry for PSF computation. To investigate the effects of PSF width, we generated and studied multiple kernels, ranging from one that reflects the true LYSO density in the MC simulation to one that reflects geometry only (no PSF). GATE simulations of hot- and cold-sphere phantoms with spheres of different sizes, a real microDerenzo phantom, and human blood vessel data were used to characterize the quantitative and qualitative performance of the reconstruction. RESULTS Reconstruction with an accurate system geometry effectively improved image quality compared to STIR (version 3.0), which assumes an idealized system geometry. Reconstructions of GATE-simulated hot-sphere phantom data showed that all PSF kernels achieved superior performance in contrast recovery and bias reduction compared to using no PSF, but may introduce edge artifacts and a lumped background noise pattern depending on the width of the PSF kernel.
Cold-sphere phantom simulation results also indicated improvement in contrast recovery and quantification with PSF modeling (compared to no PSF) for the 5 and 10 mm cold spheres. Real microDerenzo phantom images with the PSF kernel reflecting the true LYSO density showed degraded resolving power for small sectors that could be resolved more clearly by underestimated PSF kernels, which is consistent with recent literature despite differences in scanner geometries and in approaches to system model estimation. The human vessel results resemble those of the hot-sphere phantom simulation, with the PSF kernel reflecting the true LYSO density achieving the highest peak in the time activity curve (TAC) and a similar lumped noise pattern. CONCLUSIONS We fully evaluated a practical MLEM reconstruction framework that we developed for VersaPET in terms of qualitative and quantitative performance. Different PSF kernels may be adopted to improve the results of specific imaging tasks, but the underlying reasons for the variation in optimal kernel between the real and simulation studies require further study.
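The MLEM update at the core of such frameworks is compact; the accuracy lives in the system matrix (geometry plus PSF), not in the iteration itself. A toy numpy sketch with a random dense system matrix and noise-free data (the dimensions and iteration count are arbitrary; real reconstructions use sparse, scanner-accurate matrices):

```python
import numpy as np

def mlem(A, counts, iters):
    """Maximum-likelihood expectation-maximization:
    x <- x / (A^T 1) * A^T (counts / (A x)).
    The multiplicative form preserves non-negativity automatically."""
    x = np.ones(A.shape[1])
    sens = A.T @ np.ones(A.shape[0])            # sensitivity image
    for _ in range(iters):
        proj = np.maximum(A @ x, 1e-12)         # forward projection
        x = x / sens * (A.T @ (counts / proj))  # multiplicative update
    return x

rng = np.random.default_rng(6)
A = rng.random((80, 20))        # toy system matrix: 80 LORs x 20 voxels
truth = rng.random(20) * 10
counts = A @ truth              # noise-free "measured" data
x = mlem(A, counts, iters=500)
rel_residual = np.linalg.norm(A @ x - counts) / np.linalg.norm(counts)
```

Swapping in a more accurate `A` (exact block geometry, PSF blurring) changes the reconstruction quality without changing this update rule, which is the design point the abstract describes.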
15. Accelerating iterative coordinate descent using a stored system matrix. Med Phys 2020; 46:e801-e809. [PMID: 31811796] [DOI: 10.1002/mp.13543]
Abstract
PURPOSE The computational burden associated with model-based iterative reconstruction (MBIR) is still a practical limitation. Iterative coordinate descent (ICD) is an optimization approach for MBIR that has sometimes been thought to be incompatible with modern computing architectures, especially graphics processing units (GPUs). The purpose of this work is to extend the previously released open-source FreeCT_ICD with GPU acceleration and to demonstrate computational performance with ICD that is comparable with simultaneous-update approaches. METHODS FreeCT_ICD uses a stored system matrix (SSM), which precalculates the forward projector in the form of a sparse matrix and then reconstructs on a rotating coordinate grid to exploit helical symmetry. In our GPU ICD implementation, we shuffle the sinogram memory ordering such that data accesses in the sinogram coalesce into fewer transactions. We also update NS voxels in the xy-plane simultaneously to improve occupancy. Conventional ICD updates voxels sequentially (NS = 1). Using NS > 1 eliminates existing convergence guarantees, so convergence behavior in a clinical dataset was studied empirically. RESULTS On a pediatric dataset with a sinogram size of 736 × 16 × 13860 reconstructed to a matrix size of 512 × 512 × 128, our code requires about 20 s per iteration on a single GPU, compared to 2300 s per iteration for a 6-core CPU using FreeCT_ICD. After 400 iterations, the proposed and reference codes converge to within 2 HU RMS difference (RMSD). Using a wFBP initialization, convergence to within 10 HU RMSD is achieved within 4 min. Convergence is similar for NS values between 1 and 256, and NS = 16 was sufficient to achieve maximum performance. Divergence was not observed until NS > 1024. CONCLUSIONS With appropriate modifications, ICD may be able to achieve computational performance competitive with the simultaneous-update algorithms currently used for MBIR.
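Stripped of the stored-matrix machinery, ICD is exact minimization of the quadratic objective one voxel at a time while a running residual is kept up to date. A dense numpy sketch (the real implementation stores A as a precomputed sparse matrix, adds a regularizer, and updates NS voxels concurrently on the GPU; the sizes here are arbitrary):

```python
import numpy as np

def icd_nnls(A, b, sweeps):
    """Coordinate descent on 0.5*||Ax - b||^2 subject to x >= 0.
    Each voxel update is an exact 1D minimization using the running
    residual r = A x - b, so a sweep touches each column once."""
    x = np.zeros(A.shape[1])
    r = -b.astype(float)                  # residual for x = 0
    col_sq = (A * A).sum(axis=0)          # precomputed ||a_j||^2
    for _ in range(sweeps):
        for j in range(A.shape[1]):
            g = A[:, j] @ r               # partial derivative wrt x_j
            new = max(0.0, x[j] - g / col_sq[j])
            r += A[:, j] * (new - x[j])   # keep residual consistent
            x[j] = new
    return x

rng = np.random.default_rng(7)
A = rng.random((60, 15))
truth = rng.random(15)
b = A @ truth                              # consistent, non-negative target
x = icd_nnls(A, b, sweeps=300)
rel_residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
```

Updating NS > 1 coordinates at once, as in the paper, amounts to running several of these inner updates concurrently against a slightly stale residual, which is why the convergence guarantee is lost and must be checked empirically.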
16. SpykeTorch: Efficient Simulation of Convolutional Spiking Neural Networks With at Most One Spike per Neuron. Front Neurosci 2019; 13:625. [PMID: 31354403] [PMCID: PMC6640212] [DOI: 10.3389/fnins.2019.00625]
Abstract
Application of deep convolutional spiking neural networks (SNNs) to artificial intelligence (AI) tasks has recently gained considerable interest, since SNNs are hardware-friendly and energy-efficient. Unlike their non-spiking counterparts, most existing SNN simulation frameworks are not efficient enough for large-scale AI tasks. In this paper, we introduce SpykeTorch, an open-source high-speed simulation framework based on PyTorch. The framework simulates convolutional SNNs with at most one spike per neuron and a rank-order encoding scheme. In terms of learning rules, both spike-timing-dependent plasticity (STDP) and reward-modulated STDP (R-STDP) are implemented, and other rules can be added easily. Beyond these properties, SpykeTorch is highly generic and capable of reproducing the results of various studies. Computations in the proposed framework are tensor-based and performed entirely by PyTorch functions, which in turn enables just-in-time optimization for running on CPU, GPU, or multi-GPU platforms.
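The "at most one spike per neuron" rank-order coding that the framework builds on can be illustrated with a small sketch: stronger inputs fire earlier, each input fires at most once, and the result is an accumulated spike-wave tensor over discrete timesteps. This NumPy version is an assumption-laden stand-in for SpykeTorch's PyTorch-tensor implementation (the equal-population time binning and function name are illustrative):

```python
import numpy as np

def intensity_to_latency(image, timesteps):
    """Rank-order (time-to-first-spike) encoding: stronger pixels spike
    earlier and each pixel emits at most one spike. Returns a
    (timesteps, *image.shape) accumulated binary spike-wave array."""
    flat = image.ravel()
    order = np.argsort(-flat)                 # indices by descending intensity
    bins = np.array_split(order, timesteps)   # equal-population time bins
    waves = np.zeros((timesteps,) + image.shape)
    for t, idx in enumerate(bins):
        waves[(t,) + np.unravel_index(idx, image.shape)] = 1.0
    return np.cumsum(waves, axis=0)           # spikes accumulate over time
```

Because every pixel spikes exactly once, the final timestep of the accumulated wave is all ones, and downstream convolutions can be applied per timestep as ordinary tensor ops.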
Collapse
|
17
|
Efficient Acceleration of the Pair-HMMs Forward Algorithm for GATK HaplotypeCaller on Graphics Processing Units. Evol Bioinform Online 2018; 14:1176934318760543. [PMID: 29568218 PMCID: PMC5858735 DOI: 10.1177/1176934318760543] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2017] [Accepted: 11/17/2017] [Indexed: 12/30/2022] Open
Abstract
GATK HaplotypeCaller (HC) is a popular variant caller that is widely used to identify variants in complex genomes. However, its high variant detection accuracy comes at the cost of long execution times. In GATK HC, the pair-HMMs forward algorithm accounts for a large percentage of the total execution time. This article proposes to accelerate the pair-HMMs forward algorithm on graphics processing units (GPUs) to improve the performance of GATK HC. It presents several GPU-based implementations of the pair-HMMs forward algorithm and analyzes their performance bottlenecks on an NVIDIA Tesla K40 card with various data sets. Based on these results and the characteristics of GATK HC, we identify the GPU-based implementations with the highest performance for the analyzed data sets. Experimental results show that our GPU-based implementations of the pair-HMMs forward algorithm achieve a speedup of up to 5.47× over existing GPU-based implementations.
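The kernel being accelerated is the forward recursion over match (M), insertion (I), and deletion (D) state matrices. A simplified Python sketch of that recursion follows; the transition and emission values are illustrative placeholders, not GATK's base-quality-derived parameters, and insertion emissions are omitted for brevity:

```python
import numpy as np

def pair_hmm_forward(read, hap, p_match=0.99, gap_open=0.01, gap_ext=0.1):
    """Minimal pair-HMM forward recursion: returns the (approximate)
    likelihood of the read given the haplotype, summed over end positions.
    The D row is seeded with 1/len(hap) so alignment may start anywhere,
    mirroring the local-alignment initialization used in GATK HC."""
    n, m = len(read), len(hap)
    M = np.zeros((n + 1, m + 1))
    I = np.zeros((n + 1, m + 1))
    D = np.zeros((n + 1, m + 1))
    D[0, :] = 1.0 / m
    t_mm, t_mi, t_md = 1 - 2 * gap_open, gap_open, gap_open
    t_ii, t_im = gap_ext, 1 - gap_ext
    t_dd, t_dm = gap_ext, 1 - gap_ext
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            emit = p_match if read[i - 1] == hap[j - 1] else (1 - p_match) / 3
            M[i, j] = emit * (t_mm * M[i - 1, j - 1]
                              + t_im * I[i - 1, j - 1]
                              + t_dm * D[i - 1, j - 1])
            I[i, j] = t_mi * M[i - 1, j] + t_ii * I[i - 1, j]
            D[i, j] = t_md * M[i, j - 1] + t_dd * D[i, j - 1]
    return (M[n, :] + I[n, :]).sum()
```

The anti-diagonal data dependence of this recurrence is what makes the GPU mapping nontrivial and motivates the implementation variants compared in the article.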
Collapse
|
18
|
G.A.M.E.: GPU-accelerated mixture elucidator. J Cheminform 2017; 9:50. [PMID: 29086161 PMCID: PMC5602814 DOI: 10.1186/s13321-017-0238-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2017] [Accepted: 09/05/2017] [Indexed: 11/23/2022] Open
Abstract
GPU acceleration is useful in solving complex chemical information problems. Identifying unknown structures from the mass spectra of natural product mixtures has been a desirable yet unresolved goal in metabolomics, hampered by complex experimental data and the inability of instruments to completely separate different compounds. With current high-resolution mass spectrometry, one feasible strategy is to frame the problem as extending a scaffold database with sidechains of different probabilities to match the high-resolution mass obtained from the spectrum. By introducing a dynamic programming (DP) algorithm, it is possible to solve this NP-complete problem in pseudo-polynomial time. However, the running time of the DP algorithm grows by orders of magnitude as the number of mass decimal digits increases, limiting the achievable structural prediction capability. By harnessing the heavily parallel architecture of modern GPUs, we designed a “compute unified device architecture” (CUDA)-based GPU-accelerated mixture elucidator (G.A.M.E.) that considerably improves the performance of the DP, allowing up to five decimal digits of input mass data. As exemplified by four testing datasets with verified constitutions from natural products, G.A.M.E. allows efficient and automatic structural elucidation of unknown mixtures in practical procedures.
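The pseudo-polynomial DP idea can be sketched with a subset-sum-style table: masses are scaled to integers by the chosen number of decimal digits, so the table (and runtime) grows tenfold per extra digit, which is the scaling G.A.M.E. attacks with GPU parallelism. This toy counting version is an assumption of ours, not the paper's exact formulation:

```python
def count_mass_matches(fragment_masses, target_mass, decimals=2):
    """Count multisets of fragment masses that sum exactly to the target,
    with masses scaled to integers by `decimals` decimal digits. Each
    fragment may be reused (unbounded), as in coin-change counting."""
    scale = 10 ** decimals
    target = round(target_mass * scale)
    ways = [0] * (target + 1)
    ways[0] = 1
    for mass in fragment_masses:
        w = round(mass * scale)
        for total in range(w, target + 1):   # ascending: reuse allowed
            ways[total] += ways[total - w]
    return ways[target]
```

The inner loop over `total` has no cross-fragment dependence, which is what makes a CUDA mapping of each table sweep attractive.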
Collapse
|
19
|
Collective behavior of large-scale neural networks with GPU acceleration. Cogn Neurodyn 2017; 11:553-563. [PMID: 29147147 DOI: 10.1007/s11571-017-9446-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2016] [Revised: 06/08/2017] [Accepted: 06/16/2017] [Indexed: 11/25/2022] Open
Abstract
In this paper, the collective behaviors of a small-world neuronal network motivated by the anatomy of a mammalian cortex are studied, based on both the Izhikevich and Rulkov models. The Izhikevich model can not only reproduce the rich behaviors of biological neurons but also has only two equations and one nonlinear term. The Rulkov model takes the form of difference equations that generate a sequence of membrane potential samples at discrete moments of time, improving computational efficiency. Both models are suitable for the construction of large-scale neural networks. By varying key parameters, such as the connection probability and the number of nearest neighbors of each node, the coupled neurons exhibit various temporal and spatial characteristics. It is demonstrated that the GPU implementation achieves increasing acceleration over the CPU as the number of neurons and iterations grows. These two small-world network models and GPU acceleration offer a new opportunity to reproduce real biological networks containing large numbers of neurons.
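The "two equations and one nonlinear term" of the Izhikevich model are easy to see in a direct Euler integration (regular-spiking parameters shown; the time step and input current here are illustrative choices, not the paper's settings):

```python
def izhikevich(I, T=200.0, dt=0.5, a=0.02, b=0.2, c=-65.0, d=8.0):
    """Euler integration of the Izhikevich neuron:
        v' = 0.04 v^2 + 5 v + 140 - u + I   (the single nonlinear term)
        u' = a (b v - u)
    with reset v <- c, u <- u + d when v crosses 30 mV.
    Returns the spike times in ms for a constant input current I."""
    v, u = c, b * c
    spikes = []
    for step in range(int(T / dt)):
        v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
        u += dt * a * (b * v - u)
        if v >= 30.0:                  # spike detected: reset both variables
            spikes.append(step * dt)
            v, u = c, u + d
    return spikes
```

A network simulation repeats this update for every neuron plus synaptic input terms, which is the embarrassingly parallel inner loop the GPU accelerates.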
Collapse
|
20
|
Features extraction and multi-classification of sEMG using a GPU-Accelerated GA/MLP hybrid algorithm. JOURNAL OF X-RAY SCIENCE AND TECHNOLOGY 2017; 25:273-286. [PMID: 28269817 DOI: 10.3233/xst-17259] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
BACKGROUND Surface electromyography (sEMG) signal is the combined effect of superficial muscle EMG and neural electrical activity. In recent years, researchers have conducted many human-machine system studies that use such physiological signals as control signals. OBJECTIVE To develop and test a new multi-classification method to improve the performance of sEMG signal analysis, based on a public sEMG dataset. METHODS First, ten features were selected as candidates. Second, a genetic algorithm (GA) was applied to select representative features from these ten candidates. Third, a multi-layer perceptron (MLP) classifier was trained on the selected optimal features. Last, the trained classifier was used to predict the classes of sEMG signals. A graphics processing unit (GPU) was used to speed up the learning process. RESULTS Experimental results show that the classification accuracy of the new method exceeded 90%, higher than previously reported results. CONCLUSIONS The proposed feature selection method is effective and the classification result is accurate. In addition, our method could have practical application value in medical prosthetics and the potential to improve the robustness of myoelectric pattern recognition.
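The GA feature-selection stage can be sketched as evolution over bitmasks of the ten candidate features, scored by a user-supplied fitness function (in the paper, MLP classification accuracy; here any callable). Everything below, including the selection and mutation scheme, is an illustrative assumption rather than the authors' exact GA:

```python
import random

def ga_select_features(fitness, n_features, pop_size=20, generations=40, seed=0):
    """Toy GA feature selection: individuals are 0/1 masks over candidate
    features. Top half survives each generation (so best fitness never
    decreases); children come from one-point crossover plus one bit flip."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]              # one-point crossover
            child[rng.randrange(n_features)] ^= 1  # point mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

In the paper's pipeline the fitness call would train and evaluate the MLP on the masked feature set, which is where the GPU speedup matters.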
Collapse
|
21
|
GPU accelerated dynamic respiratory motion model correction for MRI-guided cardiac interventions. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2016; 136:31-43. [PMID: 27686701 DOI: 10.1016/j.cmpb.2016.08.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/20/2016] [Revised: 07/10/2016] [Accepted: 08/09/2016] [Indexed: 06/06/2023]
Abstract
BACKGROUND AND OBJECTIVES The use of pre-procedural magnetic resonance (MR) roadmap images for interventional guidance has limited anatomical accuracy due to intra-procedural respiratory motion of the heart. Therefore, the objective of this study is to explore the use of a rapidly updated dynamic motion model to correct for respiratory motion induced errors during MRI-guided cardiac interventions. The motivation for the proposed technique is to improve the accuracy of MRI guidance by taking advantage of the anatomical context provided by the high resolution prior images and the respiratory motion information present in a series of realtime MR images. METHODS We implemented a GPU accelerated image registration algorithm to derive the respiratory motion information and used the resulting transformation parameters to update an adaptive motion model once every heart cycle. In the subsequent heart cycle, the dynamic motion model could be used to predict the respiratory motion and provide a motion estimate to realign the prior volume with the realtime MR image. This iterative update and prediction process is then continuously repeated. RESULTS The GPU accelerated image registration algorithm could be completed in an average of 176.9 ± 14.0 ms, which is 139× faster than a CPU implementation. Thus, it was feasible to update the dynamic model once every heart cycle. The proposed dynamic model also improved the registration accuracy from 86.0 ± 7.5% to 93.0 ± 3.3% in the case of variable breathing patterns, as evaluated by the dice similarity coefficient of the left ventricular border overlap between the prior and realtime images. CONCLUSIONS The feasibility of a dynamic motion correction framework was demonstrated. The resulting improvements may lead to more accurate MRI-guided cardiac interventions in the future.
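The update-then-predict cycle described in the methods can be sketched as a sliding-window regression from a respiratory surrogate to the registration-derived motion parameters. This is a deliberately simplified stand-in: the class name, the 1-D translation target, and the linear fit are all assumptions, while the real system feeds GPU-registration transforms into its adaptive model once per heart cycle:

```python
class AdaptiveMotionModel:
    """Per-heart-cycle adaptive model: fit a line from a respiratory
    surrogate (e.g. a diaphragm position) to an observed translation over
    the most recent cycles, then predict the next cycle's motion."""

    def __init__(self, window=10):
        self.window = window
        self.history = []                     # (surrogate, translation) pairs

    def update(self, surrogate, translation):
        """Called once per heart cycle with the latest registration result."""
        self.history.append((surrogate, translation))
        self.history = self.history[-self.window:]

    def predict(self, surrogate):
        """Least-squares line fit over the sliding window."""
        xs = [s for s, _ in self.history]
        ys = [t for _, t in self.history]
        n = len(xs)
        if n < 2:
            return ys[-1] if ys else 0.0      # not enough data: hold last value
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        if sxx == 0.0:
            return my
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        return my + slope * (surrogate - mx)
```

The prediction realigns the prior roadmap volume during the cycle in which the next registration is still being computed.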
Collapse
|
22
|
Free energy simulations with the AMOEBA polarizable force field and metadynamics on GPU platform. J Comput Chem 2015; 37:614-22. [PMID: 26493154 DOI: 10.1002/jcc.24227] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2015] [Revised: 09/14/2015] [Accepted: 09/24/2015] [Indexed: 01/08/2023]
Abstract
The free energy calculation library PLUMED has been incorporated into the OpenMM simulation toolkit, with the purpose of performing enhanced-sampling MD simulations using the AMOEBA polarizable force field on GPU platforms. Two examples, (i) the free energy profile of water pair separation and (ii) the alanine dipeptide dihedral-angle free energy surface in explicit solvent, are provided to demonstrate the accuracy and efficiency of our implementation. Converged free energy profiles could be obtained within an affordable MD simulation time when the AMOEBA polarizable force field is employed. Moreover, the free energy surfaces estimated using the AMOEBA polarizable force field agree with those calculated from experimental data and ab initio methods. Hence, the implementation in this work is reliable and can be used to study more complicated biological phenomena both accurately and efficiently. © 2015 Wiley Periodicals, Inc.
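The enhanced-sampling method involved, metadynamics, deposits repulsive Gaussians along a collective variable (CV) so the accumulated bias eventually estimates the negative free energy. A minimal sketch of plain (non-well-tempered) bias accumulation on a 1-D CV grid, with illustrative hill height and width rather than PLUMED defaults:

```python
import math

def metadynamics_bias(cv_trajectory, grid, w=0.1, sigma=0.2):
    """Accumulate the metadynamics bias V(s) = sum_k w * exp(-(s - s_k)^2
    / (2 sigma^2)) over visited CV values s_k, evaluated on a grid. At
    convergence, -V(s) approximates the free energy profile up to a constant."""
    bias = [0.0] * len(grid)
    for s in cv_trajectory:
        for i, g in enumerate(grid):
            bias[i] += w * math.exp(-((g - s) ** 2) / (2.0 * sigma ** 2))
    return bias
```

In the paper's setup, PLUMED computes this bias and its forces each step while OpenMM propagates the AMOEBA dynamics on the GPU.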
Collapse
|
23
|
BCL::SAXS: GPU accelerated Debye method for computation of small angle X-ray scattering profiles. Proteins 2015; 83:1500-12. [PMID: 26018949 PMCID: PMC4797635 DOI: 10.1002/prot.24838] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2015] [Revised: 05/08/2015] [Accepted: 05/19/2015] [Indexed: 12/25/2022]
Abstract
Small angle X-ray scattering (SAXS) is an experimental technique used for structural characterization of macromolecules in solution. Here, we introduce BCL::SAXS, an algorithm designed to replicate SAXS profiles from rigid protein models at different levels of detail. We first show our derivation of BCL::SAXS and compare our results with the experimental scattering profile of hen egg white lysozyme. Using this protein, we show how to generate SAXS profiles representing: (1) complete models, (2) models with approximated side chain coordinates, and (3) models with approximated side chain and loop region coordinates. We evaluated the ability of SAXS profiles to identify a correct protein topology from a non-redundant benchmark set of proteins. We find that complete SAXS profiles can be used to identify the correct protein by receiver operating characteristic (ROC) analysis with an area under the curve (AUC) > 99%. We show how our approximation of loop coordinates between secondary structure elements improves protein recognition by SAXS for protein models without loop regions and side chains. Agreement with SAXS data is a necessary but not sufficient condition for structure determination. We conclude that experimental SAXS data can be used as a filter to exclude protein models with large structural differences from the native structure.
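The GPU-accelerated kernel named in the title is the Debye equation, I(q) = Σᵢ Σⱼ fᵢ fⱼ sin(q rᵢⱼ)/(q rᵢⱼ), an O(N²) double sum over atom pairs. A direct Python sketch, using constant point form factors in place of the q-dependent atomic form factors a real SAXS code would use:

```python
import math

def debye_profile(coords, form_factors, q_values):
    """Debye equation evaluated naively over all atom pairs for each q.
    coords: list of (x, y, z); form_factors: one scalar per atom.
    The i == j terms have r = 0, where sin(qr)/(qr) -> 1."""
    n = len(coords)
    profile = []
    for q in q_values:
        intensity = 0.0
        for i in range(n):
            for j in range(n):
                qr = q * math.dist(coords[i], coords[j])
                sinc = 1.0 if qr == 0.0 else math.sin(qr) / qr
                intensity += form_factors[i] * form_factors[j] * sinc
        profile.append(intensity)
    return profile
```

At q = 0 every sinc term is 1, so I(0) = (Σ fᵢ)²; the N² pair loop with no data dependence between q values is what maps cleanly onto a GPU.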
Collapse
|