1
Xia Z, Yang C, Peng C, Guo Y, Guo Y, Tang T, Cui Y. Fast noisy long read alignment with multi-level parallelism. BMC Bioinformatics 2025; 26:118. [PMID: 40316905 PMCID: PMC12049014 DOI: 10.1186/s12859-025-06129-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2024] [Accepted: 04/01/2025] [Indexed: 05/04/2025]
Abstract
BACKGROUND The advent of Single Molecule Real-Time (SMRT) sequencing has overcome many limitations of second-generation sequencing, such as limited read lengths and PCR amplification biases. However, longer reads substantially increase data volume, and high error rates make many existing alignment tools inapplicable. Additionally, the performance bottleneck of a single CPU restricts the effectiveness of alignment algorithms for SMRT sequencing. RESULTS To address these challenges, we introduce ParaHAT, a parallel alignment algorithm for noisy long reads. ParaHAT utilizes vector-level, thread-level, process-level, and heterogeneous parallelism. We redesign the layout of the dynamic programming matrices to eliminate data dependencies in the base-level alignment, enabling effective vectorization. We further enhance computational speed through heterogeneous parallel technology and implement the algorithm for multi-node computing using MPI, overcoming the computational limits of a single node. CONCLUSIONS Performance evaluations show that ParaHAT achieves a 10.03x speedup in base-level alignment, with a parallel acceleration ratio and weak scalability metric of 94.61 and 98.98% on 128 nodes, respectively.
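The abstract does not spell out ParaHAT's matrix layout, so the following is only a generic illustration of why a layout change can remove data dependencies: when a dynamic programming matrix is traversed by anti-diagonals, every cell on one anti-diagonal depends only on the two previous anti-diagonals, so the cells of a diagonal can be updated independently (and hence vectorized). The sketch uses plain edit distance as a stand-in scoring model; the function name and the scoring are assumptions, not ParaHAT's method.

```python
def edit_distance_antidiagonal(a, b):
    """Edit distance computed anti-diagonal by anti-diagonal (wavefront order).

    Illustrative only: cells with i + j == d depend solely on cells with
    i + j == d - 1 or d - 2, so the inner loop has no loop-carried
    dependency and could be executed as one vector operation.
    """
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i          # deleting i characters of a
    for j in range(m + 1):
        dist[0][j] = j          # inserting j characters of b

    for d in range(2, n + m + 1):            # anti-diagonal index d = i + j
        lo = max(1, d - m)                   # keep j = d - i within [1, m]
        hi = min(n, d - 1)
        for i in range(lo, hi + 1):          # independent cells of diagonal d
            j = d - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,          # deletion   (diagonal d - 1)
                dist[i][j - 1] + 1,          # insertion  (diagonal d - 1)
                dist[i - 1][j - 1] + cost,   # (mis)match (diagonal d - 2)
            )
    return dist[n][m]
```

A production implementation would typically store each anti-diagonal contiguously in memory so that SIMD loads are unit-stride; the full matrix is kept here only for readability.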
Affiliation(s)
- Zeyu Xia
- College of Computer Science and Technology, National University of Defense Technology, 410073, Changsha, China
- Canqun Yang
- College of Computer Science and Technology, National University of Defense Technology, 410073, Changsha, China
- National Supercomputer Center in Tianjin, 300457, Tianjin, China
- Haihe Lab of ITAI, 300457, Tianjin, China
- Chenchen Peng
- College of Computer Science and Technology, National University of Defense Technology, 410073, Changsha, China
- Yifei Guo
- College of Computer Science and Technology, National University of Defense Technology, 410073, Changsha, China
- Yufei Guo
- College of Computer Science and Technology, National University of Defense Technology, 410073, Changsha, China
- Tao Tang
- College of Computer Science and Technology, National University of Defense Technology, 410073, Changsha, China
- Yingbo Cui
- College of Computer Science and Technology, National University of Defense Technology, 410073, Changsha, China
2
Schmidt B, Kallenborn F, Chacon A, Hundt C. CUDASW++4.0: ultra-fast GPU-based Smith-Waterman protein sequence database search. BMC Bioinformatics 2024; 25:342. [PMID: 39488701 PMCID: PMC11531700 DOI: 10.1186/s12859-024-05965-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Accepted: 10/22/2024] [Indexed: 11/04/2024]
Abstract
BACKGROUND The maximal sensitivity for local pairwise alignment makes the Smith-Waterman algorithm a popular choice for protein sequence database search. However, its quadratic time complexity makes it compute-intensive. Unfortunately, current state-of-the-art software tools are not able to leverage the massively parallel processing capabilities of modern GPUs with close-to-peak performance. This motivates the need for more efficient implementations. RESULTS CUDASW++4.0 is a fast software tool for scanning protein sequence databases with the Smith-Waterman algorithm on CUDA-enabled GPUs. Our approach achieves high efficiency for dynamic programming-based alignment computation by minimizing memory accesses and instructions. We provide efficient matrix tiling and sequence database partitioning schemes, and exploit next-generation floating point arithmetic and novel DPX instructions. This leads to close-to-peak performance on modern GPU generations (Ampere, Ada, Hopper) with throughput rates of up to 1.94 TCUPS, 5.01 TCUPS, and 5.71 TCUPS on an A100, L40S, and H100, respectively. Evaluation on the Swiss-Prot, UniRef50, and TrEMBL databases shows that CUDASW++4.0 achieves more than an order-of-magnitude performance improvement over previous GPU-based approaches (CUDASW++3.0, ADEPT, SW#DB). In addition, our algorithm demonstrates significant speedups over top-performing CPU-based tools (BLASTP, SWIPE, SWIMM2.0), can exploit multi-GPU nodes with linear scaling, and features an impressive energy efficiency of up to 15.7 GCUPS/Watt. CONCLUSION CUDASW++4.0 changes the standing of GPUs in protein sequence database search with Smith-Waterman alignment by providing close-to-peak performance on modern GPUs. It is freely available at https://github.com/asbschmidt/CUDASW4 .
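To make the quantities in this abstract concrete, here is a minimal Python sketch of the Smith-Waterman recurrence and of how a cell-updates-per-second (CUPS) figure is derived. It is not the CUDASW++4.0 implementation: it assumes a linear gap penalty and a toy match/mismatch scheme instead of a protein substitution matrix with affine gaps, and the function names and default scores are illustrative.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score under a linear gap penalty (toy scoring)."""
    prev = [0] * (len(target) + 1)   # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]                   # column 0 is always 0 for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def cups(query_len, target_len, seconds):
    """Cell updates per second: one DP cell per (query, target) residue pair."""
    return query_len * target_len / seconds
```

The quadratic cost mentioned in the abstract is visible in the nested loops: every query residue is compared with every target residue, which is why throughput is reported in cell updates per second (1 TCUPS is 10^12 cell updates per second).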
Affiliation(s)
- Bertil Schmidt
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
- Felix Kallenborn
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany
3
Zhang Y, Zhang Q, Zhou J, Zou Q. A survey on the algorithm and development of multiple sequence alignment. Brief Bioinform 2022; 23:6546258. [PMID: 35272347 DOI: 10.1093/bib/bbac069] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/09/2021] [Revised: 01/30/2022] [Accepted: 02/09/2022] [Indexed: 12/21/2022]
Abstract
Multiple sequence alignment (MSA) is an essential cornerstone in bioinformatics, which can reveal the potential information in biological sequences, such as function, evolution and structure. MSA is widely used in many bioinformatics scenarios, such as phylogenetic analysis, protein analysis and genomic analysis. However, MSA faces new challenges with the gradual increase in sequence scale and the increasing demand for alignment accuracy. Therefore, developing an efficient and accurate strategy for MSA has become one of the research hotspots in bioinformatics. In this work, we mainly summarize the algorithms for MSA and its applications in bioinformatics. To provide a structured and clear perspective, we systematically introduce MSA's knowledge, including background, database, metric and benchmark. Besides, we list the most common applications of MSA in the field of bioinformatics, including database searching, phylogenetic analysis, genomic analysis, metagenomic analysis and protein analysis. Furthermore, we categorize and analyze classical and state-of-the-art algorithms, divided into progressive alignment, iterative algorithm, heuristics, machine learning and divide-and-conquer. Moreover, we also discuss the challenges and opportunities of MSA in bioinformatics. Our work provides a comprehensive survey of MSA applications and their relevant algorithms. It could bring valuable insights for researchers to contribute their knowledge to MSA and relevant studies.
Affiliation(s)
- Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China; School of Computer Science and Engineering, University of Electronic Science and Technology of China, 611731, Chengdu, China
- Qiang Zhang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Jiliu Zhou
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, China
4
Abstract
Multiple sequence alignment (MSA) is a central step in many bioinformatics and computational biology analyses. Although there exist many methods to perform MSA, most of them fail when dealing with large datasets due to their high computational cost. MSAProbs-MPI is a publicly available tool ( http://msaprobs.sourceforge.net ) that provides highly accurate results in relatively short runtime thanks to exploiting the hardware resources of multicore clusters. In this chapter, I explain the statistical and biological concepts employed in MSAProbs-MPI to complete the alignments, as well as the high-performance computing techniques used to accelerate it. Moreover, I provide some hints about the configuration parameters that should be used to guarantee high-performance executions.
5
Naznooshsadat E, Elham P, Ali SZ. FAME: fast and memory efficient multiple sequences alignment tool through compatible chain of roots. Bioinformatics 2020; 36:3662-3668. [PMID: 32170927 DOI: 10.1093/bioinformatics/btaa175] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/25/2019] [Revised: 02/10/2020] [Accepted: 03/12/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION Multiple sequence alignment (MSA) is an important and challenging problem in computational biology. Most existing methods can only produce short multiple alignments in an acceptable time. When researchers need to align genome-size sequences, the process requires a huge amount of processing space and time. Accordingly, a method that can align genome-size sequences rapidly and precisely would have a great effect, especially on the analysis of very long alignments. Herein, we propose an efficient method, called FAME, which vertically divides sequences at the places where they have common areas; these are then arranged in consecutive order. The common areas are then shifted and placed under each other, and the subsequences between them are aligned using any existing MSA tool. RESULTS The results demonstrate that the combination of FAME with MSA methods, deploying minimizers, can be executed on a personal computer and finely aligns long sequences with a much higher sum-of-pair (SP) score than the standalone MSA tools. As we select genomic datasets of greater length, the SP score of the combined methods gradually improves. The calculated computational complexity of the methods supports the results, in that combining FAME and the MSA tools leads to at least four times faster execution on the datasets. AVAILABILITY AND IMPLEMENTATION The source code and all datasets and run parameters are freely accessible at http://github.com/naznoosh/msa. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
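The sum-of-pair (SP) score used as the quality measure above can be sketched as follows. The abstract does not state the scoring parameters, so the match, mismatch, and gap values here are assumptions chosen only for illustration, as is the function name; the structure (summing a pairwise score over every pair of rows in every column) is the standard definition.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a multiple alignment given as equal-length strings.

    The scoring values are illustrative assumptions, not those used by FAME.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):              # walk the alignment column-wise
        for x, y in combinations(column, 2):    # every pair of rows
            if x == "-" and y == "-":
                continue                        # gap-gap pairs are not scored
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total
```

For example, `sum_of_pairs(["ACGT", "A-GT", "ACCT"])` scores each of the four columns over its three row pairs and adds the results.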
Affiliation(s)
- Etminan Naznooshsadat
- Department of Computer Engineering, Shiraz Branch, Islamic Azad University, Shiraz, Iran
- Parvinnia Elham
- Department of Computer Engineering, Shiraz Branch, Islamic Azad University, Shiraz, Iran
- Sharifi-Zarchi Ali
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
6
Bekhouche S, Mohamed Ben Ali Y. Feature Selection in GPCR Classification Using BAT Algorithm. International Journal of Computational Intelligence and Applications 2020. [DOI: 10.1142/s1469026820500066] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
G-protein-coupled receptors (GPCRs) form a large family of membrane proteins, and some of them still remain orphans. Predicting GPCR functions is a challenging task; it depends closely on their classification, which requires a digital representation of each protein chain as an attribute vector. A major problem of GPCR databases is their great number of features, which can produce a combinatorial explosion and increase the complexity of classification algorithms. Feature selection techniques are used to deal with this problem by reducing the dimension of the feature space while keeping the most relevant features. In this paper, we propose to use the BAT algorithm to extract the pertinent features and improve the classification results. We compared the results obtained by our system with two other bio-inspired algorithms, Evolutionary Algorithm and PSO search. The quality measures used for comparison are Error Rate, Accuracy, MCC and F-measure. Experimental results indicate that our system is more efficient.
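The comparison measures named in this abstract can all be derived from a confusion matrix. The sketch below assumes a binary (one class versus the rest) setting and takes the F-measure to be the balanced F1 score; GPCR classification is multi-class, so in practice such values would be computed per class and averaged. The function name and dictionary keys are illustrative.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Error rate, accuracy, MCC and F1 from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "error_rate": 1.0 - accuracy,
        "accuracy": accuracy,
        "mcc": mcc,
        "f_measure": f_measure,
    }
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it more informative when class sizes are unbalanced.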
Affiliation(s)
- Safia Bekhouche
- Department of Computer Science, Badji Mokhtar University, Annaba 23000, Algeria
- Yamina Mohamed Ben Ali
- Laboratory of Research in Informatics (LRI), Badji Mokhtar University, Annaba 23000, Algeria
7
Ipoutcha T, Tsarmpopoulos I, Talenton V, Gaspin C, Moisan A, Walker CA, Brownlie J, Blanchard A, Thebault P, Sirand-Pugnet P. Multiple Origins and Specific Evolution of CRISPR/Cas9 Systems in Minimal Bacteria (Mollicutes). Front Microbiol 2019; 10:2701. [PMID: 31824468 PMCID: PMC6882279 DOI: 10.3389/fmicb.2019.02701] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 09/26/2019] [Accepted: 11/07/2019] [Indexed: 12/13/2022]
Abstract
CRISPR/Cas systems provide adaptive defense mechanisms against invading nucleic acids in prokaryotes. Because of its interest as a genetic tool, the Type II CRISPR/Cas9 system from Streptococcus pyogenes has been extensively studied. It includes the Cas9 endonuclease that is dependent on a dual-guide RNA made of a tracrRNA and a crRNA. Target recognition relies on crRNA annealing and the presence of a protospacer adjacent motif (PAM). Mollicutes are currently the bacteria with the smallest genome in which CRISPR/Cas systems have been reported. Many of them are pathogenic to humans and animals (mycoplasmas and ureaplasmas) or plants (phytoplasmas and some spiroplasmas). A global survey was conducted to identify and compare CRISPR/Cas systems found in the genome of these minimal bacteria. Complete or degraded systems classified as Type II-A and less frequently as Type II-C were found in the genome of 21 out of 52 representative mollicutes species. Phylogenetic reconstructions predicted a common origin of all CRISPR/Cas systems of mycoplasmas and at least two origins were suggested for spiroplasmas systems. Cas9 in mollicutes were structurally related to the S. aureus Cas9 except the PI domain involved in the interaction with the PAM, suggesting various PAM might be recognized by Cas9 of different mollicutes. Structure of the predicted crRNA/tracrRNA hybrids was conserved and showed typical stem-loop structures pairing the Direct Repeat part of crRNAs with the 5' region of tracrRNAs. Most mollicutes crRNA/tracrRNAs showed G + C% significantly higher than the genome, suggesting a selective pressure for maintaining stability of these secondary structures. Examples of CRISPR spacers matching with mollicutes phages were found, including the textbook case of Mycoplasma cynos strain C142 having no prophage sequence but a CRISPR/Cas system with spacers targeting prophage sequences that were found in the genome of another M. cynos strain that is devoid of a CRISPR system. 
Despite their small genome size, mollicutes have maintained protective means against invading DNAs, including restriction/modification and CRISPR/Cas systems. The apparent lack of CRISPR/Cas systems in several groups of species including main pathogens of humans, ruminants, and plants suggests different evolutionary routes or a lower risk of phage infection in specific ecological niches.
Affiliation(s)
- Thomas Ipoutcha
- INRA, UMR 1332 de Biologie du Fruit et Pathologie, Villenave d'Ornon, France; Université de Bordeaux, UMR 1332 de Biologie du Fruit et Pathologie, Villenave d'Ornon, France
- Iason Tsarmpopoulos
- INRA, UMR 1332 de Biologie du Fruit et Pathologie, Villenave d'Ornon, France; Université de Bordeaux, UMR 1332 de Biologie du Fruit et Pathologie, Villenave d'Ornon, France
- Vincent Talenton
- INRA, UMR 1332 de Biologie du Fruit et Pathologie, Villenave d'Ornon, France; Université de Bordeaux, UMR 1332 de Biologie du Fruit et Pathologie, Villenave d'Ornon, France
- Christine Gaspin
- INRA, Mathématiques et Informatique Appliquées de Toulouse, Université de Toulouse, Toulouse, France
- Annick Moisan
- INRA, Mathématiques et Informatique Appliquées de Toulouse, Université de Toulouse, Toulouse, France
- Caray A Walker
- School of Life Sciences, Anglia Ruskin University, Cambridge, United Kingdom
- Joe Brownlie
- Department of Pathobiology and Population Sciences, Royal Veterinary College, University of London, London, United Kingdom
- Alain Blanchard
- INRA, UMR 1332 de Biologie du Fruit et Pathologie, Villenave d'Ornon, France; Université de Bordeaux, UMR 1332 de Biologie du Fruit et Pathologie, Villenave d'Ornon, France
- Pascal Sirand-Pugnet
- INRA, UMR 1332 de Biologie du Fruit et Pathologie, Villenave d'Ornon, France; Université de Bordeaux, UMR 1332 de Biologie du Fruit et Pathologie, Villenave d'Ornon, France
8
Nakamura T, Yamada KD, Tomii K, Katoh K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics 2019; 34:2490-2492. [PMID: 29506019 PMCID: PMC6041967 DOI: 10.1093/bioinformatics/bty121] [Citation(s) in RCA: 612] [Impact Index Per Article: 102.0] [Received: 10/19/2017] [Accepted: 02/28/2018] [Indexed: 12/03/2022]
Abstract
Summary We report an update for the MAFFT multiple sequence alignment program to enable parallel calculation of large numbers of sequences. The G-INS-1 option of MAFFT was recently reported to have higher accuracy than other methods for large data, but this method has been impractical for most large-scale analyses, due to the requirement of large computational resources. We introduce a scalable variant, G-large-INS-1, which has equivalent accuracy to G-INS-1 and is applicable to 50 000 or more sequences. Availability and implementation This feature is available in MAFFT versions 7.355 or later at https://mafft.cbrc.jp/alignment/software/mpi.html. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tsukasa Nakamura
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan; Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
- Kazunori D Yamada
- Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan; Graduate School of Information Sciences, Tohoku University, Sendai, Japan
- Kentaro Tomii
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan; Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan; Biotechnology Research Institute for Drug Discovery (BRD), AIST, Tokyo, Japan; AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), Tokyo, Japan
- Kazutaka Katoh
- Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan; Research Institute for Microbial Diseases, Osaka University, Suita, Japan
9
González-Domínguez J, Bolón-Canedo V, Freire B, Touriño J. Parallel feature selection for distributed-memory clusters. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2019.01.050] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/27/2022]
10
Levy N, Bruneau JM, Le Rouzic E, Bonnard D, Le Strat F, Caravano A, Chevreuil F, Barbion J, Chasset S, Ledoussal B, Moreau F, Ruff M. Structural Basis for E. coli Penicillin Binding Protein (PBP) 2 Inhibition, a Platform for Drug Design. J Med Chem 2019; 62:4742-4754. [PMID: 30995398 DOI: 10.1021/acs.jmedchem.9b00338] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.8] [Indexed: 01/05/2023]
Abstract
Penicillin-binding proteins (PBPs) are the targets of the β-lactams, the most successful class of antibiotics ever developed against bacterial infections. Unfortunately, the worldwide and rapid spread of large spectrum β-lactam resistance genes such as carbapenemases is detrimental to the use of antibiotics in this class. New potent PBP inhibitors are needed, especially compounds that resist β-lactamase hydrolysis. Here we describe the structure of the E. coli PBP2 in its Apo form and upon its reaction with 2 diazabicyclo derivatives, avibactam and CPD4, a new potent PBP2 inhibitor. Examination of these structures shows that unlike avibactam, CPD4 can perform a hydrophobic stacking on Trp370 in the active site of E. coli PBP2. This result, together with sequence analysis, homology modeling, and SAR, allows us to propose CPD4 as potential starting scaffold to develop molecules active against a broad range of bacterial species at the top of the WHO priority list.
Affiliation(s)
- Nicolas Levy
- Mutabilis, 102 Avenue Gaston Roussel, 93230 Romainville, France; IGBMC, 1 Rue Laurent Fries, 67404 Illkirch, France
- Erwann Le Rouzic
- Mutabilis, 102 Avenue Gaston Roussel, 93230 Romainville, France
- Damien Bonnard
- Mutabilis, 102 Avenue Gaston Roussel, 93230 Romainville, France
- Audrey Caravano
- Mutabilis, 102 Avenue Gaston Roussel, 93230 Romainville, France
- Julien Barbion
- Mutabilis, 102 Avenue Gaston Roussel, 93230 Romainville, France
- Sophie Chasset
- Mutabilis, 102 Avenue Gaston Roussel, 93230 Romainville, France
- Benoît Ledoussal
- Mutabilis, 102 Avenue Gaston Roussel, 93230 Romainville, France
- François Moreau
- Mutabilis, 102 Avenue Gaston Roussel, 93230 Romainville, France
- Marc Ruff
- IGBMC, 1 Rue Laurent Fries, 67404 Illkirch, France
11
Features of a novel protein, rusticalin, from the ascidian Styela rustica reveal ancestral horizontal gene transfer event. Mob DNA 2019; 10:4. [PMID: 30675192 PMCID: PMC6339383 DOI: 10.1186/s13100-019-0146-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/04/2018] [Accepted: 01/02/2019] [Indexed: 12/18/2022]
Abstract
Background The transfer of genetic material from non-parent organisms is called horizontal gene transfer (HGT). One of the most conclusive cases of HGT in metazoans was previously described for the cellulose synthase gene in ascidians. Results In this study we identified a new protein, rusticalin, from the ascidian Styela rustica and presented evidence for its likely origin by HGT. Discernible homologues of rusticalin were found in placozoans, coral, and basal chordates. Rusticalin was predicted to consist of two distinct regions, an N-terminal domain and a C-terminal domain. The N-terminal domain comprises two cysteine-rich repeats and shows remote similarity to the tick carboxypeptidase inhibitor. The C-terminal domain shares significant sequence similarity with bacterial MD peptidases and bacteriophage A500 L-alanyl-D-glutamate peptidase. A possible transfer of the C-terminal domain by bacteriophage was confirmed by an analysis of noncoding sequences of the C. intestinalis rusticalin-like gene, which was found to contain a sequence similar to the bacteriophage A500 recombination site. Moreover, a sequence similar to the bacteriophage recombination site was found to be adjacent to the cellulose synthase catalytic subunit gene in the genome of Streptomyces sp., the donor of ascidian cellulose synthase. Conclusions The C-terminal domain of rusticalin and rusticalin-like proteins is likely to have been horizontally transferred by the bacteriophage A500. A common mechanism involving bacteriophage-mediated gene transfer can be proposed for at least two HGT events in ascidians.
12
Gonzalez-Dominguez J, Martin MJ. MPIGeneNet: Parallel Calculation of Gene Co-Expression Networks on Multicore Clusters. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:1732-1737. [PMID: 29028205 DOI: 10.1109/tcbb.2017.2761340] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 06/07/2023]
Abstract
In this work, we present MPIGeneNet, a parallel tool that applies Pearson's correlation and Random Matrix Theory to construct gene co-expression networks. It is based on the state-of-the-art sequential tool RMTGeneNet, which provides networks with high robustness and sensitivity at the expense of relatively long runtimes for large-scale input datasets. MPIGeneNet returns the same results as RMTGeneNet but improves the memory management, reduces the I/O cost, and accelerates the two most computationally demanding steps of co-expression network construction by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on two different systems using three typical input datasets shows that MPIGeneNet is significantly faster than RMTGeneNet. As an example, our tool is up to 175.41 times faster on a cluster with eight nodes, each one containing two 12-core Intel Haswell processors. The source code of MPIGeneNet, as well as a reference manual, are available at https://sourceforge.net/projects/mpigenenet/.
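The correlation step described above can be illustrated with a small sequential sketch: compute Pearson's correlation for every gene pair and keep the pairs whose absolute correlation reaches a cut-off. In the tool described, the cut-off is derived with Random Matrix Theory; here it is simply passed in as a parameter, which is an assumption of this sketch, as are the function names and the dictionary input format.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0                     # constant profile: treat as uncorrelated
    return cov / (norm_x * norm_y)


def coexpression_edges(expression, threshold):
    """Return (gene_a, gene_b, r) for every pair with |r| >= threshold.

    `expression` maps a gene name to its expression values across samples.
    The fixed `threshold` stands in for the RMT-derived cut-off.
    """
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges
```

The all-pairs loop is quadratic in the number of genes and each pair is independent of the others, which is what makes this step a natural candidate for distribution across processes and threads.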
13
González-Domínguez J, Expósito RR. ParBiBit: Parallel tool for binary biclustering on modern distributed-memory systems. PLoS One 2018; 13:e0194361. [PMID: 29608567 PMCID: PMC5880350 DOI: 10.1371/journal.pone.0194361] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 12/20/2017] [Accepted: 03/01/2018] [Indexed: 11/18/2022]
Abstract
Biclustering techniques are gaining attention in the analysis of large-scale datasets as they identify two-dimensional submatrices where both rows and columns are correlated. In this work we present ParBiBit, a parallel tool to accelerate the search for interesting biclusters in binary datasets, which are very popular in different fields such as genetics, marketing or text mining. It is based on the state-of-the-art sequential Java tool BiBit, which has been proved accurate by several studies, especially in scenarios that result in many large biclusters. ParBiBit uses the same methodology as BiBit (grouping the binary information into patterns) and provides the same results. Nevertheless, our tool significantly improves performance thanks to an efficient implementation based on C++11 that includes support for threads and MPI processes in order to exploit the compute capabilities of modern distributed-memory systems, which provide several multicore CPU nodes interconnected through a network. Our performance evaluation with 18 representative input datasets on two different eight-node systems shows that our tool is significantly faster than the original BiBit. Source code in C++ and MPI running on Linux systems as well as a reference manual are available at https://sourceforge.net/projects/parbibit/.
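The pattern-grouping idea mentioned in the abstract can be sketched in a simplified, sequential form: each binary row is encoded as a bit word, the bitwise AND of two rows yields a candidate column pattern, and every row that contains that pattern joins the bicluster. This is a rough illustration under stated assumptions, not the ParBiBit or BiBit code; the function name, the parameter names `min_rows` and `min_cols`, and the output format are hypothetical.

```python
from itertools import combinations


def bibit_sketch(rows, min_rows=2, min_cols=2):
    """Simplified pattern-based biclustering of a binary matrix.

    `rows` is a list of equal-length lists of 0/1 values. Returns a list of
    (row_indices, column_indices) pairs, one per distinct pattern.
    """
    # Encode each row as an integer; the first column is the highest bit.
    encoded = [int("".join(map(str, row)), 2) for row in rows]
    seen = set()
    biclusters = []
    for i, j in combinations(range(len(encoded)), 2):
        pattern = encoded[i] & encoded[j]        # columns set in both rows
        if pattern in seen or bin(pattern).count("1") < min_cols:
            continue
        seen.add(pattern)
        # A row belongs to the bicluster if it contains the whole pattern.
        members = [k for k, word in enumerate(encoded)
                   if word & pattern == pattern]
        if len(members) >= min_rows:
            n_cols = len(rows[0])
            cols = [c for c in range(n_cols)
                    if (pattern >> (n_cols - 1 - c)) & 1]
            biclusters.append((members, cols))
    return biclusters
```

Because each pair of rows can be processed independently, the outer loop is the part that a threaded or MPI implementation would distribute.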
Affiliation(s)
- Roberto R. Expósito
- Grupo de Arquitectura de Computadores, Universidade da Coruña, A Coruña, Spain
14
Sablok G, Hayward RJ, Davey PA, Santos RP, Schliep M, Larkum A, Pernice M, Dolferus R, Ralph PJ. SeagrassDB: An open-source transcriptomics landscape for phylogenetically profiled seagrasses and aquatic plants. Sci Rep 2018; 8:2749. [PMID: 29426939 PMCID: PMC5807536 DOI: 10.1038/s41598-017-18782-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 06/12/2017] [Accepted: 12/11/2017] [Indexed: 12/04/2022]
Abstract
Seagrasses and aquatic plants are important clades of higher plants, significant for carbon sequestration and marine ecological restoration. They are valuable in the sense that they allow us to understand how plants have developed traits to adapt to high salinity and photosynthetically challenged environments. Here, we present a large-scale phylogenetically profiled transcriptomics repository covering seagrasses and aquatic plants. SeagrassDB encompasses a total of 1,052,262 unigenes with a minimum and maximum contig length of 8,831 bp and 16,705 bp respectively. SeagrassDB provides access to 34,455 transcription factors, 470,568 PFAM domains, 382,528 prosite models and 482,121 InterPro domains across 9 species. SeagrassDB allows for the comparative gene mining using BLAST-based approaches and subsequent unigenes sequence retrieval with associated features such as expression (FPKM values), gene ontologies, functional assignments, family level classification, Interpro domains, KEGG orthology (KO), transcription factors and prosite information. SeagrassDB is available to the scientific community for exploring the functional genic landscape of seagrass and aquatic plants at: http://115.146.91.129/index.php.
Affiliation(s)
- Gaurav Sablok
- Climate Change Cluster (C3), University of Technology Sydney, PO Box 123 Broadway, NSW 2007, Australia
- Regan J Hayward
- Climate Change Cluster (C3), University of Technology Sydney, PO Box 123 Broadway, NSW 2007, Australia
- Peter A Davey
- Climate Change Cluster (C3), University of Technology Sydney, PO Box 123 Broadway, NSW 2007, Australia
- Rosiane P Santos
- Laboratório de Recursos Genéticos, Universidade Federal de São João Del-Rei, Campus CTAN, São João Del Rei, Minas Gerais, 36307-352, Brazil
- Martin Schliep
- Climate Change Cluster (C3), University of Technology Sydney, PO Box 123 Broadway, NSW 2007, Australia
- Anthony Larkum
- Climate Change Cluster (C3), University of Technology Sydney, PO Box 123 Broadway, NSW 2007, Australia
- Mathieu Pernice
- Climate Change Cluster (C3), University of Technology Sydney, PO Box 123 Broadway, NSW 2007, Australia
- Rudy Dolferus
- CSIRO Agriculture and Food, GPO Box 1700, Canberra, ACT 2601, Australia
- Peter J Ralph
- Climate Change Cluster (C3), University of Technology Sydney, PO Box 123 Broadway, NSW 2007, Australia
15
Jończyk J, Malawska B, Bajda M. Hybrid approach to structure modeling of the histamine H3 receptor: Multi-level assessment as a tool for model verification. PLoS One 2017; 12:e0186108. [PMID: 28982153 PMCID: PMC5629032 DOI: 10.1371/journal.pone.0186108] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 06/05/2017] [Accepted: 09/25/2017] [Indexed: 12/18/2022]
Abstract
The crucial role of G-protein coupled receptors and the significant achievements associated with a better understanding of the spatial structure of known receptors in this family encouraged us to undertake a study on the histamine H3 receptor, whose crystal structure is still unresolved. The latest literature data and availability of different software enabled us to build homology models of higher accuracy than previously published ones. The new models are expected to be closer to crystal structures; and therefore, they are much more helpful in the design of potential ligands. In this article, we describe the generation of homology models with the use of diverse tools and a hybrid assessment. Our study incorporates a hybrid assessment connecting knowledge-based scoring algorithms with a two-step ligand-based docking procedure. Knowledge-based scoring employs probability theory for global energy minimum determination based on information about native amino acid conformation from a dataset of experimentally determined protein structures. For a two-step docking procedure two programs were applied: GOLD was used in the first step and Glide in the second. Hybrid approaches offer advantages by combining various theoretical methods in one modeling algorithm. The biggest advantage of hybrid methods is their intrinsic ability to self-update and self-refine when additional structural data are acquired. Moreover, the diversity of computational methods and structural data used in hybrid approaches for structure prediction limit inaccuracies resulting from theoretical approximations or fuzziness of experimental data. The results of docking to the new H3 receptor model allowed us to analyze ligand-receptor interactions for reference compounds.
Affiliation(s)
- Jakub Jończyk
- Department of Physicochemical Drug Analysis, Faculty of Pharmacy, Jagiellonian University Medical College, Krakow, Poland
- Barbara Malawska
- Department of Physicochemical Drug Analysis, Faculty of Pharmacy, Jagiellonian University Medical College, Krakow, Poland
- Marek Bajda
- Department of Physicochemical Drug Analysis, Faculty of Pharmacy, Jagiellonian University Medical College, Krakow, Poland