1
Cheng T, Chin PJ, Cha K, Petrick N, Mikailov M. Profiling the BLAST bioinformatics application for load balancing on high-performance computing clusters. BMC Bioinformatics 2022; 23:544. [PMID: 36526957 PMCID: PMC9758941 DOI: 10.1186/s12859-022-05029-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2022] [Accepted: 10/31/2022] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND The Basic Local Alignment Search Tool (BLAST) is a suite of commonly used algorithms for identifying matches between biological sequences. The user supplies a database file and query file of sequences for BLAST to find identical sequences between the two. The typical millions of database and query sequences make BLAST computationally challenging but also well suited for parallelization on high-performance computing clusters. The efficacy of parallelization depends on the data partitioning, where the optimal data partitioning relies on an accurate performance model. In previous studies, a BLAST job was sped up by 27 times by partitioning the database and query among thousands of processor nodes. However, the optimality of the partitioning method was not studied. Unlike BLAST performance models proposed in the literature that usually have problem size and hardware configuration as the only variables, the execution time of a BLAST job is a function of database size, query size, and hardware capability. In this work, the nucleotide BLAST application BLASTN was profiled using three methods: shell-level profiling with the Unix "time" command, code-level profiling with the built-in "profiler" module, and system-level profiling with the Unix "gprof" program. The runtimes were measured for six node types, using six different database files and 15 query files, on a heterogeneous HPC cluster with 500+ nodes. The empirical measurement data were fitted with quadratic functions to develop performance models that were used to guide the data parallelization for BLASTN jobs. RESULTS Profiling results showed that BLASTN contains more than 34,500 different functions, but a single function, RunMTBySplitDB, takes 99.12% of the total runtime. Among its 53 child functions, five core functions were identified to make up 92.12% of the overall BLASTN runtime. Based on the performance models, static load balancing algorithms can be applied to the BLASTN input data to minimize the runtime of the longest job on an HPC cluster. Four test cases being run on homogeneous and heterogeneous clusters were tested. Experiment results showed that the runtime can be reduced by 81% on a homogeneous cluster and by 20% on a heterogeneous cluster by re-distributing the workload. DISCUSSION Optimal data partitioning can improve BLASTN's overall runtime 5.4-fold in comparison with dividing the database and query into the same number of fragments. The proposed methodology can be used in the other applications in the BLAST+ suite or any other application as long as source code is available.
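The two ideas in this abstract, fitting a quadratic runtime model from measurements and then splitting the input so that the longest job finishes as early as possible, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single input-size variable (the paper models database size, query size, and hardware capability), a runtime that never decreases with size, and made-up measurements; all function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(sizes, runtimes):
    """Least-squares fit of runtime = a*size**2 + b*size + c via the normal equations."""
    s = [sum(x ** k for x in sizes) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(sizes, runtimes)) for k in range(3)]
    matrix = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    d = det3(matrix)
    solution = []
    for col in range(3):  # Cramer's rule for the unknowns (c, b, a)
        m = [row[:] for row in matrix]
        for r in range(3):
            m[r][col] = t[r]
        solution.append(det3(m) / d)
    c, b, a = solution
    return a, b, c


def predicted_runtime(model, size):
    a, b, c = model
    return a * size ** 2 + b * size + c


def max_size_within(model, budget, upper):
    """Largest size in [0, upper] whose predicted runtime fits in the time budget."""
    if predicted_runtime(model, 0.0) > budget:
        return 0.0
    if predicted_runtime(model, upper) <= budget:
        return upper
    lo, hi = 0.0, upper
    for _ in range(60):  # bisection; relies on the model being non-decreasing
        mid = (lo + hi) / 2
        if predicted_runtime(model, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def balance(models, total):
    """Static split of `total` work units across nodes, minimizing the longest job."""
    lo, hi = 0.0, max(predicted_runtime(m, total) for m in models)
    for _ in range(60):  # bisection on the common time budget
        budget = (lo + hi) / 2
        if sum(max_size_within(m, budget, total) for m in models) >= total:
            hi = budget
        else:
            lo = budget
    shares = [max_size_within(m, hi, total) for m in models]
    scale = total / sum(shares)
    return [share * scale for share in shares], hi


# Made-up measurements (input size, seconds) for two hypothetical node types.
fast_node = fit_quadratic([1, 2, 3, 4], [1.5, 4.0, 7.5, 12.0])
slow_node = fit_quadratic([1, 2, 3, 4], [6.0, 15.0, 28.0, 45.0])
shares, makespan = balance([fast_node, slow_node], total=10.0)
```

With an equal split, the slower node would dominate the total runtime; the balanced split gives it a smaller share so that both predicted runtimes meet at roughly the same finishing time.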
Affiliation(s)
- Trinity Cheng
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD 20993, USA (GRID grid.417587.8; ISNI 0000 0001 2243 3366); Department of Biomedical Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218, USA (GRID grid.21107.35; ISNI 0000 0001 2171 9311)
- Pei-Ju Chin
- Center for Biologics Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD 20993, USA (GRID grid.290496.0; ISNI 0000 0001 1945 2072)
- Kenny Cha
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD 20993, USA (GRID grid.417587.8; ISNI 0000 0001 2243 3366)
- Nicholas Petrick
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD 20993, USA (GRID grid.417587.8; ISNI 0000 0001 2243 3366)
- Mike Mikailov
- Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring, MD 20993, USA (GRID grid.417587.8; ISNI 0000 0001 2243 3366)
2
Yim WC, Swain ML, Ma D, An H, Bird KA, Curdie DD, Wang S, Ham HD, Luzuriaga-Neira A, Kirkwood JS, Hur M, Solomon JKQ, Harper JF, Kosma DK, Alvarez-Ponce D, Cushman JC, Edger PP, Mason AS, Pires JC, Tang H, Zhang X. The final piece of the Triangle of U: Evolution of the tetraploid Brassica carinata genome. Plant Cell 2022; 34:4143-4172. [PMID: 35961044 PMCID: PMC9614464 DOI: 10.1093/plcell/koac249] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 01/03/2022] [Accepted: 06/24/2022] [Indexed: 05/05/2023]
Abstract
Ethiopian mustard (Brassica carinata) is an ancient crop with remarkable stress resilience and a desirable seed fatty acid profile for biofuel uses. Brassica carinata is one of six Brassica species that share three major genomes from three diploid species (AA, BB, and CC) that spontaneously hybridized in a pairwise manner to form three allotetraploid species (AABB, AACC, and BBCC). Of the genomes of these species, that of B. carinata is the least understood. Here, we report a chromosome scale 1.31-Gbp genome assembly with 156.9-fold sequencing coverage for B. carinata, completing the reference genomes comprising the classic Triangle of U, a classical theory of the evolutionary relationships among these six species. Our assembly provides insights into the hybridization event that led to the current B. carinata genome and the genomic features that gave rise to the superior agronomic traits of B. carinata. Notably, we identified an expansion of transcription factor networks and agronomically important gene families. Completion of the Triangle of U comparative genomics platform has allowed us to examine the dynamics of polyploid evolution and the role of subgenome dominance in the domestication and continuing agronomic improvement of B. carinata and other Brassica species.
Affiliation(s)
- Dongna Ma
- Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Key Laboratory of Ministry of Education for Genetics, Breeding and Multiple Utilization of Crops, Key Laboratory of National Forestry and Grassland Administration for Orchid Conservation and Utilization, Fujian Agriculture and Forestry University, Fuzhou, China
- Hong An
- Division of Biological Sciences, University of Missouri, Columbia, Missouri 65201, USA
- Kevin A Bird
- Department of Horticulture, Michigan State University, East Lansing, Michigan 48824, USA
- David D Curdie
- Department of Biochemistry and Molecular Biology, University of Nevada, Reno, Nevada 89557, USA
- Samuel Wang
- Department of Biochemistry and Molecular Biology, University of Nevada, Reno, Nevada 89557, USA
- Hyun Don Ham
- Department of Biochemistry and Molecular Biology, University of Nevada, Reno, Nevada 89557, USA
- Jay S Kirkwood
- Metabolomics Core Facility, Institute for Integrative Genome Biology, University of California, Riverside, California 92521, USA
- Manhoi Hur
- Metabolomics Core Facility, Institute for Integrative Genome Biology, University of California, Riverside, California 92521, USA
- Juan K Q Solomon
- Department of Agriculture, Veterinary & Rangeland Sciences, University of Nevada, Reno, Nevada 89557, USA
- Jeffrey F Harper
- Department of Biochemistry and Molecular Biology, University of Nevada, Reno, Nevada 89557, USA
- Dylan K Kosma
- Department of Biochemistry and Molecular Biology, University of Nevada, Reno, Nevada 89557, USA
- John C Cushman
- Department of Biochemistry and Molecular Biology, University of Nevada, Reno, Nevada 89557, USA
- Patrick P Edger
- Department of Horticulture, Michigan State University, East Lansing, Michigan 48824, USA
- Annaliese S Mason
- Plant Breeding Department, INRES, The University of Bonn, Bonn 53115, Germany
- J Chris Pires
- Division of Biological Sciences, Bond Life Sciences Center, University of Missouri, Columbia, Missouri 65211, USA
- Haibao Tang
- Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Key Laboratory of Ministry of Education for Genetics, Breeding and Multiple Utilization of Crops, Key Laboratory of National Forestry and Grassland Administration for Orchid Conservation and Utilization, Fujian Agriculture and Forestry University, Fuzhou, China
- Xingtan Zhang
- Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Key Laboratory of Ministry of Education for Genetics, Breeding and Multiple Utilization of Crops, Key Laboratory of National Forestry and Grassland Administration for Orchid Conservation and Utilization, Fujian Agriculture and Forestry University, Fuzhou, China
3
Guerrero-Araya E, Muñoz M, Rodríguez C, Paredes-Sabja D. FastMLST: A Multi-core Tool for Multilocus Sequence Typing of Draft Genome Assemblies. Bioinform Biol Insights 2021; 15:11779322211059238. [PMID: 34866905 PMCID: PMC8637782 DOI: 10.1177/11779322211059238] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 06/04/2021] [Accepted: 10/19/2021] [Indexed: 11/21/2022] Open
Abstract
Multilocus Sequence Typing (MLST) is a precise microbial typing approach at the
intra-species level for epidemiologic and evolutionary purposes. It operates by
assigning a sequence type (ST) identifier to each specimen, based on a
combination of alleles of multiple housekeeping genes included in a defined
scheme. The use of MLST has multiplied due to the availability of large numbers
of genomic sequences and epidemiologic data in public repositories. However,
data processing speed has become problematic due to the massive size of modern
datasets. Here, we present FastMLST, a tool that is designed to perform PubMLST
searches using BLASTn and a divide-and-conquer approach that processes each
genome assembly in parallel. The output offered by FastMLST includes a table
with the ST, allelic profile, and clonal complex or clade (when available),
detected for a query, as well as a multi-FASTA file or a series of FASTA files
with the concatenated or single allele sequences detected, respectively.
FastMLST was validated with 91 different species, with a wide range of
guanine-cytosine content (%GC), genome sizes, and fragmentation levels, and a
speed test was performed on 3 datasets with varying genome sizes. Compared with
other tools such as mlst, CGE/MLST, MLSTar, and PubMLST, FastMLST takes
advantage of multiple processors to simultaneously type up to 28 000 genomes in
less than 10 minutes, reducing processing times by at least 3-fold with 100%
concordance to PubMLST, if contaminated genomes are excluded from the analysis.
The source code, installation instructions, and documentation of FastMLST are
available at https://github.com/EnzoAndree/FastMLST
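The divide-and-conquer idea described in this abstract, typing each genome assembly independently and in parallel, can be sketched as follows. This is a toy illustration and not FastMLST's code: exact substring matching stands in for the BLASTn search, the two-locus scheme and its allele and ST numbers are invented, and a thread pool is used only to keep the example self-contained (CPU-bound work would more likely use processes).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical two-locus scheme: locus -> {allele sequence: allele number}.
ALLELES = {
    "adk": {"ACGT": 1, "ACGA": 2},
    "gyrB": {"TTGA": 1, "TTGC": 2},
}
# Hypothetical allelic profiles (adk, gyrB) -> sequence type (ST).
PROFILES = {(1, 1): 10, (2, 1): 11}


def type_genome(item):
    """Type one assembly: find an allele per locus, then look up the ST."""
    name, contigs = item
    profile = []
    for locus, known in ALLELES.items():
        # Stand-in for the sequence search: exact match against any contig.
        hit = next(
            (number for seq, number in known.items()
             if any(seq in contig for contig in contigs)),
            None,
        )
        profile.append(hit)
    profile = tuple(profile)
    return name, profile, PROFILES.get(profile)  # ST is None if the profile is unknown


def type_all(genomes, workers=4):
    """Process every assembly as an independent task; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(type_genome, genomes.items()))


genomes = {
    "g1": ["NNACGTNN", "GGTTGAGG"],
    "g2": ["CCACGACCTTGACC"],
    "g3": ["ACGA", "TTGC"],
}
results = type_all(genomes)
```

Because no assembly depends on another, the work scales with the number of available workers, which is the property the abstract attributes to the tool's speed-up.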
Affiliation(s)
- Enzo Guerrero-Araya
- Microbiota-Host Interactions and Clostridia Research Group, Facultad de Ciencias de la Vida, Universidad Andrés Bello, Santiago, Chile; ANID, Millennium Science Initiative Program, Millennium Nucleus in the Biology of the Intestinal Microbiota, Santiago, Chile
- Marina Muñoz
- ANID, Millennium Science Initiative Program, Millennium Nucleus in the Biology of the Intestinal Microbiota, Santiago, Chile; Centro de Investigaciones en Microbiología y Biotecnología-UR (CIMBIUR), Facultad de Ciencias Naturales, Universidad del Rosario, Bogotá, Colombia
- César Rodríguez
- Facultad de Microbiología and Centro de Investigación en Enfermedades Tropicales (CIET), Universidad de Costa Rica, San José, Costa Rica
- Daniel Paredes-Sabja
- ANID, Millennium Science Initiative Program, Millennium Nucleus in the Biology of the Intestinal Microbiota, Santiago, Chile; Department of Biology, Texas A&M University, College Station, TX, USA
4
Franco‐Sierra ND, Díaz‐Nieto JF. Rapid mitochondrial genome sequencing based on Oxford Nanopore Sequencing and a proxy for vertebrate species identification. Ecol Evol 2020; 10:3544-3560. [PMID: 32274008 PMCID: PMC7141017 DOI: 10.1002/ece3.6151] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 10/23/2019] [Revised: 02/09/2020] [Accepted: 02/12/2020] [Indexed: 02/06/2023] Open
Abstract
Molecular information is crucial for species identification when facing challenging morphology-based specimen identifications. The use of DNA barcodes partially solves this problem, but in some cases when PCR is not an option (i.e., primers are not available, problems in reaction standardization), amplification-free approaches could be an optimal alternative. Recent advances in DNA sequencing, like the MinION device from Oxford Nanopore Technologies (ONT), allow to obtain genomic data with low laboratory and technical requirements, and at a relatively low cost. In this study, we explore ONT sequencing for molecular species identification from a total DNA sample obtained from a neotropical rodent and we also test the technology for complete mitochondrial genome reconstruction via genome skimming. We were able to obtain "de novo" the complete mitogenome of a specimen from the genus Melanomys (Cricetidae: Sigmodontinae) with average depth coverage of 78X using ONT-only data and by combining multiple assembly routines. Our pipeline for an automated species identification was able to identify the sample using unassembled sequence data (raw) in a reasonable computing time, which was substantially reduced when a priori information related to the organism identity was known. Our findings suggest ONT sequencing as a suitable candidate to solve species identification problems in metazoan nonmodel organisms and generate complete mtDNA datasets.
Affiliation(s)
- Nicolás D. Franco‐Sierra
- Grupo de investigación en Biodiversidad, Evolución y Conservación (BEC), Departamento de Ciencias Biológicas, Escuela de Ciencias, Universidad EAFIT, Medellín, Colombia
- Juan F. Díaz‐Nieto
- Grupo de investigación en Biodiversidad, Evolución y Conservación (BEC), Departamento de Ciencias Biológicas, Escuela de Ciencias, Universidad EAFIT, Medellín, Colombia
5
Shirshikov FV, Pekov YA, Miroshnikov KA. MorphoCatcher: a multiple-alignment based web tool for target selection and designing taxon-specific primers in the loop-mediated isothermal amplification method. PeerJ 2019; 7:e6801. [PMID: 31086739 PMCID: PMC6487805 DOI: 10.7717/peerj.6801] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 12/04/2018] [Accepted: 03/18/2019] [Indexed: 11/20/2022] Open
Abstract
BACKGROUND Advantages of loop-mediated isothermal amplification in molecular diagnostics allow to consider the method as a promising technology of nucleic acid detection in agriculture and medicine. A bioinformatics tool that provides rapid screening and selection of target nucleotide sequences with subsequent taxon-specific primer design toward polymorphic orthologous genes, not only unique or conserved common regions of genome, would contribute to the development of more specific and sensitive diagnostic assays. However, considering features of the original software for primer selection, also known as the PrimerExplorer (Eiken Chemical Co. LTD, Tokyo, Japan), the taxon-specific primer design using multiple sequence alignments of orthologs or even viral genomes with conservative architecture is still complicated. FINDINGS Here, MorphoCatcher is introduced as a fast and simple web plugin for PrimerExplorer with a clear interface. It enables an execution of multiple-alignment based search of taxon-specific mutations, visual screening and selection of target sequences, and easy-to-start specific primer design using the PrimerExplorer software. The combination of MorphoCatcher and PrimerExplorer allows to perform processing of the multiple alignments of orthologs for informative sliding-window plot analysis, which is used to identify the sequence regions with a high density of taxon-specific mutations and cover them by the primer ends for better specificity of amplification. CONCLUSIONS We hope that this new bioinformatics tool developed for target selection and taxon-specific primer design, called the MorphoCatcher, will gain more popularity of the loop-mediated isothermal amplification method for molecular diagnostics community. MorphoCatcher is a simple web plugin tool for the PrimerExplorer software which is freely available only for non-commercial and academic users at http://morphocatcher.ru.
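The sliding-window analysis mentioned in this abstract can be illustrated with a short sketch. It is not MorphoCatcher's implementation, and the abstract does not define a taxon-specific mutation; the sketch assumes an alignment column counts as taxon-specific when all target-taxon sequences share one non-gap base that no other sequence carries at that column. The function names and the toy alignment are hypothetical.

```python
def taxon_specific_positions(target, others):
    """Flag alignment columns where the target taxon has a private, fixed base.

    Assumed definition: every target sequence shares one non-gap character
    and none of the other sequences has that character in the same column.
    """
    flags = []
    for col in range(len(target[0])):
        target_bases = {seq[col] for seq in target}
        other_bases = {seq[col] for seq in others}
        flags.append(
            len(target_bases) == 1
            and "-" not in target_bases
            and target_bases.isdisjoint(other_bases)
        )
    return flags


def sliding_density(flags, window):
    """Number of flagged columns in each window of the given width (step 1)."""
    running = sum(flags[:window])
    counts = [running]
    for i in range(window, len(flags)):
        running += flags[i] - flags[i - window]  # add entering column, drop leaving one
        counts.append(running)
    return counts


# Toy alignment: two target-taxon sequences and two sequences from other taxa.
target = ["ACGTACGT", "ACGTACGT"]
others = ["ACGAACCA", "ACGAACCA"]
flags = taxon_specific_positions(target, others)
density = sliding_density(flags, window=3)
```

Windows with the highest counts mark the regions the abstract describes as candidates for placing primer ends, since mismatches there distinguish the target taxon from the others.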
Affiliation(s)
- Fedor V Shirshikov
- Shemyakin & Ovchinnikov Institute of Bioorganic Chemistry RAS, Moscow, Russia
- Yuri A Pekov
- Lomonosov Moscow State University, Moscow, Russia