1
Dong T, Wang Y, Qi C, Fan W, Xie J, Chen H, Zhou H, Han X. Sequencing Methods to Study the Microbiome with Antibiotic Resistance Genes in Patients with Pulmonary Infections. J Microbiol Biotechnol 2024; 34:1617-1626. [PMID: 39113195; PMCID: PMC11380506; DOI: 10.4014/jmb.2402.02004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2024] [Revised: 05/20/2024] [Accepted: 05/29/2024] [Indexed: 08/29/2024]
Abstract
Various antibiotic-resistant bacteria (ARB) are known to induce repeated pulmonary infections and increase morbidity and mortality. A thorough knowledge of antibiotic resistance is imperative for clinical practice to treat resistant pulmonary infections. In this study, we applied a reads-based method and an assembly-based method to metagenomic next-generation sequencing (mNGS) data to reveal the spectra of ARB and corresponding antibiotic resistance genes (ARGs) in samples from patients with pulmonary infections. A total of 151 clinical samples from 144 patients with pulmonary infections were collected for retrospective analysis. The ARB and ARG detection performance of the reads-based and assembly-based methods was compared with the culture method and antibiotic susceptibility testing (AST), respectively. In addition, ARGs and their attribution to common ARB were analyzed by the two methods. The comparison showed that the assembly-based method could help confirm pathogens detected by the reads-based method as true ARB and improve the predictive capability (46% vs. 13%). ARG-ARB network analysis revealed that the assembly-based method could help establish clear ARG-bacteria attribution, and 101 ARGs were detected by both methods. Twenty-five ARB were obtained by both methods, of which the most predominant ARB and their ARGs in the samples of pulmonary infections were Acinetobacter baumannii (ade), Pseudomonas aeruginosa (mex), Klebsiella pneumoniae (emr), and Stenotrophomonas maltophilia (sme). Collectively, our findings demonstrated that the assembly-based method could supplement the reads-based method and uncovered pulmonary infection-associated ARB and ARGs as potential antibiotic treatment targets.
Affiliation(s)
- Tingyan Dong
  - Integrated Diagnostic Centre for Infectious Diseases, Guangzhou Huayin Medical Laboratory Center, Guangzhou, P.R. China
  - Immunology and Reproduction Biology Laboratory & State Key Laboratory of Analytical Chemistry for Life Sciences, Medical School, Nanjing University, Nanjing, P.R. China
- Yongsi Wang
  - Immunology and Reproduction Biology Laboratory & State Key Laboratory of Analytical Chemistry for Life Sciences, Medical School, Nanjing University, Nanjing, P.R. China
- Chunxia Qi
  - Department of Hospital Infection Management, NanFang Hospital, Southern Medical University, Guangzhou, P.R. China
- Wentao Fan
  - Immunology and Reproduction Biology Laboratory & State Key Laboratory of Analytical Chemistry for Life Sciences, Medical School, Nanjing University, Nanjing, P.R. China
- Junting Xie
  - Immunology and Reproduction Biology Laboratory & State Key Laboratory of Analytical Chemistry for Life Sciences, Medical School, Nanjing University, Nanjing, P.R. China
- Haitao Chen
  - Immunology and Reproduction Biology Laboratory & State Key Laboratory of Analytical Chemistry for Life Sciences, Medical School, Nanjing University, Nanjing, P.R. China
- Hao Zhou
  - Department of Hospital Infection Management, NanFang Hospital, Southern Medical University, Guangzhou, P.R. China
- Xiaodong Han
  - Immunology and Reproduction Biology Laboratory & State Key Laboratory of Analytical Chemistry for Life Sciences, Medical School, Nanjing University, Nanjing, P.R. China
  - Jiangsu Key Laboratory of Molecular Medicine, Nanjing University, Nanjing, P.R. China
2
Chen Y, Huang JH, Sun Y, Zhang Y, Li Y, Xu X. Haplotype-resolved assembly of diploid and polyploid genomes using quantum computing. Cell Reports Methods 2024; 4:100754. [PMID: 38614089; PMCID: PMC11133727; DOI: 10.1016/j.crmeth.2024.100754] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2023] [Revised: 01/03/2024] [Accepted: 03/20/2024] [Indexed: 04/15/2024]
Abstract
Precision medicine's emphasis on individual genetic variants highlights the importance of haplotype-resolved assembly, a computational challenge in bioinformatics given its combinatorial nature. While classical algorithms have made strides in addressing this issue, the potential of quantum computing remains largely untapped. Here, we present the vehicle routing problem (VRP) assembler: an approach that transforms this task into a vehicle routing problem, an optimization formulation solvable on a quantum computer. We demonstrate its potential and feasibility through a proof of concept on short synthetic diploid and triploid genomes using a D-Wave quantum annealer. To tackle larger-scale assembly problems, we integrate the VRP assembler with Google's OR-Tools, achieving a haplotype-resolved local assembly across the human major histocompatibility complex (MHC) region. Our results show encouraging performance compared to Hifiasm with phasing accuracy approaching the theoretical limit, underscoring the promising future of quantum computing in bioinformatics.
Affiliation(s)
- Yibo Chen
  - BGI Research, Shenzhen 518083, China
- Yuhui Sun
  - BGI Research, Shenzhen 518083, China
- Yong Zhang
  - BGI Research, Wuhan 430047, China; Guangdong Bigdata Engineering Technology Research Center for Life Sciences, BGI Research, Shenzhen 518083, China
- Yuxiang Li
  - BGI Research, Wuhan 430047, China; Guangdong Bigdata Engineering Technology Research Center for Life Sciences, BGI Research, Shenzhen 518083, China
- Xun Xu
  - BGI Research, Shenzhen 518083, China; BGI Research, Wuhan 430047, China
3
Goussarov G, Mysara M, Vandamme P, Van Houdt R. Introduction to the principles and methods underlying the recovery of metagenome-assembled genomes from metagenomic data. Microbiologyopen 2022; 11:e1298. [PMID: 35765182; PMCID: PMC9179125; DOI: 10.1002/mbo3.1298] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 02/23/2022] [Revised: 05/19/2022] [Accepted: 05/19/2022] [Indexed: 11/18/2022]
Abstract
The rise of metagenomics offers a leap forward for understanding the genetic diversity of microorganisms in many different complex environments by providing a platform that can identify potentially unlimited numbers of known and novel microorganisms. As such, it is impossible to imagine new major initiatives without metagenomics. Nevertheless, it represents a relatively new discipline with various levels of complexity and demands on bioinformatics. The underlying principles and methods used in metagenomics are often seen as common knowledge and often not detailed or fragmented. Therefore, we reviewed these to guide microbiologists in taking the first steps into metagenomics. We specifically focus on a workflow aimed at reconstructing individual genomes, that is, metagenome-assembled genomes, integrating DNA sequencing, assembly, binning, identification and annotation.
Affiliation(s)
- Gleb Goussarov
  - Microbiology Unit, Belgian Nuclear Research Centre (SCK CEN), Mol, Belgium
  - Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Faculty of Sciences, Ghent University, Ghent, Belgium
- Mohamed Mysara
  - Microbiology Unit, Belgian Nuclear Research Centre (SCK CEN), Mol, Belgium
- Peter Vandamme
  - Laboratory of Microbiology and BCCM/LMG Bacteria Collection, Faculty of Sciences, Ghent University, Ghent, Belgium
- Rob Van Houdt
  - Microbiology Unit, Belgian Nuclear Research Centre (SCK CEN), Mol, Belgium
4
Bhat GR, Sethi I, Rah B, Kumar R, Afroze D. Innovative in Silico Approaches for Characterization of Genes and Proteins. Front Genet 2022; 13:865182. [PMID: 35664302; PMCID: PMC9159363; DOI: 10.3389/fgene.2022.865182] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 01/29/2022] [Accepted: 04/11/2022] [Indexed: 11/13/2022]
Abstract
Bioinformatics is an amalgamation of biology, mathematics and computer science. It is a science that gathers molecular information from biology and applies informatics techniques to that information to understand and organize the data in a useful manner. With the help of bioinformatics, the experimental data generated are stored in several databases available online, such as nucleotide databases, protein databases, GenBank and others. The data stored in these databases are used as a reference for experimental evaluation and validation. To date, several online tools have been developed to analyze genomic, transcriptomic, proteomic, epigenomic and metabolomic data. Some of them include Human Splicing Finder (HSF), Exonic Splicing Enhancer Mutation taster, and others. A number of SNPs are observed in non-coding, intronic regions and play a role in the regulation of genes, which may or may not directly affect protein expression. Many mutations are thought to influence the splicing mechanism by affecting existing splice sites or creating new sites. HSF was developed to predict the effect of a mutation (SNP) on the splicing mechanism/signal. Thus, the tool is helpful in predicting the effect of mutations on splicing signals and can provide data for a better understanding of intronic mutations that can be further validated experimentally. Additionally, rapid advancement in proteomics has steered researchers to organize the study of protein structure, function, relationships, and dynamics in space and time. Thus, the effective integration of all of these technological interventions will eventually lead to next-generation systems biology, which will provide valuable biological insights in research, diagnostics, therapeutics and the development of personalized medicine.
Affiliation(s)
- Gh. Rasool Bhat
  - Advanced Centre for Human Genetics, Sher-I-Kashmir Institute of Medical Sciences, Soura, India
- Itty Sethi
  - Institute of Human Genetics, University of Jammu, Jammu, India
- Bilal Rah
  - Advanced Centre for Human Genetics, Sher-I-Kashmir Institute of Medical Sciences, Soura, India
- Rakesh Kumar
  - School of Biotechnology, Shri Mata Vaishno Devi University, Katra, India
- Dil Afroze
  - Advanced Centre for Human Genetics, Sher-I-Kashmir Institute of Medical Sciences, Soura, India
5
Akoniyon OP, Adewumi TS, Maharaj L, Oyegoke OO, Roux A, Adeleke MA, Maharaj R, Okpeku M. Whole Genome Sequencing Contributions and Challenges in Disease Reduction Focused on Malaria. Biology 2022; 11:587. [PMID: 35453786; PMCID: PMC9027812; DOI: 10.3390/biology11040587] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 03/13/2022] [Revised: 03/31/2022] [Accepted: 04/01/2022] [Indexed: 12/11/2022]
Abstract
Malaria elimination remains an important goal that requires the adoption of sophisticated science and management strategies in the era of the COVID-19 pandemic. The advent of next generation sequencing (NGS) is making whole genome sequencing (WGS) a standard today in the field of life sciences, as PCR genotyping and targeted sequencing provide insufficient information compared to the whole genome. Thus, adapting WGS approaches to malaria parasites is pertinent to studying the epidemiology of the disease, as different regions are at different phases in their malaria elimination agenda. Therefore, this review highlights the applications of WGS in disease management and the challenges of WGS in controlling malaria parasites, and describes the roles of WGS in the pursuit of malaria reduction and elimination. WGS has had an invaluable impact on malaria research and has helped countries reach the elimination phase rapidly by providing the information needed to thwart transmission, pathology, and drug resistance. However, to eliminate malaria in sub-Saharan Africa (SSA), where malaria transmission is high, we recommend that WGS machines be readily available and affordable in the region.
Affiliation(s)
- Olusegun Philip Akoniyon
  - Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
- Taiye Samson Adewumi
  - Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
- Leah Maharaj
  - Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
- Olukunle Olugbenle Oyegoke
  - Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
- Alexandra Roux
  - Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
- Matthew A. Adeleke
  - Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
- Rajendra Maharaj
  - Office of Malaria Research, South African Medical Research Council, Cape Town 7505, South Africa
- Moses Okpeku
  - Discipline of Genetics, School of Life Sciences, University of KwaZulu-Natal, Westville Campus, Durban 4041, South Africa
6
Manual Annotation Studio (MAS): a collaborative platform for manual functional annotation of viral and microbial genomes. BMC Genomics 2021; 22:733. [PMID: 34627149; PMCID: PMC8501643; DOI: 10.1186/s12864-021-08029-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/23/2021] [Accepted: 09/22/2021] [Indexed: 11/10/2022]
Abstract
Background: Functional genome annotation is the process of labelling functional genomic regions with descriptive information. Manual curation can produce higher quality genome annotations than fully automated methods. Manual annotation efforts are time-consuming and complex; however, software can help reduce these drawbacks. Results: We created Manual Annotation Studio (MAS) to improve the efficiency of manual functional annotation of prokaryotic and viral genomes. MAS allows users to upload unannotated genomes, provides an interface to edit and upload annotations, tracks annotation history and progress, and saves data to a relational database. MAS provides users with pertinent information through a simple point-and-click interface to execute and visualize results for multiple homology search tools (blastp, rpsblast, and HHsearch) against multiple databases (Swiss-Prot, nr, CDD, PDB, and an internally generated database). MAS was designed to accept connections over the local area network (LAN) of a lab or organization so multiple users can access it simultaneously. MAS can take advantage of high-performance computing (HPC) clusters by interfacing with SGE or SLURM, and data can be exported from MAS in a variety of formats (FASTA, GenBank, GFF, and Excel). Conclusions: MAS streamlines and provides structure to manual functional annotation projects. MAS enhances the ability of users to generate, interpret, and compare results from multiple tools. The structure that MAS provides can improve project organization and reduce annotation errors. MAS is ideal for team-based annotation projects because it facilitates collaboration. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-021-08029-8.
7
Birney E. The International Human Genome Project. Hum Mol Genet 2021; 30:R161-R163. [PMID: 34264324; PMCID: PMC8490009; DOI: 10.1093/hmg/ddab198] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 07/07/2021] [Revised: 07/08/2021] [Accepted: 07/09/2021] [Indexed: 12/01/2022]
Abstract
The human genome project was conceived and executed as an international project, due to both pragmatic and principled reasons. This internationality has served the project well, with the resulting human genome being freely available for all researchers in all countries. Over time the reference human genome will likely have to evolve to a graph genome, and tap into more diverse sequences worldwide. A similar international mindset underpins data analysis for the interpretation of the human genome from basic to clinical research.
Affiliation(s)
- Ewan Birney
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
8
Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Technology dictates algorithms: recent developments in read alignment. Genome Biol 2021; 22:249. [PMID: 34446078; PMCID: PMC8390189; DOI: 10.1186/s13059-021-02443-7] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 10/15/2020] [Accepted: 07/28/2021] [Indexed: 01/08/2023]
Abstract
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
Affiliation(s)
- Mohammed Alser
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Jeremy Rotman
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Dhrithi Deshpande
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
- Kodi Taraszka
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Huwenbo Shi
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Pelin Icer Baykal
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Harry Taegyun Yang
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
  - Bioinformatics Interdepartmental Ph.D. Program, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Victor Xue
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Benjamin D Singer
  - Division of Pulmonary and Critical Care Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
  - Department of Biochemistry & Molecular Genetics, Northwestern University Feinberg School of Medicine, Chicago, USA
  - Simpson Querrey Institute for Epigenetics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Brunilda Balliu
  - Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
- David Koslicki
  - Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16801, USA
  - Biology Department, Pennsylvania State University, University Park, PA, 16801, USA
  - The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16801, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
  - The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
- Can Alkan
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Bilkent-Hacettepe Health Sciences and Technologies Program, Ankara, Turkey
- Onur Mutlu
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
9
Dida F, Yi G. Empirical evaluation of methods for de novo genome assembly. PeerJ Comput Sci 2021; 7:e636. [PMID: 34307867; PMCID: PMC8279138; DOI: 10.7717/peerj-cs.636] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 02/04/2021] [Accepted: 06/19/2021] [Indexed: 06/12/2023]
Abstract
Technologies for next-generation sequencing (NGS) have stimulated an exponential rise in high-throughput sequencing projects and resulted in the development of new read-assembly algorithms. A drastic reduction in the costs of generating short reads on the genomes of new organisms is attributable to recent advances in NGS technologies such as Ion Torrent, Illumina, and PacBio. Genome research has led to the creation of high-quality reference genomes for several organisms, and de novo assembly is a key initiative that has facilitated gene discovery and other studies. More powerful analytical algorithms are needed to work on the increasing amount of sequence data. We thoroughly compare de novo assembly algorithms so that new users can clearly understand them: overlap-layout-consensus, de Bruijn graph, string-graph-based assembly, and hybrid approaches. We also address the computational efficacy of each algorithm, the challenges faced by the assembly tools used, and the impact of repeats. Our results compare the relative performance of the different assemblers and other related assembly differences with and without the reference genome. We hope that this analysis will further the application of de novo sequences and help the future growth of assembly algorithms.
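As an illustration of the de Bruijn graph approach named in this abstract, the following is a minimal Python sketch, not the code evaluated by the authors. The function name `de_bruijn_graph` and the toy reads are assumptions made for this example. Each k-mer in a read contributes one edge from its (k-1)-mer prefix to its (k-1)-mer suffix.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph from reads (hypothetical helper for illustration).

    Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix.
    Returns a dict mapping each node to a sorted list of successor nodes.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


# Two overlapping toy reads collapse into a single path AC -> CG -> GT -> TA.
path_graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)

# A tandem repeat produces a cycle (AT -> TA -> AT), which is one reason
# repeats make it hard to choose a unique walk through the graph.
repeat_graph = de_bruijn_graph(["ATATAT"], k=3)
```

A real assembler would additionally track k-mer multiplicities, handle reverse complements and sequencing errors, and search for walks through the graph; none of that is attempted here.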
Affiliation(s)
- Firaol Dida
  - Department of Multimedia Engineering, Dongguk University, Seoul, South Korea
- Gangman Yi
  - Department of Multimedia Engineering, Dongguk University, Seoul, South Korea
10
Zhang W, Kang Y, Dai X, Xu S, Zhao PX. PIP-SNP: a pipeline for processing SNP data featured as linkage disequilibrium bin mapping, genotype imputing and marker synthesizing. NAR Genom Bioinform 2021; 3:lqab060. [PMID: 34235432; PMCID: PMC8256826; DOI: 10.1093/nargab/lqab060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2020] [Revised: 05/15/2021] [Accepted: 06/14/2021] [Indexed: 11/12/2022]
Abstract
Genome-wide association study data analyses often face two significant challenges: (i) high dimensionality of single-nucleotide polymorphism (SNP) genotypes and (ii) imputation of missing values. SNPs are not independent due to physical linkage and natural selection. The correlation of nearby SNPs is known as linkage disequilibrium (LD), which can be used for LD conceptual SNP bin mapping, missing genotype inferencing and SNP dimension reduction. We used a stochastic process to describe the SNP signals and proposed two types of autocorrelations to measure nearby SNPs' information redundancy. Based on the calculated autocorrelation coefficients, we constructed LD bins. We adopted a k-nearest neighbors algorithm (kNN) to impute the missing genotypes. We proposed several novel methods to find the optimal synthetic marker to represent the SNP bin. We also proposed methods to evaluate the information loss or information conservation between using the original genome-wide markers and using dimension-reduced synthetic markers. Our performance assessments on the real-life SNP data from a rice recombinant inbred line (RIL) population and a rice HapMap project show that the new methods produce satisfactory results. We implemented these functional modules in C/C++ and streamlined them into a web-based pipeline named PIP-SNP (https://bioinfo.noble.org/PIP_SNP/) for processing SNP data.
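The k-nearest neighbors imputation step mentioned in this abstract can be sketched in Python as follows. This is a generic illustration under stated assumptions, not the PIP-SNP implementation (which the abstract says is written in C/C++): genotypes are assumed to be coded 0/1/2 with `None` for a missing call, distance is assumed to be the fraction of mismatching calls over SNPs observed in both samples, and the function name `knn_impute` is hypothetical.

```python
from collections import Counter


def knn_impute(genotypes, k=3):
    """Fill missing genotype calls by majority vote of the k nearest samples.

    genotypes: list of per-sample lists; None marks a missing call.
    Returns a new matrix; the input is left unchanged.
    """
    def distance(x, y):
        # Mismatch rate over SNPs observed in both samples (assumed metric).
        shared = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
        if not shared:
            return float("inf")
        return sum(a != b for a, b in shared) / len(shared)

    imputed = [row[:] for row in genotypes]
    for i, row in enumerate(genotypes):
        for j, call in enumerate(row):
            if call is not None:
                continue
            # Candidate donors: other samples with an observed call at SNP j.
            donors = sorted(
                (distance(row, other), idx)
                for idx, other in enumerate(genotypes)
                if idx != i and other[j] is not None
            )
            votes = Counter(genotypes[idx][j] for _, idx in donors[:k])
            if votes:
                imputed[i][j] = votes.most_common(1)[0][0]
    return imputed


# Toy data: sample 0 resembles samples 1 and 2, so its missing call follows them.
samples = [
    [0, 0, 1, None],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 0, 0],
    [2, 2, 0, 0],
]
filled = knn_impute(samples, k=2)
```

In a pipeline like the one described, neighbors would more plausibly be chosen within an LD bin rather than across all SNPs; that restriction is omitted here for brevity.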
Affiliation(s)
- Wenchao Zhang
  - Noble Research Institute LLC, 2510 Sam Noble Parkway, Ardmore, OK 73401, USA
- Yun Kang
  - Noble Research Institute LLC, 2510 Sam Noble Parkway, Ardmore, OK 73401, USA
- Xinbin Dai
  - Noble Research Institute LLC, 2510 Sam Noble Parkway, Ardmore, OK 73401, USA
- Shizhong Xu
  - Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
- Patrick X Zhao
  - Noble Research Institute LLC, 2510 Sam Noble Parkway, Ardmore, OK 73401, USA
11
Berger B, Waterman MS, Yu YW. Levenshtein Distance, Sequence Comparison and Biological Database Search. IEEE Transactions on Information Theory 2021; 67:3287-3294. [PMID: 34257466; PMCID: PMC8274556; DOI: 10.1109/tit.2020.2996543] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Indexed: 06/12/2023]
Abstract
Levenshtein edit distance has played a central role, both past and present, in sequence alignment in particular and biological database similarity search in general. We start our review with a history of dynamic programming algorithms for computing Levenshtein distance and sequence alignments. Next, we describe how those algorithms led to heuristics employed in the most widely used software in bioinformatics, BLAST, a program to search DNA and protein databases for evolutionarily relevant similarities. More recently, the advent of modern genomic sequencing and the volume of data it generates have resulted in a return to the problem of local alignment. We conclude with how the mathematical formulation of Levenshtein distance as a metric made possible additional optimizations to similarity search in biological contexts. These modern optimizations are built around the low metric entropy and fractional dimensionality of biological databases, enabling orders of magnitude acceleration of biological similarity search.
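The dynamic programming computation of Levenshtein distance referred to in this abstract can be sketched as below. This is the standard textbook recurrence with unit costs for insertion, deletion, and substitution, written as a minimal Python example; the function name `levenshtein` is chosen for illustration and the code is not taken from the review.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insertion, deletion, substitution."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


d = levenshtein("kitten", "sitting")
```

Because only two rows of the table are kept, memory use is proportional to the shorter string while running time remains proportional to the product of the two lengths. The function is symmetric and returns zero only for identical strings, which is consistent with the metric properties the abstract highlights.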
Affiliation(s)
- Bonnie Berger
  - Department of Mathematics and Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, and also with the Department of Computer Science and AI Lab, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
- Michael S Waterman
  - Quantitative and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089 USA
- Yun William Yu
  - Department of Mathematics, University of Toronto, Toronto, ON M5S 2E4, Canada, and also with the Department of Computer and Mathematical Sciences, University of Toronto at Scarborough, Toronto, ON M1C 1A4, Canada
12
Ortiz-Aguirre JP, Velandia-Vargas EA, Rodríguez-Bohorquez OM, Amaya-Ramírez D, Bernal-Estévez D, Parra-López CA. Inmunoterapia personalizada contra el cáncer basada en neoantígenos. Revisión de la literatura [Personalized neoantigen-based cancer immunotherapy. A literature review]. Revista de la Facultad de Medicina 2021. [DOI: 10.15446/revfacmed.v69n3.81633] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Introduction. Advances in cancer immunotherapy and the clinical response of patients who have received this type of therapy have made it the fourth pillar of cancer treatment.
Objective. To briefly describe the biological rationale of personalized neoantigen-based cancer immunotherapy, the current prospects for its development, and some clinical results of this therapy.
Materials and methods. A literature search was conducted in PubMed, Scopus, and EBSCO using the following search strategy: type of articles: original experimental studies, clinical trials, and narrative and systematic reviews on methods for identifying mutations generated in tumors and on cancer immunotherapy strategies with neoantigen-based vaccines; study population: humans and animal models; publication period: January 1989 to December 2019; language: English and Spanish; search terms: “Immunotherapy”, “Neoplasms”, “Mutation”, and “Cancer Vaccines”.
Results. The initial search yielded 1344 records; after removing duplicates (n=176), 780 were excluded after reading their abstract and title, and the full text of 338 was assessed to verify which met the inclusion criteria; 73 studies were finally selected for full analysis. All retrieved articles were published in English and were conducted mainly in the USA (43.83%) and Germany (23.65%). Of the original studies (n=43), 20 were conducted only in humans, 9 only in animals, 2 in both models, and 12 used in silico methodology.
Conclusion. Personalized cancer immunotherapy with vaccines based on tumor neoantigens is firmly becoming a new alternative for treating cancer. However, to achieve its proper implementation, it is necessary to use it in combination with conventional treatments, generate more knowledge that helps clarify the immunobiology of cancer, and reduce the costs associated with its production.
13
Touati R, Tajouri A, Mesaoudi I, Oueslati AE, Lachiri Z, Kharrat M. New methodology for repetitive sequences identification in human X and Y chromosomes. Biomed Signal Process Control 2021; 64:102207. [PMID: 33101452; PMCID: PMC7572123; DOI: 10.1016/j.bspc.2020.102207] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/20/2019] [Revised: 07/23/2020] [Accepted: 09/01/2020] [Indexed: 11/24/2022]
Abstract
Repetitive DNA sequences make up the major proportion of the human genome and of the genomes of other species. The importance of each type of repetitive DNA depends on many factors, such as its structural and functional roles and the positions, lengths, and numbers of the repeats. Whether such sequences are conserved at different chromosomal locations remains an open question for biologists. Locating them despite their great variability, and finding novel repetitive sequences, remains a challenging task. To sidestep this problem, we developed a new method based on signal and image processing tools. With this method we could find repetitive patterns in DNA images regardless of the repetition length. The technique appears to be more efficient at detecting new repetitive sequences than bioinformatics tools, because classical tools perform poorly in the presence of mutations (insertions or deletions), whereas modifying one or a few pixels in the image does not affect the global form of the repetitive pattern. We thus generated a new database of repetitive patterns containing tandem and dispersed repeated sequences. The highly repetitive sequences that we identified in the X and Y chromosomes are shown to be located in other human chromosomes or in other genomes. The generated data are then used as input to a convolutional neural network classifier. The resulting system is efficient, with an average recognition score of 94.4%.
Affiliation(s)
- Rabeb Touati
  - University of Tunis El Manar, LR99ES10 Human Genetics Laboratory, Faculty of Medicine of Tunis (FMT), Tunisia
  - University of Tunis El Manar, SITI Laboratory, National School of Engineers of Tunis, BP 37, Le Belvédère, 1002, Tunis, Tunisia
- Asma Tajouri
  - University of Tunis El Manar, LR99ES10 Human Genetics Laboratory, Faculty of Medicine of Tunis (FMT), Tunisia
- Imen Mesaoudi
  - University of Tunis El Manar, SITI Laboratory, National School of Engineers of Tunis, BP 37, Le Belvédère, 1002, Tunis, Tunisia
- Afef Elloumi Oueslati
  - University of Tunis El Manar, SITI Laboratory, National School of Engineers of Tunis, BP 37, Le Belvédère, 1002, Tunis, Tunisia
- Zied Lachiri
  - University of Tunis El Manar, SITI Laboratory, National School of Engineers of Tunis, BP 37, Le Belvédère, 1002, Tunis, Tunisia
- Maher Kharrat
  - University of Tunis El Manar, LR99ES10 Human Genetics Laboratory, Faculty of Medicine of Tunis (FMT), Tunisia
|
14
|
Modern Approaches for Transcriptome Analyses in Plants. Advances in Experimental Medicine and Biology 2021; 1346:11-50. [DOI: 10.1007/978-3-030-80352-0_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
15
|
Lee N, Park MJ, Song W, Jeon K, Jeong S. Currently Applied Molecular Assays for Identifying ESR1 Mutations in Patients with Advanced Breast Cancer. Int J Mol Sci 2020; 21:ijms21228807. [PMID: 33233830 PMCID: PMC7699999 DOI: 10.3390/ijms21228807] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/27/2020] [Revised: 11/17/2020] [Accepted: 11/19/2020] [Indexed: 12/11/2022]
Abstract
Approximately 70% of breast cancers, the leading cause of cancer-related mortality worldwide, are positive for the estrogen receptor (ER). Treatment of patients with luminal subtypes is mainly based on endocrine therapy. However, ER positivity is reduced and ESR1 mutations play an important role in resistance to endocrine therapy, leading to advanced breast cancer. Various methodologies for the detection of ESR1 mutations have been developed; the most commonly used are next-generation sequencing (NGS)-based assays (50.0%), followed by droplet digital PCR (ddPCR) (45.5%). Regarding the sample type, tissue (50.0%) was more frequently used than plasma (27.3%). However, plasma (46.2%) became the most used sample type in 2016-2019, in contrast to 2012-2015 (22.2%). In 2016-2019, ddPCR (61.5%) was used more often than NGS (30.8%) and more often than it had been in 2012-2015. The easy accessibility, non-invasiveness, and demonstrated usefulness with high sensitivity of ddPCR using plasma have changed the trends. Using these assays requires a comprehensive understanding of their principles, advantages, vulnerabilities, and precautions for interpretation. In the future, advanced NGS platforms and modified ddPCR will benefit patients by efficiently facilitating treatment decisions based on information regarding ESR1 mutations.
Affiliation(s)
- Nuri Lee
  - Department of Laboratory Medicine, Kangnam Sacred Heart Hospital, Hallym University College of Medicine, Seoul 07440, Korea
- Min-Jeong Park
  - Department of Laboratory Medicine, Kangnam Sacred Heart Hospital, Hallym University College of Medicine, Seoul 07440, Korea
- Wonkeun Song
  - Department of Laboratory Medicine, Kangnam Sacred Heart Hospital, Hallym University College of Medicine, Seoul 07440, Korea
- Kibum Jeon
  - Department of Laboratory Medicine, Hangang Sacred Heart Hospital, Hallym University College of Medicine, Seoul 07440, Korea
- Seri Jeong
  - Department of Laboratory Medicine, Kangnam Sacred Heart Hospital, Hallym University College of Medicine, Seoul 07440, Korea
  - Correspondence: ; Tel.: +82-845-5305
|
16
|
Quan W, Guan D, Quan G, Liu B, Wang Y. Short Read Alignment Based on Maximal Approximate Match Seeds. Front Mol Biosci 2020; 7:572934. [PMID: 33251246 PMCID: PMC7674947 DOI: 10.3389/fmolb.2020.572934] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2020] [Accepted: 10/09/2020] [Indexed: 11/13/2022]
Abstract
Sequence alignment is a critical step in many genomic studies, such as variant calling, quantitative transcriptome analysis (RNA-seq), and metagenomic sequence classification. However, alignment performance is strongly affected by repetitive sequences in the reference genome, which are widespread in species from bacteria to mammals. Aligning repetitive sequences can produce a very large number of candidate locations, creating a heavy computational burden. Most alignment tools therefore simply discard highly repetitive seeds, but this may cause the true alignment to be missed. Using maximal exact matches (MEMs) as seeds is an option, but MEM seeds may fail because of sequencing errors or genomic variations within the seeds. Here, we propose a novel sequence alignment algorithm, named MAM, which can efficiently align short DNA sequences. MAM first builds a modified Burrows-Wheeler transform (BWT) structure of a reference genome to accelerate approximate seed matching. Then, MAM uses maximal approximate match (MAM) seeds to reduce the number of candidate locations. Finally, MAM applies affine-gap-penalty dynamic programming to extend the MAM seeds. Experimental results on simulated and real sequencing datasets show that MAM is faster than other state-of-the-art alignment tools. The source code is available at https://github.com/weiquan/mam.
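The final step named in this abstract, dynamic programming with an affine gap penalty, can be illustrated with a minimal Python sketch. It is not taken from the MAM source code: the function name, the scoring values, and the choice of global (end-to-end) alignment are all assumptions made for illustration, using the textbook three-matrix (Gotoh) recurrences.

```python
def affine_gap_score(a, b, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Global alignment score under an affine gap model (illustrative only).

    A gap of length k costs gap_open + k * gap_extend. The default scores
    are hypothetical, not the parameters used by any particular aligner.
    """
    neg = float("-inf")
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] aligned to a gap; Y: b[j-1] aligned to a gap.
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a new gap pays gap_open once; extending an existing one does not.
            X[i][j] = max(M[i - 1][j] + gap_open + gap_extend,
                          Y[i - 1][j] + gap_open + gap_extend,
                          X[i - 1][j] + gap_extend)
            Y[i][j] = max(M[i][j - 1] + gap_open + gap_extend,
                          X[i][j - 1] + gap_open + gap_extend,
                          Y[i][j - 1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

With these assumed scores, a single four-base gap costs 6, whereas four separate one-base gaps would cost 12, which is why affine models favour contiguous indels. A practical aligner would typically restrict the computation to a band around each seed instead of filling the full matrices.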
Affiliation(s)
- Wei Quan
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Dengfeng Guan
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
  - Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Guangri Quan
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Bo Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
  - *Correspondence: Yadong Wang
|
17
|
Garimella KV, Iqbal Z, Krause MA, Campino S, Kekre M, Drury E, Kwiatkowski D, Sá JM, Wellems TE, McVean G. Detection of simple and complex de novo mutations with multiple reference sequences. Genome Res 2020; 30:1154-1169. [PMID: 32817236 PMCID: PMC7462078 DOI: 10.1101/gr.255505.119] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 08/02/2019] [Accepted: 07/17/2020] [Indexed: 12/25/2022]
Abstract
The characterization of de novo mutations in regions of high sequence and structural diversity from whole-genome sequencing data remains highly challenging. Complex structural variants tend to arise in regions of high repetitiveness and low complexity, challenging both de novo assembly, in which short reads do not capture the long-range context required for resolution, and mapping approaches, in which improper alignment of reads to a reference genome that is highly diverged from that of the sample can lead to false or partial calls. Long-read technologies can potentially solve such problems but are currently unfeasible to use at scale. Here we present Corticall, a graph-based method that combines the advantages of multiple technologies and prior data sources to detect arbitrary classes of genetic variant. We construct multisample, colored de Bruijn graphs from short-read data for all samples, align long-read–derived haplotypes and multiple reference data sources to restore graph connectivity information, and call variants using graph path-finding algorithms and a model for simultaneous alignment and recombination. We validate and evaluate the approach using extensive simulations and use it to characterize the rate and spectrum of de novo mutation events in 119 progeny from four Plasmodium falciparum experimental crosses, using long-read data on the parents to inform reconstructions of the progeny and to detect several known and novel nonallelic homologous recombination events.
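The multisample, colored de Bruijn graph mentioned in this abstract can be pictured with a small Python sketch. It shows only the data structure, under simplifying assumptions: the function names and sample labels are hypothetical, reverse complements and sequencing errors are ignored, and nothing here reproduces Corticall's actual variant calling.

```python
from collections import defaultdict


def colored_de_bruijn(samples, k):
    """Build a toy colored de Bruijn graph.

    `samples` maps a sample name (the "color") to a list of reads.
    Each k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer
    suffix, annotated with the set of samples in which it occurs.
    """
    edges = defaultdict(set)
    for name, reads in samples.items():
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                edges[(kmer[:-1], kmer[1:])].add(name)
    return dict(edges)


def private_edges(edges, sample):
    """Edges seen only in `sample`, a crude stand-in for candidate novel sequence."""
    return sorted(e for e, colors in edges.items() if colors == {sample})


toy = {"parent": ["ACGTA"], "child": ["ACGGA"]}
graph = colored_de_bruijn(toy, k=3)
child_only = private_edges(graph, "child")
```

In this toy input the child differs from the parent at one base, so the edges carrying only the child's color trace the path around that difference. Real callers must additionally separate such paths from sequencing errors and from variation inherited from the other parent.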
Affiliation(s)
- Kiran V Garimella
  - Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, Oxfordshire, OX3 7LF, United Kingdom
- Zamin Iqbal
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom
  - European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, United Kingdom
- Michael A Krause
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom
  - The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
  - Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Susana Campino
  - The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
- Mihir Kekre
  - The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
- Eleanor Drury
  - The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
- Dominic Kwiatkowski
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, Oxfordshire, OX3 7LF, United Kingdom
  - The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
- Juliana M Sá
  - Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Thomas E Wellems
  - Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
- Gil McVean
  - Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, Oxfordshire, OX3 7LF, United Kingdom
|
18
|
Park SY, Jeon J, Kim JA, Jeon MJ, Jeong MH, Kim Y, Lee Y, Chung H, Lee YH, Kim S. Draft Genome Sequence of Alternaria alternata JS-1623, a Fungal Endophyte of Abies koreana. Mycobiology 2020; 48:240-244. [PMID: 37970559 PMCID: PMC10635108 DOI: 10.1080/12298093.2020.1756134] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 03/17/2020] [Revised: 03/31/2020] [Accepted: 04/08/2020] [Indexed: 11/17/2023]
Abstract
Alternaria alternata JS-1623 is an endophytic fungus isolated from stem tissue of Korean fir, Abies koreana. Ethyl acetate extracts of culture filtrates exhibited anti-inflammatory activity in LPS-induced microglial BV-2 cells without cytotoxicity. Here we report a 33.67 Mb genome assembly of JS-1623 comprising 13 scaffolds, with an N50 of 4.96 Mb and 92.41% BUSCO completeness. The GC content was 50.97%. Among the 11,197 annotated genes, gene families related to the biosynthesis of secondary metabolites or to transcription factors were identified.
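The two assembly statistics quoted here, N50 and GC content, follow from standard definitions and can be sketched in a few lines of Python. The function names and the toy inputs are hypothetical, and the JS-1623 scaffold lengths themselves are not given in the abstract.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def gc_content(seq):
    """Percentage of G and C among unambiguous bases (N and other codes are ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_lengths)
```

For the toy lengths the total is 32, and the two longest sequences (8 + 8) already reach half of it, so the N50 is 8. An N50 of 4.96 Mb for a 33.67 Mb assembly likewise means that scaffolds of at least 4.96 Mb account for half or more of the assembled bases.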
Affiliation(s)
- Sook-Young Park
  - Department of Plant Medicine, Sunchon National University, Suncheon, Korea
- Jongbum Jeon
  - Department of Agricultural Biotechnology, Interdisciplinary Program in Agricultural Genomics, Center for Fungal Genetic Resources, and Center for Fungal Pathogenesis, Seoul National University, Seoul, Korea
- Jung A. Kim
  - Microbiology Resources Division, National Institute of Biological Resources, Incheon, Korea
- Mi Jin Jeon
  - Microbiology Resources Division, National Institute of Biological Resources, Incheon, Korea
- Min-Hye Jeong
  - Department of Plant Medicine, Sunchon National University, Suncheon, Korea
- Youngmin Kim
  - Department of Plant Medicine, Sunchon National University, Suncheon, Korea
- Yerim Lee
  - Department of Plant Medicine, Sunchon National University, Suncheon, Korea
- Hyunjung Chung
  - Department of Agricultural Biotechnology, Interdisciplinary Program in Agricultural Genomics, Center for Fungal Genetic Resources, and Center for Fungal Pathogenesis, Seoul National University, Seoul, Korea
- Yong-Hwan Lee
  - Department of Agricultural Biotechnology, Interdisciplinary Program in Agricultural Genomics, Center for Fungal Genetic Resources, and Center for Fungal Pathogenesis, Seoul National University, Seoul, Korea
- Soonok Kim
  - Microbiology Resources Division, National Institute of Biological Resources, Incheon, Korea
|
19
|
Pereira R, Oliveira J, Sousa M. Bioinformatics and Computational Tools for Next-Generation Sequencing Analysis in Clinical Genetics. J Clin Med 2020; 9:E132. [PMID: 31947757 PMCID: PMC7019349 DOI: 10.3390/jcm9010132] [Citation(s) in RCA: 119] [Impact Index Per Article: 23.8] [Received: 11/18/2019] [Revised: 12/15/2019] [Accepted: 12/30/2019] [Indexed: 12/13/2022]
Abstract
Clinical genetics plays an important role in the healthcare system by providing a definitive diagnosis for many rare syndromes. It can also influence genetic prevention and disease prognosis and assist in selecting the best care or treatment options for patients. Next-generation sequencing (NGS) has transformed clinical genetics, making it possible to analyze hundreds of genes at unprecedented speed and at a lower price than conventional Sanger sequencing. Despite the growing literature concerning NGS in a clinical setting, a gap remains among (bio)informaticians, molecular geneticists and clinicians; this review aims to fill it by presenting a general overview of NGS technology and workflow. First, we review the current NGS platforms, focusing on the two main platforms, Illumina and Ion Torrent, and discussing the main strengths and weaknesses intrinsic to each. Next, the NGS analytical bioinformatic pipelines are dissected, with some emphasis on the algorithms commonly used to generate and process data and to analyze sequence variants. Finally, the main challenges around NGS bioinformatics are placed in perspective for future developments. Even with the huge achievements made in NGS technology and bioinformatics, further improvements in bioinformatic algorithms are still required to deal with complex and genetically heterogeneous disorders.
Affiliation(s)
- Rute Pereira
  - Laboratory of Cell Biology, Department of Microscopy, Institute of Biomedical Sciences Abel Salazar (ICBAS), University of Porto (UP), 4050-313 Porto, Portugal
  - Biology and Genetics of Reproduction Unit, Multidisciplinary Unit for Biomedical Research (UMIB), ICBAS-UP, 4050-313 Porto, Portugal
- Jorge Oliveira
  - Biology and Genetics of Reproduction Unit, Multidisciplinary Unit for Biomedical Research (UMIB), ICBAS-UP, 4050-313 Porto, Portugal
  - UnIGENe and CGPP–Centre for Predictive and Preventive Genetics-Institute for Molecular and Cell Biology (IBMC), i3S-Institute for Research and Innovation in Health-UP, 4200-135 Porto, Portugal
- Mário Sousa
  - Laboratory of Cell Biology, Department of Microscopy, Institute of Biomedical Sciences Abel Salazar (ICBAS), University of Porto (UP), 4050-313 Porto, Portugal
  - Biology and Genetics of Reproduction Unit, Multidisciplinary Unit for Biomedical Research (UMIB), ICBAS-UP, 4050-313 Porto, Portugal
|
20
|
Paul AJ, Lawrence D, Song M, Lim SH, Pan C, Ahn TH. Using Apache Spark on genome assembly for scalable overlap-graph reduction. Hum Genomics 2019; 13:48. [PMID: 31639049 PMCID: PMC6805285 DOI: 10.1186/s40246-019-0227-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 05/06/2023]
Abstract
Background De novo genome assembly builds the genome of a specimen from overlaps between genomic fragments, without relying on a reference sequence. Sequence fragments (called reads) are assembled into contigs and scaffolds according to their overlaps. The quality of a de novo assembly depends on its length and continuity. To enable faster and more accurate assembly, sequencing techniques such as high-throughput next-generation sequencing and long-read third-generation sequencing have been proposed. However, these techniques require large amounts of computer memory when very large overlap graphs are resolved, and the computation is challenging to parallelize. Results To address these limitations, we propose an innovative algorithmic approach called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction algorithms using Apache Spark. Its implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently reduces the number of edges on very large graph paths by using the scalable features of the Apache Spark graph processing libraries GraphX and GraphFrames. Conclusions We share the algorithms and the experimental results at our project website, https://github.com/BioHPC/SORA. We evaluated SORA with human genome samples. First, it processed a graph of nearly one billion edges on a distributed cloud cluster. Second, it processed mid-to-small size graphs on a single workstation within a short time frame. Overall, SORA scaled linearly as the number of computing instances increased.
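One classic string graph reduction step is transitive edge reduction, which removes an overlap edge when a two-step path already implies it. The Python sketch below illustrates that idea on a single machine; it is an assumption that this step is representative of what the abstract calls string graph reduction, the function name is hypothetical, and none of it uses Spark, GraphX, or GraphFrames as SORA does.

```python
from collections import defaultdict


def transitive_reduction(edges):
    """Drop every edge u->w for which some v gives u->v and v->w.

    `edges` is an iterable of (source, target) read identifiers. The sketch
    assumes an acyclic overlap graph and ignores overlap lengths, which a
    real assembler would take into account.
    """
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    redundant = set()
    for u, targets in succ.items():
        for v in targets:
            for w in succ.get(v, ()):
                if w != v and w in targets:
                    redundant.add((u, w))
    return {(u, v) for u, v in edges} - redundant


overlaps = [("r1", "r2"), ("r2", "r3"), ("r1", "r3"), ("r3", "r4")]
reduced = transitive_reduction(overlaps)
```

Here the edge r1 to r3 is dropped because the path r1, r2, r3 already implies it, leaving a simple chain. The nested loop over each node's neighbours is the part a distributed framework would spread across workers, for example by joining the edge table with itself.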
Affiliation(s)
- Alexander J Paul
  - Bioinformatics and Computational Biology Program, Saint Louis University, St. Louis, MO, USA
- Dylan Lawrence
  - Computational and Systems Biology Program, Washington University in St. Louis, St. Louis, MO, USA
- Myoungkyu Song
  - Department of Computer Science, University of Nebraska at Omaha, Omaha, NE, USA
- Seung-Hwan Lim
  - National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN, USA
- Chongle Pan
  - School of Computer Science, University of Oklahoma, Norman, OK, USA
- Tae-Hyuk Ahn
  - Bioinformatics and Computational Biology Program, Saint Louis University, St. Louis, MO, USA
  - Department of Computer Science, Saint Louis University, St. Louis, MO, USA
|
21
|
Lightbody G, Haberland V, Browne F, Taggart L, Zheng H, Parkes E, Blayney JK. Review of applications of high-throughput sequencing in personalized medicine: barriers and facilitators of future progress in research and clinical application. Brief Bioinform 2019; 20:1795-1811. [PMID: 30084865 PMCID: PMC6917217 DOI: 10.1093/bib/bby051] [Citation(s) in RCA: 99] [Impact Index Per Article: 16.5] [Received: 01/30/2018] [Revised: 05/01/2018] [Indexed: 12/28/2022]
Abstract
There has been exponential growth in the performance and output of sequencing technologies (omics data), with full genome sequencing now producing gigabases of reads on a daily basis. These data may hold the promise of personalized medicine, leading to routinely available sequencing tests that can guide patient treatment decisions. In the era of high-throughput sequencing (HTS), computational considerations, data governance and clinical translation are the greatest rate-limiting steps. To ensure that the analysis, management and interpretation of such extensive omics data are exploited to their full potential, key factors, including sample sourcing, technology selection and computational expertise and resources, need to be considered, leading to an integrated set of high-performance tools and systems. This article provides an up-to-date overview of the evolution of HTS and the accompanying tools, infrastructure and data management approaches that are emerging in this space, which, if used within a multidisciplinary context, may ultimately facilitate the development of personalized medicine.
Affiliation(s)
- Gaye Lightbody
  - School of Computing, Ulster University, Newtownabbey, UK
- Valeriia Haberland
  - MRC Integrative Epidemiology Unit, Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
- Fiona Browne
  - School of Computing, Ulster University, Newtownabbey, UK
- Huiru Zheng
  - School of Computing, Ulster University, Newtownabbey, UK
- Eileen Parkes
  - Centre for Cancer Research & Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University, Belfast, UK
- Jaine K Blayney
  - Centre for Cancer Research & Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University, Belfast, UK
|
22
|
Accurate high throughput alignment via line sweep-based seed processing. Nat Commun 2019; 10:1939. [PMID: 31028275 PMCID: PMC6486643 DOI: 10.1038/s41467-019-09977-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 10/05/2018] [Accepted: 04/10/2019] [Indexed: 11/08/2022]
Abstract
Accurate and fast aligners are required to handle the steadily increasing volume of sequencing data. Here we present an approach, based on a universal three-stage scheme, that aligns short reads (Illumina) as well as long reads (Pacific Biosciences, ultralong Oxford Nanopore) efficiently and with high accuracy. It is also suitable for the discovery of insertions and deletions that originate from structural variants. We comprehensively compare our approach to other state-of-the-art aligners in order to confirm its performance with respect to accuracy and runtime. As part of our algorithmic scheme, we introduce two line sweep-based techniques called "strip of consideration" and "seed harmonization". These techniques replace chaining and do not rely on any specially tailored data structures. Additionally, we propose a refined form of seeding built on the FMD-index.
|
23
|
Abstract
Background Single nucleotide polymorphisms (SNPs) have been applied as important molecular markers in genetics and breeding studies. The rapid advance of next generation sequencing (NGS) provides a high-throughput means of SNP discovery. However, SNP development is limited by the availability of reliable SNP discovery methods. In particular, the optimum assembler and SNP caller for accurate SNP prediction from next generation sequencing data are not known. Results Herein we performed SNP prediction based on RNA-seq data of peach and mandarin peel tissue in a comprehensive comparison of two paired-end read lengths (125 bp and 150 bp), five assemblers (Trinity, IDBA, oases, SOAPdenovo, Trans-abyss) and two SNP callers (GATK and GBS). The predicted SNPs were compared with the authentic SNPs identified via PCR amplification followed by gene cloning and sequencing. A total of 40 and 240 authentic SNPs were present in five anthocyanin biosynthesis related genes in peach and in nine carotenogenic genes in mandarin, respectively. Putative SNPs predicted from the same RNA-seq data with different strategies led to quite divergent results. The rate of false positive SNPs was significantly lower when the paired-end read length was 150 bp than when it was 125 bp. Trinity was superior to the other four assemblers, and GATK was substantially superior to GBS owing to a low rate of missing authentic SNPs. The combination of the assembler Trinity, the SNP caller GATK, and the paired-end read length of 150 bp had the best performance in SNP discovery, with 100% accuracy in both the peach and the mandarin cases. This strategy was applied to the characterization of SNPs in peach and mandarin transcriptomes.
Conclusions By comparing authentic SNPs obtained by the PCR cloning strategy with putative SNPs predicted from different combinations of five assemblers, two SNP callers, and two paired-end read lengths, we provide a reliable and efficient strategy for SNP discovery from RNA-seq data: Trinity-GATK with 150 bp paired-end reads. This strategy discovered SNPs at 100% accuracy in the peach and mandarin cases and might be applicable to a wide range of plants and other organisms. Electronic supplementary material The online version of this article (10.1186/s12864-019-5533-4) contains supplementary material, which is available to authorized users.
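The comparison of predicted SNPs against authentic SNPs described above amounts to set arithmetic, sketched here in Python. The metric definitions are assumptions, since the abstract does not state them: the false positive rate is taken as the fraction of predicted SNPs that are not authentic, and the missing rate as the fraction of authentic SNPs that were not predicted. The function name and the (gene, position) tuples are hypothetical.

```python
def evaluate_snp_calls(predicted, authentic):
    """Compare predicted SNPs with a validated (authentic) set.

    Both arguments are iterables of hashable SNP identifiers, for example
    (gene, position) tuples. The metric definitions are assumed here, not
    taken from the study.
    """
    predicted, authentic = set(predicted), set(authentic)
    false_pos = predicted - authentic
    missed = authentic - predicted
    return {
        "true_positives": len(predicted & authentic),
        "false_positive_rate": len(false_pos) / len(predicted) if predicted else 0.0,
        "missing_rate": len(missed) / len(authentic) if authentic else 0.0,
    }


predicted_snps = {("g1", 10), ("g1", 25), ("g2", 7)}
authentic_snps = {("g1", 10), ("g2", 7), ("g2", 40), ("g3", 3)}
report = evaluate_snp_calls(predicted_snps, authentic_snps)
```

In this made-up example one of three predictions is spurious and two of four authentic SNPs are missed. Under these definitions, a pipeline reaches 100% accuracy when both rates are zero.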
|
24
|
Butyrivibrio hungatei MB2003 Competes Effectively for Soluble Sugars Released by Butyrivibrio proteoclasticus B316T during Growth on Xylan or Pectin. Appl Environ Microbiol 2019; 85:AEM.02056-18. [PMID: 30478228 PMCID: PMC6344614 DOI: 10.1128/aem.02056-18] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 08/21/2018] [Accepted: 10/29/2018] [Indexed: 11/25/2022]
Abstract
Rumen bacterial species belonging to the genus Butyrivibrio are important degraders of plant polysaccharides, particularly hemicelluloses (arabinoxylans) and pectin. Currently, four species are recognized; they have very similar substrate utilization profiles, but little is known about how these microorganisms are able to coexist in the rumen. To investigate this question, Butyrivibrio hungatei MB2003 and Butyrivibrio proteoclasticus B316T were grown alone or in coculture on xylan or pectin, and their growth, release of sugars, fermentation end products, and transcriptomes were examined. In monocultures, B316T was able to grow well on xylan and pectin, while MB2003 was unable to utilize either of these insoluble substrates to support significant growth. Cocultures of B316T grown with MB2003 revealed that MB2003 showed growth almost equivalent to that of B316T when either xylan or pectin was supplied as the substrate.
The effect of coculture on the transcriptomes of B316T and MB2003 was assessed; B316T transcription was largely unaffected by the presence of MB2003, but MB2003 expressed a wide range of genes encoding proteins for carbohydrate degradation, central metabolism, oligosaccharide transport, and substrate assimilation, in order to compete with B316T for the released sugars. These results suggest that B316T has a role as an initiator of primary solubilization of xylan and pectin, while MB2003 competes effectively for the released soluble sugars to enable its growth and maintenance in the rumen. IMPORTANCE Feeding a future global population of 9 billion people and climate change are the primary challenges facing agriculture today. Ruminant livestock are important food-producing animals, and maximizing their productivity requires an understanding of their digestive systems and the roles played by rumen microbes in plant polysaccharide degradation. Butyrivibrio species are a phylogenetically diverse group of bacteria and are commonly found in the rumen, where they are a substantial source of polysaccharide-degrading enzymes for the depolymerization of lignocellulosic material. Our findings suggest that closely related species of Butyrivibrio have developed unique strategies for the degradation of plant fiber and the subsequent assimilation of carbohydrates in order to coexist in the competitive rumen environment. The identification of genes expressed during these competitive interactions gives further insight into the enzymatic machinery used by these bacteria as they degrade the xylan and pectin components of plant fiber.
|
25
|
Karamitros T, van Wilgenburg B, Wills M, Klenerman P, Magiorkinis G. Nanopore sequencing and full genome de novo assembly of human cytomegalovirus TB40/E reveals clonal diversity and structural variations. BMC Genomics 2018; 19:577. [PMID: 30068288 PMCID: PMC6090854 DOI: 10.1186/s12864-018-4949-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 03/06/2018] [Accepted: 07/19/2018] [Indexed: 12/15/2022]
Abstract
BACKGROUND Human cytomegalovirus (HCMV) has a double-stranded DNA genome of approximately 235 kbp that is structurally complex, including extended GC-rich repeated regions. Genomic recombination events are frequent in HCMV cultures but have also been observed in vivo. Thus, the assembly of HCMV whole genomes from technologies producing sequences shorter than 500 bp is technically challenging. Here we improved the reconstruction of HCMV full genomes by means of a hybrid, de novo genome-assembly bioinformatics pipeline applied to data generated from the recently released MinION MkI B sequencer from Oxford Nanopore Technologies. RESULTS The MinION run of the HCMV (strain TB40/E) library resulted in ~47,000 reads from a single R9 flowcell and in ~100× average read depth across the virus genome. We developed a novel, self-correcting bioinformatics algorithm to assemble the pooled HCMV genomes in three stages. In the first stage of the bioinformatics algorithm, long contigs (N50 = 21,892) of lower accuracy were reconstructed. In the second stage, short contigs (N50 = 5686) of higher accuracy were assembled, while in the final stage the high-quality contigs served as templates for the correction of the longer contigs, resulting in a high-accuracy, full genome assembly (N50 = 41,056). We were able to reconstruct a single representative haplotype without employing any scaffolding steps. The majority (98.8%) of the genomic features from the reference strain were accurately annotated on this full genome construct. Our method also allowed the detection of multiple alternative sub-genomic fragments and non-canonical structures suggesting rearrangement events between the unique (UL/US) and the repeated (T/IRL/S) genomic regions. CONCLUSIONS Third generation high-throughput sequencing technologies can accurately reconstruct full-length HCMV genomes including their low-complexity and highly repetitive regions. Full-length HCMV genomes could prove crucial in understanding the genetic determinants and viral evolution underpinning drug resistance, virulence and pathogenesis.
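The abstract above reports assembly quality as N50 values. As a reminder of what that statistic measures, the following is a minimal sketch, assuming the common definition: the length of the contig at which the cumulative length, counted from the longest contig downward, first reaches half of the total assembly length. The function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of the summed length.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (bp), for illustration only.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, a higher value indicates a more contiguous assembly but says nothing by itself about base-level accuracy, which is why the abstract reports accuracy separately for each stage.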
Affiliation(s)
- Timokratis Karamitros
- Department of Zoology, University of Oxford, Oxford, United Kingdom; Public Health Laboratories, Department of Microbiology, Hellenic Pasteur Institute, 127 Vas Sofias Ave, 11527, Athens, Greece.
| | - Bonnie van Wilgenburg
- Nuffield Department of Clinical Medicine, University of Oxford, Oxford, United Kingdom
| | - Mark Wills
- Department of Medicine, University of Cambridge, Cambridge, United Kingdom
| | - Paul Klenerman
- Nuffield Department of Clinical Medicine, University of Oxford, Oxford, United Kingdom; NIHR Biomedical Research Centre, Oxford, United Kingdom
| | - Gkikas Magiorkinis
- Department of Zoology, University of Oxford, Oxford, United Kingdom; Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National and Kapodistrian University of Athens, M. Asias 75 str., 11527, Athens, Greece.
| |
|
26
|
Chen Q, Lan C, Zhao L, Wang J, Chen B, Chen YPP. Recent advances in sequence assembly: principles and applications. Brief Funct Genomics 2018; 16:361-378. [PMID: 28453648 DOI: 10.1093/bfgp/elx006] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 12/20/2022] Open
Abstract
The application of advanced sequencing technologies and the rapid growth of various sequence data have led to increasing interest in DNA sequence assembly. However, repeats and polymorphism occur frequently in genomes, and each of these has different impacts on assembly. Further, many new applications for sequencing, such as metagenomics regarding multiple species, have emerged in recent years. These not only give rise to higher complexity but also prevent short-read assembly in an efficient way. This article reviews the theoretical foundations that underlie current mapping-based assembly and de novo-based assembly, and highlights the key issues and feasible solutions that need to be considered. It focuses on how individual processes, such as optimal k-mer determination and error correction in assembly, rely on intelligent strategies or high-performance computation. We also survey primary algorithms/software and offer a discussion on the emerging challenges in assembly.
|
27
|
Comparative Genomics Shows That Mycobacterium ulcerans Migration and Expansion Preceded the Rise of Buruli Ulcer in Southeastern Australia. Appl Environ Microbiol 2018; 84:AEM.02612-17. [PMID: 29439984 DOI: 10.1128/aem.02612-17] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Received: 12/06/2017] [Accepted: 01/25/2018] [Indexed: 02/07/2023] Open
Abstract
Since 2000, cases of the neglected tropical disease Buruli ulcer, caused by infection with Mycobacterium ulcerans, have increased 100-fold around Melbourne (population 4.4 million), the capital of Victoria, in temperate southeastern Australia. The reasons for this increase are unclear. Here, we used whole-genome sequence comparisons of 178 M. ulcerans isolates obtained primarily from human clinical specimens, spanning 70 years, to model the population dynamics of this pathogen from this region. Using phylogeographic and advanced Bayesian phylogenetic approaches, we found that there has been a migration of the pathogen from the east end of the state, beginning in the 1980s, 300 km west to the major human population center around Melbourne. This move was then followed by a significant increase in M. ulcerans population size. These analyses inform our thinking around Buruli ulcer transmission and control, indicating that M. ulcerans is introduced to a new environment and then expands, rather than it being from the awakening of a quiescent pathogen reservoir. IMPORTANCE Buruli ulcer is a destructive skin and soft tissue infection caused by Mycobacterium ulcerans and is characterized by progressive skin ulceration, which can lead to permanent disfigurement and long-term disability. Despite the majority of disease burden occurring in regions of West and central Africa, Buruli ulcer is also becoming increasingly common in southeastern Australia. Major impediments to controlling disease spread are incomplete understandings of the environmental reservoirs and modes of transmission of M. ulcerans. The significance of our research is that we used genomics to assess the population structure of this pathogen at the Australian continental scale. We have then reconstructed a historical bacterial spread and modeled demographic dynamics to reveal bacterial population expansion across southeastern Australia. These findings provide explanations for the observed epidemiological trends with Buruli ulcer and suggest possible management to control disease spread.
|
28
|
Genomic tools for behavioural ecologists to understand repeatable individual differences in behaviour. Nat Ecol Evol 2018; 2:944-955. [PMID: 29434349 DOI: 10.1038/s41559-017-0411-4] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Received: 05/14/2017] [Accepted: 11/10/2017] [Indexed: 12/28/2022]
Abstract
Behaviour is a key interface between an animal's genome and its environment. Repeatable individual differences in behaviour have been extensively documented in animals, but the molecular underpinnings of behavioural variation among individuals within natural populations remain largely unknown. Here, we offer a critical review of when molecular techniques may yield new insights, and we provide specific guidance on how and whether the latest tools available are appropriate given different resources, system and organismal constraints, and experimental designs. Integrating molecular genetic techniques with other strategies to study the proximal causes of behaviour provides opportunities to expand rapidly into new avenues of exploration. Such endeavours will enable us to better understand how repeatable individual differences in behaviour have evolved, how they are expressed and how they can be maintained within natural populations of animals.
|
29
|
Chen H, Jiang Y, Maxwell KN, Nathanson KL, Zhang N. Allele-specific copy number estimation by whole exome sequencing. Ann Appl Stat 2017; 11:1169-1192. [PMID: 28989557 DOI: 10.1214/17-aoas1043] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 12/25/2022]
Abstract
Whole exome sequencing is currently a technology of choice in large-scale cancer genomics studies, where the priority is to identify cancer-associated variants in coding regions. We describe a method for estimating allele-specific copy number using whole exome sequencing data from tumor and matched normal.
|
30
|
Baichoo S, Ouzounis CA. Computational complexity of algorithms for sequence comparison, short-read assembly and genome alignment. Biosystems 2017; 156-157:72-85. [PMID: 28392341 DOI: 10.1016/j.biosystems.2017.03.003] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 02/07/2017] [Revised: 03/21/2017] [Accepted: 03/22/2017] [Indexed: 12/12/2022]
Abstract
A multitude of algorithms for sequence comparison, short-read assembly and whole-genome alignment have been developed in the general context of molecular biology, to support technology development for high-throughput sequencing, numerous applications in genome biology and fundamental research on comparative genomics. The computational complexity of these algorithms has been previously reported in original research papers, yet this often neglected property has not been reviewed previously in a systematic manner and for a wider audience. We provide a review of space and time complexity of key sequence analysis algorithms and highlight their properties in a comprehensive manner, in order to identify potential opportunities for further research in algorithm or data structure optimization. The complexity aspect is poised to become pivotal as we will be facing challenges related to the continuous increase of genomic data on unprecedented scales and complexity in the foreseeable future, when robust biological simulation at the cell level and above becomes a reality.
Affiliation(s)
- Shakuntala Baichoo
- Department of Computer Science & Engineering, University of Mauritius, Réduit 80837, Mauritius.
| | - Christos A Ouzounis
- Biological Computation & Process Laboratory, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas, Thessalonica 57001, Greece.
| |
|
31
|
Abstract
Here, I argue that computational thinking and techniques are so central to the quest of understanding life that today all biology is computational biology. Computational biology brings order into our understanding of life, it makes biological concepts rigorous and testable, and it provides a reference map that holds together individual insights. The next modern synthesis in biology will be driven by mathematical, statistical, and computational methods being absorbed into mainstream biological training, turning biology into a quantitative science.
Affiliation(s)
- Florian Markowetz
- University of Cambridge, Cancer Research UK Cambridge Institute, Cambridge, United Kingdom
| |
|
32
|
Tian S, Yan H, Neuhauser C, Slager SL. An analytical workflow for accurate variant discovery in highly divergent regions. BMC Genomics 2016; 17:703. [PMID: 27590916 PMCID: PMC5010666 DOI: 10.1186/s12864-016-3045-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 05/18/2016] [Accepted: 08/25/2016] [Indexed: 02/07/2023] Open
Abstract
Background Current variant discovery methods often start with the mapping of short reads to a reference genome; yet, their performance deteriorates in genomic regions where the reads are highly divergent from the reference sequence. This is particularly problematic for the human leukocyte antigen (HLA) region on chromosome 6p21.3. This region is associated with over 100 diseases, but variant calling is hindered by the extreme divergence across different haplotypes. Results We simulated reads from chromosome 6 exonic regions over a wide range of sequence divergence and coverage depth. We systematically assessed combinations between five mappers and five callers for their performance on simulated data and exome-seq data from NA12878, a well-studied individual in which multiple public call sets have been generated. Among those combinations, the number of known SNPs differed by about 5 % in the non-HLA regions of chromosome 6 but over 20 % in the HLA region. Notably, GSNAP mapping combined with GATK UnifiedGenotyper calling identified about 20 % more known SNPs than most existing methods without a noticeable loss of specificity, with 100 % sensitivity in three highly polymorphic HLA genes examined. Much larger differences were observed among these combinations in INDEL calling from both non-HLA and HLA regions. We obtained similar results with our internal exome-seq data from a cohort of chronic lymphocytic leukemia patients. Conclusions We have established a workflow enabling variant detection, with high sensitivity and specificity, over the full spectrum of divergence seen in the human genome. Comparing to public call sets from NA12878 has highlighted the overall superiority of GATK UnifiedGenotyper, followed by GATK HaplotypeCaller and SAMtools, in SNP calling, and of GATK HaplotypeCaller and Platypus in INDEL calling, particularly in regions of high sequence divergence such as the HLA region. 
GSNAP and Novoalign are the ideal mappers in combination with the above callers. We expect that the proposed workflow should be applicable to variant discovery in other highly divergent regions.
Affiliation(s)
- Shulan Tian
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
| | - Huihuang Yan
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA
| | - Claudia Neuhauser
- Informatics Institute, University of Minnesota, Minneapolis, MN, 55455, USA
| | - Susan L Slager
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st St SW, Rochester, MN, 55905, USA.
| |
|
33
|
Swaminathan S, Sundaramurthi JC, Palaniappan AN, Narayanan S. Recent developments in genomics, bioinformatics and drug discovery to combat emerging drug-resistant tuberculosis. Tuberculosis (Edinb) 2016; 101:31-40. [PMID: 27865394 DOI: 10.1016/j.tube.2016.08.002] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 02/24/2016] [Revised: 05/21/2016] [Accepted: 08/08/2016] [Indexed: 11/16/2022]
Abstract
Emergence of drug-resistant tuberculosis (DR-TB) is a big challenge in TB control. The delay in diagnosis of DR-TB leads to its increased transmission, and therefore prevalence. Recent developments in genomics have enabled whole genome sequencing (WGS) of Mycobacterium tuberculosis (M. tuberculosis) from 3-day-old liquid culture and directly from uncultured sputa, while new bioinformatics tools facilitate to determine DR mutations rapidly from the resulting sequences. The present drug discovery and development pipeline is filled with candidate drugs which have shown efficacy against DR-TB. Furthermore, some of the FDA-approved drugs are being evaluated for repurposing, and this approach appears promising as several drugs are reported to enhance efficacy of the standard TB drugs, reduce drug tolerance, or modulate the host immune response to control the growth of intracellular M. tuberculosis. Recent developments in genomics and bioinformatics along with new drug discovery collectively have the potential to result in synergistic impact leading to the development of a rapid protocol to determine the drug resistance profile of the infecting strain so as to provide personalized medicine. Hence, in this review, we discuss recent developments in WGS, bioinformatics and drug discovery to perceive how they would transform the management of tuberculosis in a timely manner.
Affiliation(s)
- Soumya Swaminathan
- National Institute for Research in Tuberculosis (ICMR), Chetpet, Chennai, 600031, India.
| | - Jagadish Chandrabose Sundaramurthi
- Division of Biomedical Informatics, Department of Clinical Research, National Institute for Research in Tuberculosis (ICMR), Chetpet, Chennai, 600031, India
| | - Alangudi Natarajan Palaniappan
- Department of Clinical Research, National Institute for Research in Tuberculosis (ICMR), Chetpet, Chennai, 600031, India
| | - Sujatha Narayanan
- Department of Immunology, National Institute for Research in Tuberculosis (ICMR), Chetpet, Chennai, 600031, India
| |
|
34
|
Cammen KM, Andrews KR, Carroll EL, Foote AD, Humble E, Khudyakov JI, Louis M, McGowen MR, Olsen MT, Van Cise AM. Genomic Methods Take the Plunge: Recent Advances in High-Throughput Sequencing of Marine Mammals. J Hered 2016; 107:481-95. [PMID: 27511190 DOI: 10.1093/jhered/esw044] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 05/09/2016] [Accepted: 07/12/2016] [Indexed: 12/18/2022] Open
Abstract
The dramatic increase in the application of genomic techniques to non-model organisms (NMOs) over the past decade has yielded numerous valuable contributions to evolutionary biology and ecology, many of which would not have been possible with traditional genetic markers. We review this recent progression with a particular focus on genomic studies of marine mammals, a group of taxa that represent key macroevolutionary transitions from terrestrial to marine environments and for which available genomic resources have recently undergone notable rapid growth. Genomic studies of NMOs utilize an expanding range of approaches, including whole genome sequencing, restriction site-associated DNA sequencing, array-based sequencing of single nucleotide polymorphisms and target sequence probes (e.g., exomes), and transcriptome sequencing. These approaches generate different types and quantities of data, and many can be applied with limited or no prior genomic resources, thus overcoming one traditional limitation of research on NMOs. Within marine mammals, such studies have thus far yielded significant contributions to the fields of phylogenomics and comparative genomics, as well as enabled investigations of fitness, demography, and population structure. Here we review the primary options for generating genomic data, introduce several emerging techniques, and discuss the suitability of each approach for different applications in the study of NMOs.
Affiliation(s)
- Kristina M Cammen
- From the School of Marine Sciences, University of Maine, Orono, ME 04469 (Cammen); Department of Fish and Wildlife Sciences, University of Idaho, 875 Perimeter Drive MS 1136, Moscow, ID 83844-1136 (Andrews); Scottish Oceans Institute, University of St Andrews, East Sands, St Andrews, Fife KY16 8LB, UK (Carroll and Louis); Computational and Molecular Population Genetics Lab, Institute of Ecology and Evolution, University of Bern, Bern CH-3012, Switzerland (Foote); Department of Animal Behaviour, University of Bielefeld, Postfach 100131, 33501 Bielefeld, Germany (Humble); British Antarctic Survey, High Cross, Madingley Road, Cambridge CB3 0ET, UK (Humble); Department of Biology, Sonoma State University, Rohnert Park, CA 94928 (Khudyakov); School of Biological and Chemical Sciences, Queen Mary University of London, Mile End Road, London E1 4NS, UK (McGowen); Evolutionary Genomics Section, Natural History Museum of Denmark, University of Copenhagen, DK-1353 Copenhagen K, Denmark (Olsen); and Scripps Institution of Oceanography, University of California San Diego, 8622 Kennel Way, La Jolla, CA 92037 (Van Cise).
| | - Kimberly R Andrews
| | - Emma L Carroll
| | - Andrew D Foote
| | - Emily Humble
| | - Jane I Khudyakov
| | - Marie Louis
| | - Michael R McGowen
| | - Morten Tange Olsen
| | - Amy M Van Cise
| |
|
35
|
Kim J, Maeng JH, Lim JS, Son H, Lee J, Lee JH, Kim S. Vecuum: identification and filtration of false somatic variants caused by recombinant vector contamination. Bioinformatics 2016; 32:3072-3080. [PMID: 27334474 DOI: 10.1093/bioinformatics/btw383] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 04/10/2016] [Accepted: 06/14/2016] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Advances in sequencing technologies have remarkably lowered the detection limit of somatic variants to a low frequency. However, calling mutations at this range is still confounded by many factors including environmental contamination. Vector contamination is a continuously occurring issue and is especially problematic since vector inserts are hardly distinguishable from the sample sequences. Such inserts, which may harbor polymorphisms and engineered functional mutations, can result in calling false variants at corresponding sites. Numerous vector-screening methods have been developed, but none could handle contamination from inserts because they focus on vector backbone sequences alone. RESULTS We developed a novel method, Vecuum, that identifies vector-originated reads and resultant false variants. Since vector inserts are generally constructed from intron-less cDNAs, Vecuum identifies vector-originated reads by inspecting the clipping patterns at exon junctions. False variant calls are further detected based on the biased distribution of mutant alleles to vector-originated reads. Tests on simulated and spike-in experimental data validated that Vecuum could detect 93% of vector contaminants and could remove up to 87% of variant-like false calls with 100% precision. Application to public sequence datasets demonstrated the utility of Vecuum in detecting false variants resulting from various types of external contamination. AVAILABILITY AND IMPLEMENTATION Java-based implementation of the method is available at http://vecuum.sourceforge.net/ CONTACT: [email protected] SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
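The clipping-pattern idea in the abstract above can be illustrated with a short sketch. Because a cDNA insert has no introns, a contaminating read that crosses an exon-exon junction can only align to the genome up to the exon boundary, and the remainder is soft-clipped. The code below is not Vecuum's implementation; it is a hypothetical illustration that assumes 0-based, half-open exon coordinates on a single chromosome, a SAM-style CIGAR string with soft clips only at the very ends (hard clips are not handled), and an arbitrary `min_clip` threshold.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_span(pos, cigar):
    """Return (start, end, left_clip, right_clip) for an alignment.

    pos is the 0-based reference start; end is exclusive. Only soft clips (S)
    at the first or last CIGAR position are counted as clipping.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Operations that consume reference bases: M, D, N, = and X.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return pos, pos + ref_len, left, right


def clipped_at_exon_junction(pos, cigar, exons, min_clip=5):
    """Flag a read whose soft clip begins exactly at an exon boundary."""
    start, end, left, right = aligned_span(pos, cigar)
    exon_starts = {s for s, _ in exons}
    exon_ends = {e for _, e in exons}
    left_hit = left >= min_clip and start in exon_starts
    right_hit = right >= min_clip and end in exon_ends
    return left_hit or right_hit


# Hypothetical two-exon gene model and reads, for illustration only.
exons = [(100, 200), (300, 400)]
print(clipped_at_exon_junction(150, "50M20S", exons))  # clip starts at exon end 200
print(clipped_at_exon_junction(130, "70M", exons))     # fully aligned, no clip
```

A single flagged read is weak evidence on its own, since soft clips also arise from adapters and structural variants; the abstract describes a further step that looks at how mutant alleles are distributed across the vector-originated reads.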
Affiliation(s)
- Junho Kim
  - Severance Biomedical Science Institute, Brain Korea 21 PLUS Project for Medical Sciences, Yonsei University College of Medicine, Seoul 03722, South Korea
- Ju Heon Maeng
  - Severance Biomedical Science Institute, Brain Korea 21 PLUS Project for Medical Sciences, Yonsei University College of Medicine, Seoul 03722, South Korea
- Jae Seok Lim
  - Graduate School of Medical Science and Engineering, KAIST, Daejeon 34141, South Korea
- Hyeonju Son
  - Severance Biomedical Science Institute, Brain Korea 21 PLUS Project for Medical Sciences, Yonsei University College of Medicine, Seoul 03722, South Korea
- Junehawk Lee
  - Department of Convergence Technology Research, Korea Institute of Science and Technology Information, Daejeon 34141, South Korea
- Jeong Ho Lee
  - Graduate School of Medical Science and Engineering, KAIST, Daejeon 34141, South Korea
- Sangwoo Kim
  - Severance Biomedical Science Institute, Brain Korea 21 PLUS Project for Medical Sciences, Yonsei University College of Medicine, Seoul 03722, South Korea
|
36
|
De Novo Assembly of Human Herpes Virus Type 1 (HHV-1) Genome, Mining of Non-Canonical Structures and Detection of Novel Drug-Resistance Mutations Using Short- and Long-Read Next Generation Sequencing Technologies. PLoS One 2016; 11:e0157600. [PMID: 27309375 PMCID: PMC4910999 DOI: 10.1371/journal.pone.0157600] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 02/14/2016] [Accepted: 05/31/2016] [Indexed: 02/01/2023]
Abstract
Human herpesvirus type 1 (HHV-1) has a large double-stranded DNA genome of approximately 152 kbp that is structurally complex and GC-rich. This makes the assembly of HHV-1 whole genomes from short-read sequencing data technically challenging. To improve the assembly of HHV-1 genomes we have employed a hybrid genome assembly protocol using data from two sequencing technologies: the short-read Roche 454 and the long-read Oxford Nanopore MinION sequencers. We sequenced 18 HHV-1 cell culture-isolated clinical specimens collected from immunocompromised patients undergoing antiviral therapy. The susceptibility of the samples to several antivirals was determined by plaque reduction assay. Hybrid genome assembly resulted in a decrease in the number of contigs in 6 out of 7 samples and an increase in N(G)50 and N(G)75 of all 7 samples sequenced by both technologies. The approach also enhanced the detection of non-canonical contigs including a rearrangement between the unique (UL) and repeat (T/IRL) sequence regions of one sample that was not detectable by assembly of 454 reads alone. We detected several known and novel resistance-associated mutations in UL23 and UL30 genes. Genome-wide genetic variability ranged from <1% to 53% of amino acids in each gene exhibiting at least one substitution within the pool of samples. The UL23 gene had one of the highest genetic variabilities at 35.2% in keeping with its role in development of drug resistance. The assembly of accurate, full-length HHV-1 genomes will be useful in determining genetic determinants of drug resistance, virulence, pathogenesis and viral evolution. The numerous, complex repeat regions of the HHV-1 genome currently remain a barrier towards this goal.
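The N(G)50 and N(G)75 contiguity statistics mentioned above can be computed as in the following Python sketch. It uses the common definitions (N50 relative to the total assembly length, NG50 relative to an estimated genome size); the function name `nx` and the contig lengths in the usage note are hypothetical and are not taken from the study.

```python
def nx(contig_lengths, fraction=0.5, genome_size=None):
    """Return the Nx (or NGx) contiguity statistic.

    The result is the length L of the shortest contig in the smallest set of
    longest contigs whose combined length reaches `fraction` of the total.
    With genome_size=None the total is the assembly length (N50, N75, ...);
    with an estimated genome size it is the NG variant (NG50, NG75, ...).
    Returns 0 if the contigs never reach the target length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = genome_size if genome_size is not None else sum(lengths)
    target = fraction * total
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    return 0
```

For example, with contigs of 80, 70, 50, 40, 30 and 20 kbp, `nx(lengths)` gives the N50 and `nx(lengths, 0.75)` the N75; passing `genome_size=152_000` (the approximate HHV-1 genome size quoted in the abstract) would give the corresponding NG values for contig lengths expressed in bp.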
|
37
|
Zhang NR, Yakir B, Xia LC, Siegmund D. Scan statistics on Poisson random fields with applications in genomics. Ann Appl Stat 2016. [DOI: 10.1214/15-aoas892] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 12/20/2022]
|
38
|
Abstract
Next-generation sequencing (NGS) technologies have rapidly evolved in the last 5 years, leading to the generation of millions of short reads in a single run. Consequently, various sequence alignment algorithms have been developed to compare these reads to an appropriate reference in order to perform important downstream analysis. SOAP2 from the SOAP series is one of the most commonly used alignment programs to handle NGS data, and it efficiently does so using low computer memory usage and fast alignment speed. This chapter describes the protocol used to align short reads to a reference genome using SOAP2, and highlights the significance of using the in-built command-line options to tune the behavior of the algorithm according to the inputs and the desired results.
|
39
|
Jung H, Yoon BH, Kim WJ, Kim DW, Hurwood DA, Lyons RE, Salin KR, Kim HS, Baek I, Chand V, Mather PB. Optimizing Hybrid de Novo Transcriptome Assembly and Extending Genomic Resources for Giant Freshwater Prawns (Macrobrachium rosenbergii): The Identification of Genes and Markers Associated with Reproduction. Int J Mol Sci 2016; 17:ijms17050690. [PMID: 27164098 PMCID: PMC4881516 DOI: 10.3390/ijms17050690] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 03/17/2016] [Revised: 04/27/2016] [Accepted: 04/29/2016] [Indexed: 11/29/2022]
Abstract
The giant freshwater prawn, Macrobrachium rosenbergii, a sexually dimorphic decapod crustacean is currently the world’s most economically important cultured freshwater crustacean species. Despite its economic importance, there is currently a lack of genomic resources available for this species, and this has limited exploration of the molecular mechanisms that control the M. rosenbergii sex-differentiation system more widely in freshwater prawns. Here, we present the first hybrid transcriptome from M. rosenbergii applying RNA-Seq technologies directed at identifying genes that have potential functional roles in reproductive-related traits. A total of 13,733,210 combined raw reads (1720 Mbp) were obtained from Ion-Torrent PGM and 454 FLX. Bioinformatic analyses based on three state-of-the-art assemblers, the CLC Genomic Workbench, Trans-ABySS, and Trinity, that use single and multiple k-mer methods respectively, were used to analyse the data. The influence of multiple k-mers on assembly performance was assessed to gain insight into transcriptome assembly from short reads. After optimisation, de novo assembly resulted in 44,407 contigs with a mean length of 437 bp, and the assembled transcripts were further functionally annotated to detect single nucleotide polymorphisms and simple sequence repeat motifs. Gene expression analysis was also used to compare expression patterns from ovary and testis tissue libraries to identify genes with potential roles in reproduction and sex differentiation. The large transcript set assembled here represents the most comprehensive set of transcriptomic resources ever developed for reproduction traits in M. rosenbergii, and the large number of genetic markers predicted should constitute an invaluable resource for future genetic research studies on M. rosenbergii and can be applied more widely on other freshwater prawn species in the genus Macrobrachium.
Affiliation(s)
- Hyungtaek Jung
  - Centre for Tropical Crops and Biocommodities, Science and Engineering Faculty, Queensland University of Technology, Queensland 4000, Australia.
- Byung-Ha Yoon
  - Korean Bioinformation Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon 305806, Korea.
  - Department of Bioinformatics, University of Science and Technology, Daejeon 305333, Korea.
- Woo-Jin Kim
  - Biotechnology Research Division, National Institute of Fisheries Science, Busan 46083, Korea.
- Dong-Wook Kim
  - All Bio Technology Co., LTD, Internet Business Incubation Center, Mokweon University, Daejeon 302729, Korea.
- David A Hurwood
  - Earth, Environmental and Biological Sciences, Science and Engineering Faculty, Queensland University of Technology, Queensland 4000, Australia.
- Russell E Lyons
  - School of Veterinary Science, University of Queensland, Queensland 4067, Australia.
- Krishna R Salin
  - School of Environment, Resources and Development, Asian Institute of Technology, Pathumthani 12120, Thailand.
- Heui-Soo Kim
  - Department of Biological Sciences, College of Natural Sciences, Pusan National University, Busan 609735, Korea.
- Ilseon Baek
  - Division of Marine Technology, Chonnam National University, Yeosu 550250, Korea.
- Vincent Chand
  - Earth, Environmental and Biological Sciences, Science and Engineering Faculty, Queensland University of Technology, Queensland 4000, Australia.
- Peter B Mather
  - Earth, Environmental and Biological Sciences, Science and Engineering Faculty, Queensland University of Technology, Queensland 4000, Australia.
|
40
|
Zukurov JP, do Nascimento-Brito S, Volpini AC, Oliveira GC, Janini LMR, Antoneli F. Estimation of genetic diversity in viral populations from next generation sequencing data with extremely deep coverage. Algorithms Mol Biol 2016; 11:2. [PMID: 26973707 PMCID: PMC4788855 DOI: 10.1186/s13015-016-0064-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/25/2015] [Accepted: 02/25/2016] [Indexed: 12/16/2022]
Abstract
Background In this paper we propose a method and discuss its computational implementation as an integrated tool for the analysis of viral genetic diversity on data generated by high-throughput sequencing. The main motivation for this work is to better understand the genetic diversity of viruses with high rates of nucleotide substitution, such as HIV-1 and influenza. Most methods for viral diversity estimation proposed so far are intended to benefit from the longer reads produced by some next-generation sequencing platforms in order to estimate a population of haplotypes which represent the diversity of the original population. The method proposed here is custom-made to take advantage of the very low error rate and extremely deep coverage per site, which are the main features of some neglected technologies that have not received much attention due to the short length of their reads, which precludes haplotype estimation. This approach allowed us to avoid some hard problems related to haplotype reconstruction (need for long reads, preliminary error filtering and assembly). Results We propose to measure genetic diversity of a viral population through a family of multinomial probability distributions indexed by the sites of the virus genome, each one representing the distribution of nucleic bases per site. Moreover, the implementation of the method focuses on two main optimization strategies: a read mapping/alignment procedure that aims at the recovery of the maximum possible number of short reads, and the inference of the multinomial parameters in a Bayesian framework with smoothed Dirichlet estimation. The Bayesian approach provides conditional probability distributions for the multinomial parameters, allowing one to take into account the prior information of the control experiment and providing a natural way to separate signal from noise, since it automatically furnishes Bayesian confidence intervals and thus avoids the drawbacks of preliminary error filtering.
Conclusions The methods described in this paper have been implemented as an integrated tool called Tanden (Tool for Analysis of Diversity in Viral Populations) and successfully tested on samples obtained from HIV-1 strain NL4-3 (group M, subtype B) cultivations on primary human cell cultures in many distinct viral propagation conditions. Tanden is written in C# (Microsoft), runs on the Windows operating system, and can be downloaded from: http://tanden.url.ph/.
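As a rough illustration of per-site multinomial parameters with Dirichlet smoothing, the Python sketch below computes the posterior mean base frequencies at a single site from read counts and a Dirichlet prior. This is the textbook Dirichlet-multinomial update, not Tanden's actual code (Tanden is written in C#); the function name and the uniform default prior are assumptions made for the example.

```python
BASES = "ACGT"


def posterior_base_frequencies(counts, prior=None):
    """Posterior mean of the per-site base distribution.

    counts -- dict mapping base -> number of reads showing that base at a site
    prior  -- dict mapping base -> Dirichlet pseudo-count (assumed uniform,
              one pseudo-count per base, when not given)

    With a Dirichlet(alpha) prior and multinomial counts n, the posterior is
    Dirichlet(alpha + n), whose mean is (n_b + alpha_b) / (N + sum(alpha)).
    """
    if prior is None:
        prior = {b: 1.0 for b in BASES}
    posterior = {b: counts.get(b, 0) + prior[b] for b in BASES}
    total = sum(posterior.values())
    return {b: posterior[b] / total for b in BASES}
```

In the spirit of the abstract, the prior pseudo-counts could in principle be derived from a control experiment rather than set uniformly, so that bases seen only at the level expected from sequencing error are not reported as real diversity; how Tanden does this is not specified here.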
|
41
|
Agrawal S, Ganley ARD. Complete Sequence Construction of the Highly Repetitive Ribosomal RNA Gene Repeats in Eukaryotes Using Whole Genome Sequence Data. Methods Mol Biol 2016; 1455:161-181. [PMID: 27576718 DOI: 10.1007/978-1-4939-3792-9_13] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 06/06/2023]
Abstract
The ribosomal RNA genes (rDNA) encode the major rRNA species of the ribosome, and thus are essential across life. These genes are highly repetitive in most eukaryotes, forming blocks of tandem repeats that form the core of nucleoli. The primary role of the rDNA in encoding rRNA has been long understood, but more recently the rDNA has been implicated in a number of other important biological phenomena, including genome stability, cell cycle, and epigenetic silencing. Noncoding elements, primarily located in the intergenic spacer region, appear to mediate many of these phenomena. Although sequence information is available for the genomes of many organisms, in almost all cases rDNA repeat sequences are lacking, primarily due to problems in assembling these intriguing regions during whole genome assemblies. Here, we present a method to obtain complete rDNA repeat unit sequences from whole genome assemblies. Limitations of next generation sequencing (NGS) data make them unsuitable for assembling complete rDNA unit sequences; therefore, the method we present relies on the use of Sanger whole genome sequence data. Our method makes use of the Arachne assembler, which can assemble highly repetitive regions such as the rDNA in a memory-efficient way. We provide a detailed step-by-step protocol for generating rDNA sequences from whole genome Sanger sequence data using Arachne, for refining complete rDNA unit sequences, and for validating the sequences obtained. In principle, our method will work for any species where the rDNA is organized into tandem repeats. This will help researchers working on species without a complete rDNA sequence, those working on evolutionary aspects of the rDNA, and those interested in conducting phylogenetic footprinting studies with the rDNA.
Affiliation(s)
- Saumya Agrawal
  - Institute of Natural and Mathematical Sciences, Massey University, Private Bag 102-904, Auckland, 0632, New Zealand.
  - School of Biological Sciences, University of Auckland, Auckland, New Zealand.
- Austen R D Ganley
  - Institute of Natural and Mathematical Sciences, Massey University, Private Bag 102-904, Auckland, 0632, New Zealand.
  - School of Biological Sciences, University of Auckland, Private Bag 92019, Auckland, 1142, New Zealand.
|
42
|
Roth W, Hecker D, Fava E. Systems Biology Approaches to the Study of Biological Networks Underlying Alzheimer's Disease: Role of miRNAs. Methods Mol Biol 2016; 1303:349-377. [PMID: 26235078 DOI: 10.1007/978-1-4939-2627-5_21] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Indexed: 06/04/2023]
Abstract
MicroRNAs (miRNAs) are emerging as significant regulators of mRNA complexity in the human central nervous system (CNS) thereby controlling distinct gene expression profiles in a spatio-temporal manner during development, neuronal plasticity, aging and (age-related) neurodegeneration, including Alzheimer's disease (AD). Increasing effort is expended towards dissecting and deciphering the molecular and genetic mechanisms of neurobiological and pathological functions of these brain-enriched miRNAs. Along these lines, recent data pinpoint distinct miRNAs and miRNA networks being linked to APP splicing, processing and Aβ pathology (Lukiw et al., Front Genet 3:327, 2013), and furthermore, to the regulation of tau and its cellular subnetworks (Lau et al., EMBO Mol Med 5:1613, 2013), altogether underlying the onset and propagation of Alzheimer's disease. MicroRNA profiling studies in Alzheimer's disease suffer from poor consensus which is an acknowledged concern in the field, and constitutes one of the current technical challenges. Hence, a strong demand for experimental and computational systems biology approaches arises, to incorporate and integrate distinct levels of information and scientific knowledge into a complex system of miRNA networks in the context of the transcriptome, proteome and metabolome in a given cellular environment. Here, we will discuss the state-of-the-art technologies and computational approaches on hand that may lead to a deeper understanding of the complex biological networks underlying the pathogenesis of Alzheimer's disease.
Affiliation(s)
- Wera Roth
  - German Center for Neurodegenerative Diseases (DZNE), Ludwig-Erhard-Allee 2, 53175, Bonn, Germany
|
43
|
Abstract
A next-generation sequencing experiment can generate billions of short reads for each sample, and processing the raw reads adds further information. Various file formats have been introduced and developed to store and manipulate this information. This chapter presents an overview of the file formats commonly used in the analysis of next-generation sequencing data, including FASTQ, FASTA, SAM/BAM, GFF/GTF, BED, and VCF.
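To make one of these formats concrete, here is a small Python sketch of a FASTQ reader. It assumes the common four-line record layout (header starting with `@`, sequence, separator starting with `+`, quality string) without line wrapping, and Phred+33 quality encoding; the names `FastqRecord`, `read_fastq` and `phred_scores` are illustrative and do not come from the chapter.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.rstrip("\n")
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nII5!\n")
    for record in read_fastq(example):
        print(record.name, record.sequence, phred_scores(record.quality))
```

Real pipelines usually rely on an established parser, since FASTQ files in the wild may be compressed, wrapped across lines, or use other quality offsets.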
Affiliation(s)
- Hongen Zhang
  - Center for Cancer Research, National Cancer Institute, National Institutes of Health, 37 Convent Drive, Room 6138, Bethesda, MD, 20892, USA.
|
44
|
Zhang S, Bian Y, Zhang Z, Zheng H, Wang Z, Zha L, Cai J, Gao Y, Ji C, Hou Y, Li C. Parallel Analysis of 124 Universal SNPs for Human Identification by Targeted Semiconductor Sequencing. Sci Rep 2015; 5:18683. [PMID: 26691610 PMCID: PMC4687036 DOI: 10.1038/srep18683] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 09/01/2015] [Accepted: 11/23/2015] [Indexed: 12/20/2022]
Abstract
SNPs, which are abundant in the human genome and have a lower mutation rate, are attractive for genetic applications such as forensic, anthropological and evolutionary studies. Universal SNPs showing little allelic frequency variation among populations while remaining highly informative for human identification were obtained from previous studies. However, genotyping tools target only dozens of markers simultaneously, limiting their applications. Here, 124 SNPs were tested simultaneously using Ampliseq technology on the Ion Torrent PGM platform. A concordance study between NGS and Sanger sequencing was performed with two reference samples, 9947A and 9948. Full concordance was obtained except for the genotype of rs576261 in 9947A. The parameter FMAR (%) was introduced for NGS data analysis for the first time and used to evaluate allelic performance, sensitivity testing and mixture testing. FMAR values should range from 50% to 60% for accurate heterozygotes and should be above 90% for homozygotes or Y-SNPs. The SNPs rs7520386, rs4530059, rs214955, rs1523537, rs2342747, rs576261 and rs12997453 were recognized as poorly performing loci, with either allelic imbalance or lower coverage. Sensitivity testing demonstrated that all correct genotypes were obtained with DNA amounts ranging from 10 ng to 0.5 ng. For mixture testing, a clear linear correlation (R² = 0.9429) between the expected and observed FMAR values of mixtures was observed.
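The FMAR thresholds quoted in this abstract can be turned into a small Python sketch. The abstract does not define FMAR; the code assumes it is the percentage of reads at a locus that carry the most frequent (major) allele, which should be checked against the paper. The function names and the "ambiguous" label for values outside the two stated ranges are hypothetical.

```python
def fmar(allele_read_counts):
    """Percentage of reads supporting the major allele at one SNP.

    allele_read_counts -- dict mapping allele -> read count at the locus.
    Assumes FMAR = 100 * (reads of most frequent allele) / (all reads).
    """
    total = sum(allele_read_counts.values())
    if total == 0:
        raise ValueError("no reads cover this locus")
    return 100.0 * max(allele_read_counts.values()) / total


def classify_genotype(fmar_percent):
    """Apply the ranges stated in the abstract to an FMAR value."""
    if fmar_percent > 90:
        return "homozygote"      # also the expected range for Y-SNPs
    if 50 <= fmar_percent <= 60:
        return "heterozygote"
    return "ambiguous"           # e.g. allelic imbalance or a mixture
```

For instance, a locus with 55 reads of A and 45 reads of G has an FMAR of 55% and would be called a heterozygote, whereas 75% falls between the two ranges and would need closer inspection.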
Affiliation(s)
- Suhua Zhang
  - Shanghai Key Laboratory of Forensic Medicine, Institute of Forensic Sciences, Ministry of Justice, P.R. China, Shanghai 200063, P.R. China
  - State Key Laboratory of Genetic Engineering, Institute of Genetics, School of Life Sciences, Fudan University, Shanghai 200433, P.R. China
- Yingnan Bian
  - Shanghai Key Laboratory of Forensic Medicine, Institute of Forensic Sciences, Ministry of Justice, P.R. China, Shanghai 200063, P.R. China
- Zheren Zhang
  - Invitrogen Trading (Shanghai) Co., LTD, Shanghai 200050, P.R. China
- Hancheng Zheng
  - Invitrogen Trading (Shanghai) Co., LTD, Shanghai 200050, P.R. China
- Zheng Wang
  - Shanghai Key Laboratory of Forensic Medicine, Institute of Forensic Sciences, Ministry of Justice, P.R. China, Shanghai 200063, P.R. China
- Lagabaiyila Zha
  - Department of Forensic Science, School of Basic Medical Sciences, Central South University, Changsha 410013, P.R. China
- Jifeng Cai
  - Department of Forensic Science, School of Basic Medical Sciences, Central South University, Changsha 410013, P.R. China
- Yuzhen Gao
  - Department of Forensic Medicine, Medical College of Soochow University, Suzhou 215123, P.R. China
- Chaoneng Ji
  - State Key Laboratory of Genetic Engineering, Institute of Genetics, School of Life Sciences, Fudan University, Shanghai 200433, P.R. China
- Yiping Hou
  - Department of Forensic Genetics, West China School of Preclinical and Forensic Medicine, Sichuan University, Chengdu 610041, P.R. China
- Chengtao Li
  - Shanghai Key Laboratory of Forensic Medicine, Institute of Forensic Sciences, Ministry of Justice, P.R. China, Shanghai 200063, P.R. China
|
45
|
Ye H, Meehan J, Tong W, Hong H. Alignment of Short Reads: A Crucial Step for Application of Next-Generation Sequencing Data in Precision Medicine. Pharmaceutics 2015; 7:523-41. [PMID: 26610555 PMCID: PMC4695832 DOI: 10.3390/pharmaceutics7040523] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 10/21/2015] [Revised: 11/14/2015] [Accepted: 11/17/2015] [Indexed: 02/06/2023]
Abstract
Precision medicine or personalized medicine has been proposed as a modernized and promising medical strategy. Genetic variants of patients are the key information for implementation of precision medicine. Next-generation sequencing (NGS) is an emerging technology for deciphering genetic variants. Alignment of raw reads to a reference genome is one of the key steps in NGS data analysis. Many algorithms have been developed for alignment of short read sequences since 2008. Users have to decide which alignment algorithm to use in their studies. This selection involves not only the alignment algorithm itself but also the set of suitable parameters to be used by it. Understanding these algorithms helps in selecting the appropriate alignment algorithm for different applications in precision medicine. Here, we review currently available algorithms and their major strategies such as seed-and-extend and q-gram filter. We also discuss the challenges in current alignment algorithms, including alignment in multiple repeated regions, long-read alignment and alignment facilitated with known genetic variants.
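The seed-and-extend strategy named in this abstract can be sketched in a few lines of Python. This is a deliberately simple illustration (exact k-mer seeds from a hash index, mismatch-only verification, no gaps), not the method of any particular aligner covered by the review; the function names, the seed length and the toy sequences are assumptions.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return sorted (start, mismatches) pairs for ungapped placements.

    Seeding: non-overlapping k-mers of the read are looked up exactly in the
    index; each hit proposes a candidate start position for the whole read.
    Extension: each candidate is verified by counting mismatches over the
    full read length (a real aligner would use gapped dynamic programming).
    """
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits
```

Because the seeds do not overlap, a read with e mismatches still contains at least one exact seed whenever it is split into more than e seeds, which is why the seed length and count trade sensitivity against the number of candidates that must be verified.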
Affiliation(s)
- Hao Ye
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
- Joe Meehan
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
- Weida Tong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
- Huixiao Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA.
|
46
|
Zhu X, Leung HCM, Wang R, Chin FYL, Yiu SM, Quan G, Li Y, Zhang R, Jiang Q, Liu B, Dong Y, Zhou G, Wang Y. misFinder: identify mis-assemblies in an unbiased manner using reference and paired-end reads. BMC Bioinformatics 2015; 16:386. [PMID: 26573684 PMCID: PMC4647709 DOI: 10.1186/s12859-015-0818-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 04/25/2015] [Accepted: 11/06/2015] [Indexed: 11/10/2022]
Abstract
Background Because of the short read length of high-throughput sequencing data, assembly errors are introduced in genome assembly, which may have an adverse impact on downstream data analysis. Several tools have been developed to eliminate these errors by either 1) comparing the assembled sequences with a similar reference genome, or 2) analyzing paired-end reads aligned to the assembled sequences and determining inconsistent features along mis-assembled sequences. However, the former approach cannot distinguish real structural variations between the target genome and the reference genome, while the latter approach can produce many false positive detections (correctly assembled sequences being considered mis-assembled). Results We present misFinder, a tool that aims to identify assembly errors with high accuracy in an unbiased way and to correct these errors at their mis-assembled positions to improve the assembly accuracy for downstream analysis. It combines the information from a reference (or closely related reference) genome with paired-end reads aligned to the assembled sequence. Assembly errors and correct assemblies corresponding to structural variations can be detected by comparing the reference genome and the assembled sequence. Different types of assembly errors can then be distinguished within the mis-assembled sequence by analyzing the aligned paired-end reads using multiple features derived from coverage and the consistency of insert distance to obtain high-confidence error calls. Conclusions We tested the performance of misFinder on both simulated and real paired-end read data, and misFinder gave accurate error calls with only very few miscalls. We further compared misFinder with QUAST and REAPR. misFinder outperformed QUAST and REAPR by 1) identifying more true positive mis-assemblies with very few false positives and false negatives, and 2) distinguishing the correct assemblies corresponding to structural variations from mis-assembled sequences.
misFinder can be freely downloaded from https://github.com/hitbio/misFinder. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0818-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Xiao Zhu
  - College of Computer Sciences and Information Engineering, Harbin Normal University, Harbin, Heilongjiang, China.
  - Center for Bioinformatics, School of Computer Sciences and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
- Henry C M Leung
  - Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong, China.
- Rongjie Wang
  - Center for Bioinformatics, School of Computer Sciences and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
- Francis Y L Chin
  - Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong, China.
- Siu Ming Yiu
  - Department of Computer Science, University of Hong Kong, Pokfulam Road, Hong Kong, China.
- Guangri Quan
  - Center for Bioinformatics, School of Computer Sciences and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
- Yajie Li
  - The Fourth Affiliated Hospital of Harbin Medical University, Harbin, Heilongjiang, China.
- Rui Zhang
  - The Fourth Affiliated Hospital of Harbin Medical University, Harbin, Heilongjiang, China.
- Qinghua Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
- Bo Liu
  - Center for Bioinformatics, School of Computer Sciences and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
- Yucui Dong
  - Department of Immunology, Harbin Medical University, Harbin, Heilongjiang, China.
- Guohui Zhou
  - College of Computer Sciences and Information Engineering, Harbin Normal University, Harbin, Heilongjiang, China.
- Yadong Wang
  - Center for Bioinformatics, School of Computer Sciences and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China.
|
47
|
Han Y, Gao S, Muegge K, Zhang W, Zhou B. Advanced Applications of RNA Sequencing and Challenges. Bioinform Biol Insights 2015; 9:29-46. [PMID: 26609224 PMCID: PMC4648566 DOI: 10.4137/bbi.s28991] [Citation(s) in RCA: 130] [Impact Index Per Article: 13.0] [Received: 07/16/2015] [Revised: 09/30/2015] [Accepted: 10/02/2015] [Indexed: 12/18/2022]
Abstract
Next-generation sequencing technologies have revolutionized sequence-based research with the advantages of high throughput, high sensitivity, and high speed. RNA-seq is now widely used to uncover multiple facets of the transcriptome and to facilitate biological applications. However, the large-scale data analyses associated with RNA-seq pose challenges. In this study, we present a detailed overview of the applications of this technology and the challenges that need to be addressed, including data preprocessing, differential gene expression analysis, alternative splicing analysis, variant detection and allele-specific expression, pathway analysis, co-expression network analysis, and applications combining various experimental procedures beyond the achievements that have been made. Specifically, we discuss essential principles of computational methods that are required to meet the key challenges of RNA-seq data analyses, development of various bioinformatics tools, challenges associated with RNA-seq applications, and examples that represent the advances made so far in the characterization of the transcriptome.
Affiliation(s)
- Yixing Han
  - Mouse Cancer Genetics Program, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD, USA
- Shouguo Gao
  - Bioinformatics and Systems Biology Core, National Heart Lung Blood Institute, National Institutes of Health, Rockville Pike, Bethesda, MD, USA
- Kathrin Muegge
  - Mouse Cancer Genetics Program, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD, USA.
  - Leidos Biomedical Research, Inc., Basic Science Program, Frederick National Laboratory, Frederick, MD, USA
- Wei Zhang
  - Department of Medicine, University of California, San Diego, La Jolla, CA, USA
- Bing Zhou
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
|
48
|
Xin H, Nahar S, Zhu R, Emmons J, Pekhimenko G, Kingsford C, Alkan C, Mutlu O. Optimal seed solver: optimizing seed selection in read mapping. Bioinformatics 2015; 32:1632-42. [PMID: 26568624 DOI: 10.1093/bioinformatics/btv670] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 04/19/2015] [Accepted: 11/09/2015] [Indexed: 11/12/2022]
Abstract
MOTIVATION Optimizing seed selection is an important problem in read mapping. The number of non-overlapping seeds a mapper selects determines the sensitivity of the mapper while the total frequency of all selected seeds determines the speed of the mapper. Modern seed-and-extend mappers usually select seeds with either an equal and fixed-length scheme or with an inflexible placement scheme, both of which limit the ability of the mapper in selecting less frequent seeds to speed up the mapping process. Therefore, it is crucial to develop a new algorithm that can adjust both the individual seed length and the seed placement, as well as derive less frequent seeds. RESULTS We present the Optimal Seed Solver (OSS), a dynamic programming algorithm that discovers the least frequently-occurring set of x seeds in an L-base-pair read in [Formula: see text] operations on average and in [Formula: see text] operations in the worst case, while generating a maximum of [Formula: see text] seed frequency database lookups. We compare OSS against four state-of-the-art seed selection schemes and observe that OSS provides a 3-fold reduction in average seed frequency over the best previous seed selection optimizations. AVAILABILITY AND IMPLEMENTATION We provide an implementation of the Optimal Seed Solver in C++ at: https://github.com/CMU-SAFARI/Optimal-Seed-Solver. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- John Emmons
- Department of Computer Science and Engineering, Washington University, St. Louis, MO 63130, USA
- Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Can Alkan
- Department of Computer Engineering, Bilkent University, Bilkent, Ankara 06800, Turkey and
- Onur Mutlu
- Computer Science Department, Department of Electrical and Computer Engineering
|
49
|
Quek C, Jung CH, Bellingham SA, Lonie A, Hill AF. iSRAP - a one-touch research tool for rapid profiling of small RNA-seq data. J Extracell Vesicles 2015; 4:29454. [PMID: 26561006 PMCID: PMC4641893 DOI: 10.3402/jev.v4.29454] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 08/16/2015] [Revised: 10/12/2015] [Accepted: 10/14/2015] [Indexed: 12/23/2022] Open
Abstract
Small non-coding RNAs have been significantly recognized as the key modulators in many biological processes, and are emerging as promising biomarkers for several diseases. These RNA species are transcribed in cells and can be packaged in extracellular vesicles, which are small vesicles released from many biotypes, and are involved in intercellular communication. Currently, the advent of next-generation sequencing (NGS) technology for high-throughput profiling has further advanced the biological insights of non-coding RNA on a genome-wide scale and has become the preferred approach for the discovery and quantification of non-coding RNA species. Despite the routine practice of NGS, the processing of large data sets poses difficulty for analysis before conducting downstream experiments. Often, the current analysis tools are designed for specific RNA species, such as microRNA, and are limited in flexibility for modifying parameters for optimization. An analysis tool that allows for maximum control of different software is essential for drawing concrete conclusions for differentially expressed transcripts. Here, we developed a one-touch integrated small RNA analysis pipeline (iSRAP) research tool that is composed of widely used tools for rapid profiling of small RNAs. The performance test of iSRAP using publicly and in-house available data sets shows its ability of comprehensive profiling of small RNAs of various classes, and analysis of differentially expressed small RNAs. iSRAP offers comprehensive analysis of small RNA sequencing data that leverage informed decisions on the downstream analyses of small RNA studies, including extracellular vesicles such as exosomes.
Affiliation(s)
- Camelia Quek
- Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Melbourne, VIC, Australia
- Chol-Hee Jung
- Victorian Life Sciences Computation Initiative (VLSCI), The University of Melbourne, Melbourne, VIC, Australia
- Shayne A Bellingham
- Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Melbourne, VIC, Australia
- Andrew Lonie
- Victorian Life Sciences Computation Initiative (VLSCI), The University of Melbourne, Melbourne, VIC, Australia
- Andrew F Hill
- Department of Biochemistry and Molecular Biology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Melbourne, VIC, Australia; Department of Biochemistry and Genetics, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, VIC, Australia
|
50
|
Lee S, Min H, Yoon S. Will solid-state drives accelerate your bioinformatics? In-depth profiling, performance analysis and beyond. Brief Bioinform 2015; 17:713-27. [PMID: 26330577 DOI: 10.1093/bib/bbv073] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 05/27/2015] [Indexed: 11/12/2022] Open
Abstract
A wide variety of large-scale data have been produced in bioinformatics. In response, the need for efficient handling of biomedical big data has been partly met by parallel computing. However, the time demand of many bioinformatics programs still remains high for large-scale practical uses because of factors that hinder acceleration by parallelization. Recently, new generations of storage devices have emerged, such as NAND flash-based solid-state drives (SSDs), and with the renewed interest in near-data processing, they are increasingly becoming acceleration methods that can accompany parallel processing. In certain cases, a simple drop-in replacement of hard disk drives by SSDs results in dramatic speedup. Despite the various advantages and continuous cost reduction of SSDs, there has been little review of SSD-based profiling and performance exploration of important but time-consuming bioinformatics programs. For an informative review, we perform in-depth profiling and analysis of 23 key bioinformatics programs using multiple types of devices. Based on the insight we obtain from this research, we further discuss issues related to designing and optimizing bioinformatics algorithms and pipelines to fully exploit SSDs. The programs we profile cover traditional and emerging areas of importance, such as alignment, assembly, mapping, expression analysis, variant calling and metagenomics. We explain how acceleration by parallelization can be combined with SSDs for improved performance and also how using SSDs can expedite important bioinformatics pipelines, such as variant calling by the Genome Analysis Toolkit and transcriptome analysis using RNA sequencing. We hope that this review can provide useful directions and tips to accompany future bioinformatics algorithm design procedures that properly consider new generations of powerful storage devices.
|