1
Zhang Z, Zhang L, Zhang G, Zhao Z, Wang H, Ju F. Deduplication Improves Cost-Efficiency and Yields of De Novo Assembly and Binning of Shotgun Metagenomes in Microbiome Research. Microbiol Spectr 2023; 11:e0428222. [PMID: 36744896 PMCID: PMC10101064 DOI: 10.1128/spectrum.04282-22] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/03/2022] [Accepted: 01/18/2023] [Indexed: 02/07/2023]
Abstract
In the last decade, metagenomics has revolutionized the study of microbial communities. However, the presence of artificial duplicate reads, which arise mainly from the preparation of metagenomic DNA sequencing libraries, and their impacts on metagenomic assembly and binning have not previously been brought to attention. Here, we explicitly investigated the effects of duplicate reads on metagenomic assembly and binning based on analyses of five groups of representative metagenomes with distinct microbiome complexities. Our results showed that deduplication considerably increased the binning yields (by 3.5% to 80%) for most of the metagenomic data sets examined, thanks to the improved contig length and coverage profiling of metagenome-assembled contigs, whereas it slightly decreased the binning yields of metagenomes with low complexity (e.g., human gut metagenomes). Specifically, 411 versus 397, 331 versus 317, 104 versus 88, and 9 versus 5 metagenome-assembled genomes (MAGs) were recovered with versus without deduplication from MEGAHIT assemblies of bioreactor sludge, surface water, lake sediment, and forest soil metagenomes, respectively. Notably, deduplication significantly reduced the computational costs of metagenomic assembly, including the elapsed time (by 9.0% to 29.9%) and the maximum memory requirement (by 4.3% to 37.1%). Collectively, we recommend removing duplicate reads from metagenomes with high complexity, for example, the forest soil metagenomes examined in this study, before assembly and binning analyses.

IMPORTANCE: Duplicated reads in shotgun metagenomes are usually considered technical artifacts. Their presence in metagenomes would theoretically not only introduce bias into quantitative analysis but also distort the coverage profile, leading to adverse effects on or even failures in metagenomic assembly and binning, as the widely used metagenome assemblers and binners all need coverage information for graph partitioning and assembly binning, respectively. However, this issue has seldom been noticed, and its impacts on essential downstream bioinformatic procedures (e.g., assembly and binning) have remained unclear. In this study, we comprehensively evaluated for the first time the implications of duplicate reads for the de novo assembly and binning of real metagenomic data sets by comparing the assembly qualities, binning yields, and requirements for computational resources with and without the removal of duplicate reads. Deduplication considerably increased the binning yields of metagenomes with high complexity and significantly reduced the computational costs, including the elapsed time and the maximum memory requirement, for most of the metagenomes studied. These results provide empirical references for more cost-efficient metagenomic analyses in microbiome research.
Affiliation(s)
- Zhiguo Zhang
  - College of Environmental and Resources Sciences, Zhejiang University, Hangzhou, Zhejiang Province, China
  - Research Center for Industries of the Future, Key Laboratory of Coastal Environment and Resources of Zhejiang Province, School of Engineering, Westlake University, Hangzhou, Zhejiang Province, China
  - Institute of Advanced Technology, Westlake Institute for Advanced Study, Hangzhou, Zhejiang Province, China
- Lu Zhang
  - Research Center for Industries of the Future, Key Laboratory of Coastal Environment and Resources of Zhejiang Province, School of Engineering, Westlake University, Hangzhou, Zhejiang Province, China
  - Institute of Advanced Technology, Westlake Institute for Advanced Study, Hangzhou, Zhejiang Province, China
- Guoqing Zhang
  - Research Center for Industries of the Future, Key Laboratory of Coastal Environment and Resources of Zhejiang Province, School of Engineering, Westlake University, Hangzhou, Zhejiang Province, China
  - Institute of Advanced Technology, Westlake Institute for Advanced Study, Hangzhou, Zhejiang Province, China
- Ze Zhao
  - Research Center for Industries of the Future, Key Laboratory of Coastal Environment and Resources of Zhejiang Province, School of Engineering, Westlake University, Hangzhou, Zhejiang Province, China
  - Institute of Advanced Technology, Westlake Institute for Advanced Study, Hangzhou, Zhejiang Province, China
- Hui Wang
  - Research Center for Industries of the Future, Key Laboratory of Coastal Environment and Resources of Zhejiang Province, School of Engineering, Westlake University, Hangzhou, Zhejiang Province, China
  - Institute of Advanced Technology, Westlake Institute for Advanced Study, Hangzhou, Zhejiang Province, China
- Feng Ju
  - Research Center for Industries of the Future, Key Laboratory of Coastal Environment and Resources of Zhejiang Province, School of Engineering, Westlake University, Hangzhou, Zhejiang Province, China
  - Center of Synthetic Biology and Integrated Bioengineering, School of Engineering, Westlake University, Hangzhou, Zhejiang Province, China
  - Institute of Advanced Technology, Westlake Institute for Advanced Study, Hangzhou, Zhejiang Province, China
  - Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang Province, China
2
Nguyen HN, Cao NPT, Van Nguyen TC, Le KND, Nguyen DT, Nguyen QTT, Nguyen THT, Van Nguyen C, Le HT, Nguyen MLT, Nguyen TV, Tran VU, Luong BA, Le LGH, Ho QC, Pham HAT, Vo BT, Nguyen LT, Dang ATH, Nguyen SD, Do DM, Do TTT, Hoang AV, Dinh KT, Phan MD, Giang H, Tran LS. Liquid biopsy uncovers distinct patterns of DNA methylation and copy number changes in NSCLC patients with different EGFR-TKI resistant mutations. Sci Rep 2021; 11:16436. [PMID: 34385540 PMCID: PMC8361064 DOI: 10.1038/s41598-021-95985-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/31/2021] [Accepted: 07/31/2021] [Indexed: 01/19/2023]
Abstract
Targeted therapy with tyrosine kinase inhibitors (TKI) provides survival benefits to a majority of patients with non-small cell lung cancer (NSCLC). However, resistance to TKI almost always develops after treatment. Although genetic and epigenetic alterations have each been shown to drive resistance to TKI in cell line models, clinical evidence for their contribution to the acquisition of resistance remains limited. Here, we employed liquid biopsy for simultaneous analysis of genetic and epigenetic changes in 122 Vietnamese NSCLC patients undergoing TKI therapy and displaying acquired resistance. We detected multiple profiles of resistance mutations in 51 patients (41.8%). Of those, cases with genetic alterations in EGFR, particularly EGFR amplification (n = 6), showed pronounced genome instability and genome-wide hypomethylation. Interestingly, the level of hypomethylation was associated with the duration of response to TKI treatment. We also detected hypermethylation in regulatory regions of Homeobox genes, which are known to be involved in tumor differentiation. In contrast, such changes were not observed in cases with MET (n = 4) and HER2 (n = 4) amplification. Thus, our study showed that liquid biopsy could provide important insights into the heterogeneity of TKI resistance mechanisms in NSCLC patients, providing essential information for prediction of resistance and selection of subsequent treatment.
Affiliation(s)
- Hoai-Nghia Nguyen
  - University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
- Ha Thu Le
  - Ha Noi Oncology Hospital, Ha Noi, Vietnam
- Vu Uyen Tran
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Bac An Luong
  - University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
- Linh Gia Hoang Le
  - University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
- Quoc Chuong Ho
  - University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
- Binh Thanh Vo
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Anh-Thu Huynh Dang
  - University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
- Duc Minh Do
  - University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
- Anh Vu Hoang
  - University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
- Minh-Duy Phan
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Hoa Giang
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Le Son Tran
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
3
Liu Y, Zhang X, Zou Q, Zeng X. Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers. Bioinformatics 2021; 37:1604-1606. [PMID: 33112385 DOI: 10.1093/bioinformatics/btaa915] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/24/2020] [Revised: 09/30/2020] [Accepted: 10/14/2020] [Indexed: 12/21/2022]
Abstract
SUMMARY: Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources needed by downstream applications. Here we develop minirmd, a de novo tool that removes duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand. AVAILABILITY AND IMPLEMENTATION: https://github.com/yuansliu/minirmd. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
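The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of the underlying idea: a minimizer is the smallest k-mer of a read under some ordering, and reads sharing a minimizer can be bucketed together as duplicate candidates. The lexicographic ordering, the use of canonical k-mers to make a read and its reverse complement land in the same bucket, and the function names are all assumptions made for illustration; this is not minirmd's implementation.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def minimizer(seq, k):
    """Lexicographically smallest canonical k-mer of seq (assumed ordering)."""
    best = None
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Canonical form: the smaller of the k-mer and its reverse complement,
        # so both strands of the same fragment yield the same minimizer.
        canonical = min(kmer, reverse_complement(kmer))
        if best is None or canonical < best:
            best = canonical
    return best


def bucket_by_minimizer(reads, k):
    """Group read indices by minimizer; each bucket holds duplicate candidates."""
    buckets = {}
    for idx, read in enumerate(reads):
        buckets.setdefault(minimizer(read, k), []).append(idx)
    return buckets
```

A multi-round variant would repeat the bucketing with several values of `k` and compare reads within each bucket; only the single bucketing round is shown here.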
Affiliation(s)
- Yuansheng Liu
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410012, China
- Xiaocai Zhang
  - Advanced Analytics Institute, University of Technology Sydney, Broadway, NSW 2007, Australia
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Xiangxiang Zeng
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410012, China
4
Dai H, Guan Y. Nubeam-dedup: a fast and RAM-efficient tool to de-duplicate sequencing reads without mapping. Bioinformatics 2020; 36:3254-3256. [PMID: 32091581 DOI: 10.1093/bioinformatics/btaa112] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 10/02/2019] [Revised: 02/06/2020] [Accepted: 02/14/2020] [Indexed: 12/15/2022]
Abstract
SUMMARY: We present Nubeam-dedup, a fast and RAM-efficient tool to de-duplicate sequencing reads without a reference genome. Nubeam-dedup represents nucleotides by matrices, transforms reads into products of matrices, and on that basis assigns a unique number to each read. Thus, duplicate reads can be efficiently removed by using a collisionless hash function. Compared with other state-of-the-art reference-free tools, Nubeam-dedup uses 50-70% of the CPU time and 10-15% of the RAM. AVAILABILITY AND IMPLEMENTATION: Source code in C++ and a manual are available at https://github.com/daihang16/nubeamdedup and https://haplotype.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
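The idea of turning a read into a matrix product can be illustrated with a small sketch. The matrices below are an assumption chosen for illustration, not the ones Nubeam-dedup uses: the two integer matrices `L` and `R` generate a free monoid, so distinct words over them give distinct products, and encoding each nucleotide as a two-letter word over `L` and `R` therefore maps distinct reads to distinct matrices. The sketch assumes reads contain only A, C, G and T, and it uses the resulting matrix directly as a set key instead of the single number described in the abstract.

```python
# Assumed generators (illustrative only): L and R generate a free monoid,
# so different products of them never coincide.
L = ((1, 0), (1, 1))
R = ((1, 1), (0, 1))
IDENTITY = ((1, 0), (0, 1))


def matmul(a, b):
    """Multiply two 2x2 integer matrices stored as nested tuples."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )


# Each nucleotide becomes a two-letter word over {L, R} (hypothetical encoding).
NUCLEOTIDE_MATRIX = {
    "A": matmul(L, L),
    "C": matmul(L, R),
    "G": matmul(R, L),
    "T": matmul(R, R),
}


def read_key(read):
    """Product of the nucleotide matrices along the read (exact integers)."""
    product = IDENTITY
    for base in read:
        product = matmul(product, NUCLEOTIDE_MATRIX[base])
    return product


def dedup(reads):
    """Keep the first occurrence of each read, comparing matrix keys only."""
    seen = set()
    kept = []
    for read in reads:
        key = read_key(read)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept
```

Python integers grow without overflow, so the keys stay exact for long reads; a fixed-width implementation would need a different number representation.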
Affiliation(s)
- Hang Dai
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27705, USA
- Yongtao Guan
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27705, USA
5
Tran LS, Nguyen QTT, Nguyen CV, Tran VU, Nguyen THT, Le HT, Nguyen MLT, Le VT, Pham LS, Vo BT, Dang ATH, Nguyen LT, Nguyen TCV, Pham HAT, Tran TT, Nguyen LH, Nguyen TTT, Nguyen KHT, Vu YV, Nguyen NH, Bui VQ, Bui HH, Do TTT, Lam NV, Truong Dinh K, Phan MD, Nguyen HN, Giang H. Ultra-Deep Massive Parallel Sequencing of Plasma Cell-Free DNA Enables Large-Scale Profiling of Driver Mutations in Vietnamese Patients With Advanced Non-Small Cell Lung Cancer. Front Oncol 2020; 10:1351. [PMID: 32850431 PMCID: PMC7418519 DOI: 10.3389/fonc.2020.01351] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 03/28/2020] [Accepted: 06/26/2020] [Indexed: 01/15/2023]
Abstract
Population-specific profiling of mutations in cancer genes is of critical importance for the understanding of cancer biology in general as well as the establishment of optimal diagnostics and treatment guidelines for that particular population. Although genetic analysis of tumor tissue is often used to detect mutations in cancer genes, its invasiveness and limited accessibility hinder its application in large-scale population studies. Here, we used ultra-deep massive parallel sequencing of plasma cell-free DNA (cfDNA) to identify the mutation profiles of 265 Vietnamese patients with advanced non-small cell lung cancer (NSCLC). Compared to a cohort of advanced NSCLC patients characterized by sequencing of tissue samples, cfDNA genomic testing, despite lower mutation detection rates, was able to detect major mutations in tested driver genes that reflected similar mutation composition and distribution pattern, as well as major associations between mutation prevalence and clinical features. In conclusion, ultra-deep sequencing of plasma cfDNA represents an alternative approach for population-wide genetic profiling of cancer genes where recruitment of patients is limited by the accessibility of the tumor tissue site.
Affiliation(s)
- Ha Thu Le
  - Ha Noi Oncology Hospital, Hanoi, Vietnam
- Lam-Son Pham
  - Vietnam National Cancer Hospital, Hanoi, Vietnam
- Anh-Thu Huynh Dang
  - University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
- Yen-Vi Vu
  - Gene Solutions, Ho Chi Minh City, Vietnam
- Nien Vinh Lam
  - University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
- Hoai-Nghia Nguyen
  - University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
- Hoa Giang
  - Gene Solutions, Ho Chi Minh City, Vietnam
6
Nguyen HT, Tran DH, Ngo QD, Pham HAT, Tran TT, Tran VU, Pham TVN, Le TK, Le NAT, Nguyen NM, Vo BT, Nguyen LT, Nguyen TCV, Bui QTN, Nguyen HN, Luong BA, Le LGH, Do DM, Do TTT, Hoang AV, Dinh KT, Phan MD, Tran LS, Giang H, Nguyen HN. Evaluation of a Liquid Biopsy Protocol using Ultra-Deep Massive Parallel Sequencing for Detecting and Quantifying Circulation Tumor DNA in Colorectal Cancer Patients. Cancer Invest 2020; 38:85-93. [DOI: 10.1080/07357907.2020.1713350] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 02/07/2023]
Affiliation(s)
- Duc Huy Tran
  - University Medical Center, Ho Chi Minh City, Vietnam
- Quoc Dat Ngo
  - University of Medicine and Pharmacy, Ho Chi Minh City, Vietnam
- Hong-Anh Thi Pham
  - Gene Solutions, Ho Chi Minh City, Vietnam
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Thanh-Truong Tran
  - Gene Solutions, Ho Chi Minh City, Vietnam
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Vu-Uyen Tran
  - Gene Solutions, Ho Chi Minh City, Vietnam
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Trung Kien Le
  - University Medical Center, Ho Chi Minh City, Vietnam
- Ngoc Mai Nguyen
  - Gene Solutions, Ho Chi Minh City, Vietnam
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Binh Thanh Vo
  - Gene Solutions, Ho Chi Minh City, Vietnam
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Luan Thanh Nguyen
  - Gene Solutions, Ho Chi Minh City, Vietnam
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Thien-Chi Van Nguyen
  - Gene Solutions, Ho Chi Minh City, Vietnam
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Quynh Tram Nguyen Bui
  - Gene Solutions, Ho Chi Minh City, Vietnam
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Huu-Nguyen Nguyen
  - Gene Solutions, Ho Chi Minh City, Vietnam
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Bac An Luong
  - University of Medicine and Pharmacy, Ho Chi Minh City, Vietnam
- Duc Minh Do
  - University of Medicine and Pharmacy, Ho Chi Minh City, Vietnam
- Thanh-Thuy Thi Do
  - University of Medicine and Pharmacy, Ho Chi Minh City, Vietnam
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Anh Vu Hoang
  - University of Medicine and Pharmacy, Ho Chi Minh City, Vietnam
- Minh-Duy Phan
  - Gene Solutions, Ho Chi Minh City, Vietnam
  - School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, Australia
- Hoa Giang
  - Gene Solutions, Ho Chi Minh City, Vietnam
  - Medical Genetics Institute, Ho Chi Minh City, Vietnam
7
NGSReadsTreatment - A Cuckoo Filter-based Tool for Removing Duplicate Reads in NGS Data. Sci Rep 2019; 9:11681. [PMID: 31406180 PMCID: PMC6690869 DOI: 10.1038/s41598-019-48242-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 10/16/2018] [Accepted: 08/01/2019] [Indexed: 11/24/2022]
Abstract
Next-Generation Sequencing (NGS) platforms provide a major approach to obtaining millions of short reads from samples. NGS has been used in a wide range of analyses, such as determining genome sequences, analyzing evolutionary processes, identifying gene expression and resolving metagenomic analyses. Usually, the quality of NGS data impacts the final study conclusions. Moreover, quality assessment is generally considered the first step in data analysis to ensure that only reliable reads are used for further studies. In NGS platforms, the presence of duplicated reads (redundancy), which are usually introduced during library sequencing, is a major issue. These might have a serious impact on research applications, as redundancies in reads can lead to difficulties in subsequent analysis (e.g., de novo genome assembly). Herein, we present NGSReadsTreatment, a computational tool for the removal of duplicated reads in paired-end or single-end datasets. NGSReadsTreatment can handle reads from any platform with the same or different sequence lengths. Using the probabilistic structure Cuckoo Filter, the redundant reads are identified and removed by comparing the reads with themselves. Thus, no prerequisite is required beyond the set of reads. NGSReadsTreatment was compared with other redundancy removal tools in analyzing different sets of reads. The results demonstrated that NGSReadsTreatment was better than the other tools in both the amount of redundancies removed and the use of computational memory for all analyses performed. Available at https://sourceforge.net/projects/ngsreadstreatment/.
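To make the role of the Cuckoo Filter concrete, here is a minimal sketch of a cuckoo filter used for read deduplication. It illustrates the general data structure only and is not NGSReadsTreatment's code; the class name, the 16-bit fingerprint, the bucket sizes and the choice of hash functions are all assumptions. Because the filter stores short fingerprints instead of full reads, it saves memory at the cost of occasional false positives, meaning a unique read may rarely be discarded as a duplicate.

```python
import hashlib
import random


class CuckooFilter:
    """Tiny cuckoo filter storing 16-bit fingerprints (illustrative parameters)."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR-based alternate index in range.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(data):
        return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

    @staticmethod
    def _fingerprint(item):
        return hashlib.md5(item).digest()[:2]

    def _indices(self, item, fp):
        i1 = self._hash(item) % self.num_buckets
        i2 = (i1 ^ self._hash(fp)) % self.num_buckets
        return i1, i2

    def contains(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def insert(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = self.rng.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = (i ^ self._hash(fp)) % self.num_buckets
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is considered full


def dedup_reads(reads, cuckoo=None):
    """Keep reads whose fingerprint has not been seen; rare false positives possible."""
    if cuckoo is None:
        cuckoo = CuckooFilter()
    kept = []
    for read in reads:
        key = read.encode()
        if cuckoo.contains(key):
            continue  # probable duplicate
        if not cuckoo.insert(key):
            raise RuntimeError("cuckoo filter is full; use more buckets")
        kept.append(read)
    return kept
```

For paired-end data, one option under the same assumptions is to concatenate both mates into a single key before calling `insert`.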
8
Identification of factors associated with duplicate rate in ChIP-seq data. PLoS One 2019; 14:e0214723. [PMID: 30943272 PMCID: PMC6447195 DOI: 10.1371/journal.pone.0214723] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 09/09/2018] [Accepted: 03/19/2019] [Indexed: 12/20/2022]
Abstract
Chromatin immunoprecipitation and sequencing (ChIP-seq) has been widely used to map DNA-binding proteins, histone proteins and their modifications. ChIP-seq data contain redundant reads termed duplicates, referring to those mapping to the same genomic location and strand. There are two main sources of duplicates: polymerase chain reaction (PCR) duplicates and natural duplicates. Unlike natural duplicates that represent true signals from sequencing of independent DNA templates, PCR duplicates are artifacts originating from sequencing of identical copies amplified from the same DNA template. In analysis, duplicates are removed from peak calling and signal quantification. Nevertheless, a significant portion of the duplicates is believed to represent true signals. Removing all duplicates will therefore underestimate the signal level in peaks and impact the identification of signal changes across samples. An in-depth evaluation of the impact of duplicate removal is thus needed. Using eight public ChIP-seq datasets from three narrow-peak and two broad-peak marks, we tried to understand the distribution of duplicates in the genome, the extent to which duplicate removal impacts peak calling and signal estimation, and the factors associated with duplicate level in peaks. The three PCR-free histone H3 lysine 4 trimethylation (H3K4me3) ChIP-seq datasets had about 40% duplicates, and 97% of them were within peaks. For the other datasets generated with PCR amplification of ChIP DNA, as expected, the narrow-peak marks have a much higher proportion of duplicates than the broad-peak marks. We found that duplicates are enriched in peaks and largely represent true signals, more conspicuously in peaks with high confidence. Furthermore, duplicate level in peaks is strongly correlated with the target enrichment level estimated using nonredundant reads, which provides the basis to properly allocate duplicates between noise and signal. Our analysis supports the feasibility of retaining the portion of signal duplicates in downstream analysis, thus alleviating the limitation of complete deduplication.
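The definition of a duplicate used in this abstract (reads mapping to the same genomic location and strand) can be turned into a duplicate-rate calculation with a few lines. This is a minimal sketch that assumes each alignment has already been reduced to a hypothetical `(chromosome, position, strand)` tuple; it does not distinguish PCR duplicates from natural duplicates, which is exactly the difficulty the study discusses.

```python
from collections import Counter


def duplicate_rate(alignments):
    """Fraction of reads that repeat an already-seen (chrom, pos, strand) key."""
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every read beyond the first at a given location and strand is a duplicate.
    duplicates = total - len(counts)
    return duplicates / total


# Two reads share chr1:100 on the plus strand; the minus-strand read at the
# same position is not a duplicate of them.
example = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 100, "-"), ("chr2", 5, "+")]
rate = duplicate_rate(example)  # 1 duplicate out of 4 reads
```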
9
Martin TC, Visconti A, Spector TD, Falchi M. Conducting metagenomic studies in microbiology and clinical research. Appl Microbiol Biotechnol 2018; 102:8629-8646. [PMID: 30078138 PMCID: PMC6153607 DOI: 10.1007/s00253-018-9209-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 02/27/2018] [Revised: 06/28/2018] [Accepted: 06/28/2018] [Indexed: 12/11/2022]
Abstract
Owing to the increased cost-effectiveness of high-throughput technologies, the number of studies focusing on the human microbiome and its connections to human health and disease has recently surged. However, best practices in microbiology and clinical research have yet to be clearly established. Here, we present an overview of the challenges and opportunities involved in conducting a metagenomic study, with a particular focus on data processing and analytical methods.
Affiliation(s)
- Tiphaine C. Martin
  - Department of Twin Research and Genetic Epidemiology, King’s College London, London, UK
  - Department of Oncological Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Alessia Visconti
  - Department of Twin Research and Genetic Epidemiology, King’s College London, London, UK
- Tim D. Spector
  - Department of Twin Research and Genetic Epidemiology, King’s College London, London, UK
- Mario Falchi
  - Department of Twin Research and Genetic Epidemiology, King’s College London, London, UK
10
Clement K, Farouni R, Bauer DE, Pinello L. AmpUMI: design and analysis of unique molecular identifiers for deep amplicon sequencing. Bioinformatics 2018; 34:i202-i210. [PMID: 29949956 PMCID: PMC6022702 DOI: 10.1093/bioinformatics/bty264] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Indexed: 12/18/2022]
Abstract
Motivation: Unique molecular identifiers (UMIs) are added to DNA fragments before PCR amplification to discriminate between alleles arising from the same genomic locus and sequencing reads produced by PCR amplification. While computational methods have been developed to take into account UMI information in genome-wide and single-cell sequencing studies, they are not designed for modern amplicon-based sequencing experiments, especially in cases of high allelic diversity. Importantly, no guidelines are provided for the design of optimal UMI length for amplicon-based sequencing experiments. Results: Based on the total number of DNA fragments and the distribution of allele frequencies, we present a model for the determination of the minimum UMI length required to prevent UMI collisions and reduce allelic distortion. We also introduce a user-friendly software tool called AmpUMI to assist in the design and the analysis of UMI-based amplicon sequencing studies. AmpUMI provides quality control metrics on frequency and quality of UMIs, and trims and deduplicates amplicon sequences with user-specified parameters for use in downstream analysis. Availability and implementation: AmpUMI is open-source and freely available at http://github.com/pinellolab/AmpUMI.
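The link between the number of DNA fragments and the UMI length needed to avoid collisions can be illustrated with a simple birthday-problem calculation. This is a deliberately simplified sketch and not the AmpUMI model: it assumes UMIs are drawn uniformly at random from all 4^L sequences and ignores the allele-frequency distribution that the published model takes into account. The function names and the 0.95 default threshold are assumptions.

```python
import math


def p_no_collision(n_molecules, umi_length):
    """Probability that n molecules all receive distinct random UMIs of this length."""
    space = 4 ** umi_length
    if n_molecules > space:
        return 0.0  # more molecules than UMIs: a collision is certain
    # log of prod_{i=0}^{n-1} (1 - i / space), summed for numerical stability
    log_p = sum(math.log1p(-i / space) for i in range(n_molecules))
    return math.exp(log_p)


def min_umi_length(n_molecules, min_p=0.95, max_length=30):
    """Smallest UMI length whose no-collision probability reaches min_p."""
    for length in range(1, max_length + 1):
        if p_no_collision(n_molecules, length) >= min_p:
            return length
    raise ValueError("no UMI length up to max_length satisfies min_p")
```

For example, with two molecules and 1-base UMIs the no-collision probability is 3/4, so `min_umi_length(2, min_p=0.9)` moves up to a 2-base UMI (15/16).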
Affiliation(s)
- Kendell Clement
  - Molecular Pathology Unit and Cancer Center, Massachusetts General Hospital, Boston, MA, USA
  - Department of Pathology, Harvard Medical School, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Rick Farouni
  - Molecular Pathology Unit and Cancer Center, Massachusetts General Hospital, Boston, MA, USA
  - Department of Pathology, Harvard Medical School, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Daniel E Bauer
  - Division of Hematology/Oncology, Boston Children's Hospital; Department of Pediatric Oncology, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Pediatrics, Harvard Medical School, Boston, MA, USA
  - Harvard Stem Cell Institute, Cambridge, MA, USA
- Luca Pinello
  - Molecular Pathology Unit and Cancer Center, Massachusetts General Hospital, Boston, MA, USA
  - Department of Pathology, Harvard Medical School, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
11
Corsetti PP, de Almeida LA, Gonçalves ANA, Gomes MTR, Guimarães ES, Marques JT, Oliveira SC. miR-181a-5p Regulates TNF-α and miR-21a-5p Influences Gualynate-Binding Protein 5 and IL-10 Expression in Macrophages Affecting Host Control of Brucella abortus Infection. Front Immunol 2018; 9:1331. [PMID: 29942317 PMCID: PMC6004377 DOI: 10.3389/fimmu.2018.01331] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 04/02/2018] [Accepted: 05/29/2018] [Indexed: 12/13/2022]
Abstract
Brucella abortus is a Gram-negative intracellular bacterium that causes a worldwide zoonosis termed brucellosis, which is characterized as a debilitating infection with serious clinical manifestations leading to severe complications. In spite of great advances in studies involving host–B. abortus interactions, there are many gaps related to B. abortus modulation of the host immune response through regulatory mechanisms. Here, we deep sequenced small RNAs from bone marrow-derived macrophages infected with B. abortus, identifying 69 microRNAs (miRNAs) that were differentially expressed during infection. We further validated the expression of four upregulated and five downregulated miRNAs during infection in vitro that displayed the same profile in spleens from infected mice at 1, 3, or 6 days post-infection. Among these miRNAs, mmu-miR-181a-5p (upregulated) and mmu-miR-21a-5p (downregulated) were selected for further analysis. First, we determined that changes in the expression of both miRNAs induced by infection were dependent on the adaptor molecule MyD88. Furthermore, evaluating putative targets of mmu-miR-181a-5p, we demonstrated that this miRNA negatively regulates TNF-α expression following Brucella infection. By contrast, miR-21a-5p targets included a negative regulator of IL-10, programmed cell death protein 4, and several guanylate-binding proteins (GBPs). As a result, during infection, miR-21a-5p led to upregulation of IL-10 expression and downregulation of GBP5 in macrophages infected with Brucella. Since GBP5 and IL-10 are important molecules involved in host control of Brucella infection, we decided to investigate the role of mmu-miR-21a-5p in bacterial replication in macrophages. We observed that treating macrophages with a mmu-miR-21a-5p mimic enhanced bacterial growth, whereas transfection of its inhibitor reduced Brucella load in macrophages. Taken together, the results indicate that downregulation of mmu-miR-21a-5p induced by infection increases GBP5 levels and decreases IL-10 expression, thus contributing to bacterial control in host cells.
Affiliation(s)
- Patrícia P Corsetti
  - Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
  - Departmento de Microbiologia e Imunologia, Universidade Federal de Alfenas, Alfenas, Brazil
- Leonardo A de Almeida
  - Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
  - Departmento de Microbiologia e Imunologia, Universidade Federal de Alfenas, Alfenas, Brazil
- André Nicolau Aquime Gonçalves
  - Laboratorio de Sorologia, Microbiologia e Biologia Molecular, Universidade Federal de Santa Catarina, Florianópolis, Brazil
- Marco Túlio R Gomes
  - Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Erika S Guimarães
  - Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- João T Marques
  - Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Sergio C Oliveira
  - Departamento de Bioquímica e Imunologia, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
  - Instituto Nacional de Ciência e Tecnologia em Doenças Tropicais (INCT-DT), Conselho Nacional de Desenvolvimento Científico e Tecnológico, Ministério de Ciência Tecnologia e Inovação Salvador, Salvador, Brazil
12
Klepikova AV, Kasianov AS, Chesnokov MS, Lazarevich NL, Penin AA, Logacheva M. Effect of method of deduplication on estimation of differential gene expression using RNA-seq. PeerJ 2017; 5:e3091. [PMID: 28321364 PMCID: PMC5357343 DOI: 10.7717/peerj.3091] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 11/29/2016] [Accepted: 02/14/2017] [Indexed: 12/11/2022]
Abstract
Background: RNA-seq is a useful tool for analysis of gene expression. However, its robustness is greatly affected by a number of artifacts. One of them is the presence of duplicated reads. Results: To infer the influence of different methods of removal of duplicated reads on estimation of gene expression in cancer genomics, we analyzed paired samples of hepatocellular carcinoma (HCC) and non-tumor liver tissue. Four protocols of data analysis were applied to each sample: processing without deduplication, deduplication using a method implemented in SAMtools, and deduplication based on one or two molecular indices (MI). We also analyzed the influence of sequencing layout (single read or paired end) and read length. We found that deduplication without MI greatly affects estimated expression values; this effect is most pronounced for highly expressed genes. Conclusion: The use of unique molecular identifiers greatly improves the accuracy of RNA-seq analysis, especially for highly expressed genes. We developed a set of scripts that enable handling of MI and their incorporation into RNA-seq analysis pipelines. Deduplication without MI affects the results of differential gene expression analysis, producing a high proportion of false negative results. Omitting duplicate read removal biases the results towards false positives. In those cases where using MI is not possible, we recommend using a paired-end sequencing layout.
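Why deduplication without molecular indices hits highly expressed genes hardest can be seen in a toy example. The sketch below is an illustration only, not the authors' scripts: reads are reduced to hypothetical `(gene, position, mi)` tuples, and the gene names and MI sequences are made up. With the MI in the key, two distinct molecules that start at the same position are both counted; without it, they collapse into one.

```python
def count_molecules(reads, use_mi=True):
    """Count deduplicated reads per gene, keyed with or without the molecular index."""
    seen = set()
    counts = {}
    for gene, position, mi in reads:
        key = (gene, position, mi) if use_mi else (gene, position)
        if key in seen:
            continue  # treated as a duplicate under the chosen key
        seen.add(key)
        counts[gene] = counts.get(gene, 0) + 1
    return counts


# Hypothetical reads: the highly expressed gene has two different molecules
# (MIs "AAC" and "GGT") that happen to start at the same position.
reads = [
    ("ALB", 10, "AAC"),
    ("ALB", 10, "AAC"),   # true PCR duplicate
    ("ALB", 10, "GGT"),   # different molecule, same start position
    ("ALB", 25, "AAC"),
    ("TP53", 7, "CCA"),
]
with_mi = count_molecules(reads, use_mi=True)      # ALB counted 3 times
without_mi = count_molecules(reads, use_mi=False)  # ALB counted 2 times
```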
Affiliation(s)
- Anna V Klepikova
- Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia.,A. N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia
| | - Artem S Kasianov
- A. N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia.,N. I. Vavilov Institute for General Genetics, Moscow, Russia
| | - Mikhail S Chesnokov
- N.N. Blokhin Russian Cancer Research Center of the Ministry of Health of the Russian Federation, Moscow, Russia
| | - Natalia L Lazarevich
- N.N. Blokhin Russian Cancer Research Center of the Ministry of Health of the Russian Federation, Moscow, Russia.,Department of Biology, Lomonosov Moscow State University, Moscow, Russia
| | - Aleksey A Penin
- Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia.,A. N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia.,Department of Biology, Lomonosov Moscow State University, Moscow, Russia
| | - Maria Logacheva
- Institute for Information Transmission Problems of the Russian Academy of Sciences, Moscow, Russia.,A. N. Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow, Russia.,Extreme Biology Laboratory, Institute of Fundamental Medicine and Biology, Kazan Federal University, Kazan
|
13
|
Manconi A, Moscatelli M, Armano G, Gnocchi M, Orro A, Milanesi L. Removing duplicate reads using graphics processing units. BMC Bioinformatics 2016; 17:346. [PMID: 28185553 PMCID: PMC5123249 DOI: 10.1186/s12859-016-1192-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 12/30/2022] Open
Abstract
Background During library construction polymerase chain reaction is used to enrich the DNA before sequencing. Typically, this process generates duplicate read sequences. Removal of these artifacts is mandatory, as they can affect the correct interpretation of data in several analyses. Ideally, duplicate reads should be characterized by identical nucleotide sequences. However, due to sequencing errors, duplicates may also be nearly identical. Removing nearly identical duplicates can require notable computational effort. To deal with this challenge, we recently proposed a GPU method aimed at removing identical and nearly identical duplicates generated with an Illumina platform. The method implements an approach based on prefix-suffix comparison. Read sequences with identical prefixes are considered potential duplicates. Then, their suffixes are compared to identify and remove those that are actually duplicated. Although the method can be efficiently used to remove duplicates, there are some limitations that need to be overcome. In particular, it cannot detect potential duplicates in the event that prefixes are longer than 27 bases, and it does not provide support for paired-end read libraries. Moreover, large clusters of potential duplicates are split into smaller ones with the aim of guaranteeing a reasonable computing time. This heuristic may affect the accuracy of the analysis. Results In this work we propose GPU-DupRemoval, a new implementation of our method able to (i) cluster reads without constraints on the maximum length of the prefixes, (ii) support both single- and paired-end read libraries, and (iii) analyze large clusters of potential duplicates. Conclusions Due to the massive parallelization obtained by exploiting graphics cards, GPU-DupRemoval removes duplicate reads faster than other cutting-edge solutions, while outperforming most of them in terms of the number of duplicate reads removed.
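The prefix-suffix comparison described in this abstract can be sketched on the CPU as follows. This is only an illustration under stated assumptions: the function names, the default `prefix_len`, and the mismatch threshold are hypothetical, and the sketch is sequential Python rather than the GPU implementation the authors describe.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def within_mismatches(a: str, b: str, max_mismatches: int) -> bool:
    """True if a and b have equal length and differ in at most max_mismatches positions."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_mismatches:
                return False
    return True


def remove_duplicates(reads: Sequence[str], prefix_len: int = 10,
                      max_mismatches: int = 1) -> List[str]:
    """Keep one representative per group of identical or nearly identical reads."""
    # Step 1: reads sharing an identical prefix are potential duplicates.
    clusters: Dict[str, List[int]] = defaultdict(list)
    for idx, read in enumerate(reads):
        clusters[read[:prefix_len]].append(idx)

    # Step 2: within each cluster, compare suffixes against kept representatives.
    keep: List[int] = []
    for indices in clusters.values():
        representatives: List[int] = []
        for idx in indices:
            suffix = reads[idx][prefix_len:]
            is_duplicate = any(
                within_mismatches(suffix, reads[rep][prefix_len:], max_mismatches)
                for rep in representatives
            )
            if not is_duplicate:
                representatives.append(idx)
        keep.extend(representatives)

    # Return survivors in their original input order.
    return [reads[i] for i in sorted(keep)]
```

Because only reads with the same prefix are ever compared, a sequencing error inside the prefix hides a duplicate from this scheme; that trade-off is inherent to the clustering step, not specific to this sketch.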
Affiliation(s)
- Andrea Manconi
- Institute for Biomedical Technologies, National Research Council, Via Fratelli Cervi, 93, Segrate (Mi), 20090, Italy.
| | - Marco Moscatelli
- Institute for Biomedical Technologies, National Research Council, Via Fratelli Cervi, 93, Segrate (Mi), 20090, Italy
| | - Giuliano Armano
- Department of Electrical and Electronic Engineering, University of Cagliari, P.zza D'Armi, Cagliari (CA), 09123, Italy
| | - Matteo Gnocchi
- Institute for Biomedical Technologies, National Research Council, Via Fratelli Cervi, 93, Segrate (Mi), 20090, Italy
| | - Alessandro Orro
- Institute for Biomedical Technologies, National Research Council, Via Fratelli Cervi, 93, Segrate (Mi), 20090, Italy
| | - Luciano Milanesi
- Institute for Biomedical Technologies, National Research Council, Via Fratelli Cervi, 93, Segrate (Mi), 20090, Italy
|
14
|
Dunning LT, Hipperson H, Baker WJ, Butlin RK, Devaux C, Hutton I, Igea J, Papadopulos AST, Quan X, Smadja CM, Turnbull CGN, Savolainen V. Ecological speciation in sympatric palms: 1. Gene expression, selection and pleiotropy. J Evol Biol 2016; 29:1472-87. [PMID: 27177130 PMCID: PMC6680112 DOI: 10.1111/jeb.12895] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 11/05/2015] [Revised: 05/04/2016] [Accepted: 05/11/2016] [Indexed: 02/02/2023]
Abstract
Ecological speciation requires divergent selection, reproductive isolation and a genetic mechanism to link the two. We examined the role of gene expression and coding sequence evolution in this process using two species of Howea palms that have diverged sympatrically on Lord Howe Island, Australia. These palms are associated with distinct soil types and have displaced flowering times, representing an ideal candidate for ecological speciation. We generated large amounts of RNA‐Seq data from multiple individuals and tissue types collected on the island from each of the two species. We found that differentially expressed loci as well as those with divergent coding sequences between Howea species were associated with known ecological and phenotypic differences, including response to salinity, drought, pH and flowering time. From these loci, we identified potential ‘ecological speciation genes’ and further validate their effect on flowering time by knocking out orthologous loci in a model plant species. Finally, we put forward six plausible ecological speciation loci, providing support for the hypothesis that pleiotropy could help to overcome the antagonism between selection and recombination during speciation with gene flow.
Affiliation(s)
- L T Dunning
- Department of Life Sciences, Imperial College London, Ascot, UK
| | - H Hipperson
- Department of Life Sciences, Imperial College London, Ascot, UK
| | - W J Baker
- Royal Botanic Gardens, Kew, Richmond, UK
| | - R K Butlin
- Department of Animal and Plant Sciences, University of Sheffield, Sheffield, UK.,Sven Lovén Centre for Marine Sciences, Tjärnö, University of Gothenburg, Stromstäd, Sweden
| | - C Devaux
- Department of Life Sciences, Imperial College London, Ascot, UK
| | - I Hutton
- Lord Howe Island Museum, Lord Howe Island, NSW, Australia
| | - J Igea
- Department of Life Sciences, Imperial College London, Ascot, UK
| | - A S T Papadopulos
- Department of Life Sciences, Imperial College London, Ascot, UK.,Royal Botanic Gardens, Kew, Richmond, UK
| | - X Quan
- Department of Life Sciences, Imperial College London, Ascot, UK
| | - C M Smadja
- Department of Life Sciences, Imperial College London, Ascot, UK
| | - C G N Turnbull
- Department of Life Sciences, Imperial College London, London, UK
| | - V Savolainen
- Department of Life Sciences, Imperial College London, Ascot, UK.,Royal Botanic Gardens, Kew, Richmond, UK
|
15
|
González-Domínguez J, Schmidt B. ParDRe: faster parallel duplicated reads removal tool for sequencing studies. Bioinformatics 2016; 32:1562-4. [DOI: 10.1093/bioinformatics/btw038] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 11/25/2015] [Accepted: 01/17/2016] [Indexed: 11/14/2022] Open
|
16
|
Brand P, Ramírez SR, Leese F, Quezada-Euan JJG, Tollrian R, Eltz T. Rapid evolution of chemosensory receptor genes in a pair of sibling species of orchid bees (Apidae: Euglossini). BMC Evol Biol 2015; 15:176. [PMID: 26314297 PMCID: PMC4552289 DOI: 10.1186/s12862-015-0451-9] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Received: 03/22/2015] [Accepted: 08/10/2015] [Indexed: 12/13/2022] Open
Abstract
Background Insects rely more on chemical signals (semiochemicals) than on any other sensory modality to find, identify, and choose mates. In most insects, pheromone production is typically regulated through biosynthetic pathways, whereas pheromone sensory detection is controlled by the olfactory system. Orchid bees are exceptional in that their semiochemicals are not produced metabolically, but instead male bees collect odoriferous compounds (perfumes) from the environment and store them in specialized hind-leg pockets to subsequently expose during courtship display. Thus, the olfactory sensory system of orchid bees simultaneously controls male perfume traits (sender components) and female preferences (receiver components). This functional linkage increases the opportunities for parallel evolution of male traits and female preferences, particularly in response to genetic changes of chemosensory detection (e.g. Odorant Receptor genes). To identify whether shifts in pheromone composition among related lineages of orchid bees are associated with divergence in chemosensory genes of the olfactory periphery, we searched for patterns of divergent selection across the antennal transcriptomes of two recently diverged sibling species Euglossa dilemma and E. viridissima. Results We identified 3185 orthologous genes including 94 chemosensory loci from five different gene families (Odorant Receptors, Ionotropic Receptors, Gustatory Receptors, Odorant Binding Proteins, and Chemosensory Proteins). Our results revealed that orthologs with signatures of divergent selection between E. dilemma and E. viridissima were significantly enriched for chemosensory genes. Notably, elevated signals of divergent selection were almost exclusively observed among chemosensory receptors (i.e. Odorant Receptors). Conclusions Our results suggest that rapid changes in the chemosensory gene family occurred among closely related species of orchid bees. 
These findings are consistent with the hypothesis that strong divergent selection acting on chemosensory receptor genes plays an important role in the evolution and diversification of insect pheromone systems.
Affiliation(s)
- Philipp Brand
- Department of Animal Ecology, Evolution and Biodiversity, Ruhr University Bochum, Universitätsstrasse 150, D-44801, Bochum, Germany.,Department for Evolution and Ecology, Center for Population Biology, University of California Davis, One Shields Avenue, 95616, Davis, USA.
| | - Santiago R Ramírez
- Department for Evolution and Ecology, Center for Population Biology, University of California Davis, One Shields Avenue, 95616, Davis, USA.
| | - Florian Leese
- Department of Animal Ecology, Evolution and Biodiversity, Ruhr University Bochum, Universitätsstrasse 150, D-44801, Bochum, Germany.,Present address: Faculty of Biology, Aquatic Ecosystems Research, University of Duisburg and Essen, Universitätsstrasse 5, D-45141, Essen, Germany.
| | | | - Ralph Tollrian
- Department of Animal Ecology, Evolution and Biodiversity, Ruhr University Bochum, Universitätsstrasse 150, D-44801, Bochum, Germany.
| | - Thomas Eltz
- Department of Animal Ecology, Evolution and Biodiversity, Ruhr University Bochum, Universitätsstrasse 150, D-44801, Bochum, Germany.
|
17
|
Draft Genome Sequence of the Archiascomycetous Yeast Saitoella complicata. GENOME ANNOUNCEMENTS 2015; 3:e00220-15. [PMID: 26021914 PMCID: PMC4447899 DOI: 10.1128/genomea.00220-15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
The draft genome sequence of the archiascomycetous yeast Saitoella complicata was determined. The assembly of newly and previously sequenced data sets resulted in 104 contigs (total of 14.1 Mbp; N50, 239 kbp). On the newly assembled genome, a total of 6,933 protein-coding sequences (7,119 transcripts, including alternative splicing forms) were identified.
|
18
|
Nishida H, Matsumoto T, Kondo S, Hamamoto M, Yoshikawa H. The early diverging ascomycetous budding yeast Saitoella complicata has three histone deacetylases belonging to the Clr6, Hos2, and Rpd3 lineages. J GEN APPL MICROBIOL 2015; 60:7-12. [PMID: 24646756 DOI: 10.2323/jgam.60.7] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/03/2022]
Abstract
We sequenced the genomic DNA and the transcribed RNA of the ascomycetous budding yeast Saitoella complicata, which belongs to the earliest lineage (Taphrinomycotina) of ascomycetes. We found 3 protein-coding regions similar to Clr6 of Schizosaccharomyces (a member of Taphrinomycotina). Clr6 has a structure similar to that of Rpd3 and Hos2 of Saccharomyces. These proteins belong to the class 1 histone deacetylase (HDAC) family. The phylogenetic tree showed that the Clr6, Hos2, and Rpd3 lineages are separated in fungal HDACs. Basidiomycetes have 3 proteins belonging to the Clr6, Hos2, and Rpd3 lineages. On the other hand, whereas ascomycetes except for Schizosaccharomyces have the Hos2 and Rpd3 homologs, and lack the Clr6 homolog, Schizosaccharomyces has the Clr6 and Hos2 homologs, and lacks the Rpd3 homolog. Interestingly, Pneumocystis and Saitoella have 3 proteins belonging to the Clr6, Hos2, and Rpd3 lineages. Thus, these fungi are the first ascomycete found to possess all 3 types. Our findings indicated that Taphrinomycotina has conserved the Clr6 protein, suggesting that the ancestor of Dikarya (ascomycetes and basidiomycetes) had 3 proteins belonging to the Clr6, Hos2, and Rpd3 lineages. During ascomycete evolution, Pezizomycotina and Saccharomycotina appear to have lost their Clr6 homologs and Schizosaccharomyces to have lost its Rpd3 homolog.
Affiliation(s)
- Hiromi Nishida
- Biotechnology Research Center and Department of Biotechnology, Toyama Prefectural University
|
19
|
Manconi A, Manca E, Moscatelli M, Gnocchi M, Orro A, Armano G, Milanesi L. G-CNV: A GPU-Based Tool for Preparing Data to Detect CNVs with Read-Depth Methods. Front Bioeng Biotechnol 2015; 3:28. [PMID: 25806367 PMCID: PMC4354384 DOI: 10.3389/fbioe.2015.00028] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 12/10/2014] [Accepted: 02/19/2015] [Indexed: 11/23/2022] Open
Abstract
Copy number variations (CNVs) are the most prevalent types of structural variations (SVs) in the human genome and are involved in a wide range of common human diseases. Different computational methods have been devised to detect this type of SVs and to study how they are implicated in human diseases. Recently, computational methods based on high-throughput sequencing (HTS) are increasingly used. The majority of these methods focus on mapping short-read sequences generated from a donor against a reference genome to detect signatures distinctive of CNVs. In particular, read-depth based methods detect CNVs by analyzing genomic regions with significantly different read-depth from the other ones. The pipeline analysis of these methods consists of four main stages: (i) data preparation, (ii) data normalization, (iii) CNV regions identification, and (iv) copy number estimation. However, available tools do not support most of the operations required at the first two stages of this pipeline. Typically, they start the analysis by building the read-depth signal from pre-processed alignments. Therefore, third-party tools must be used to perform most of the preliminary operations required to build the read-depth signal. These data-intensive operations can be efficiently parallelized on graphics processing units (GPUs). In this article, we present G-CNV, a GPU-based tool devised to perform the common operations required at the first two stages of the analysis pipeline. G-CNV is able to filter low-quality read sequences, to mask low-quality nucleotides, to remove adapter sequences, to remove duplicated read sequences, to map the short-reads, to resolve multiple mapping ambiguities, to build the read-depth signal, and to normalize it. G-CNV can be efficiently used as a third-party tool able to prepare data for the subsequent read-depth signal generation and analysis. Moreover, it can also be integrated in CNV detection tools to generate read-depth signals.
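The last two preparation steps named in this abstract, building the read-depth signal and normalizing it, can be sketched as below. This is an illustrative assumption only: fixed-size windows, counting reads by start position, and median normalization are choices made for the sketch, and none of the function names come from G-CNV.

```python
from statistics import median
from typing import List, Sequence


def build_read_depth(read_starts: Sequence[int], genome_length: int,
                     window: int) -> List[int]:
    """Count mapped reads per fixed-size window, keyed on 0-based start position."""
    n_windows = (genome_length + window - 1) // window  # ceiling division
    depth = [0] * n_windows
    for start in read_starts:
        if 0 <= start < genome_length:
            depth[start // window] += 1
    return depth


def normalize_by_median(depth: Sequence[int]) -> List[float]:
    """Scale the signal so that a typical window has value 1.0."""
    if not depth:
        return []
    mid = median(depth)
    if mid == 0:
        # Signal too sparse to normalize by the median; return zeros.
        return [0.0 for _ in depth]
    return [d / mid for d in depth]


signal = build_read_depth([0, 5, 12, 13, 14, 25], genome_length=30, window=10)
print(signal, normalize_by_median(signal))
```

Windows whose normalized depth deviates strongly from 1.0 would be the candidates a read-depth CNV caller examines in the later stages of the pipeline.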
Affiliation(s)
- Andrea Manconi
- Institute for Biomedical Technologies, National Research Council , Milan , Italy
| | - Emanuele Manca
- Department of Electrical and Electronic Engineering, University of Cagliari , Cagliari , Italy
| | - Marco Moscatelli
- Institute for Biomedical Technologies, National Research Council , Milan , Italy
| | - Matteo Gnocchi
- Institute for Biomedical Technologies, National Research Council , Milan , Italy
| | - Alessandro Orro
- Institute for Biomedical Technologies, National Research Council , Milan , Italy
| | - Giuliano Armano
- Department of Electrical and Electronic Engineering, University of Cagliari , Cagliari , Italy
| | - Luciano Milanesi
- Institute for Biomedical Technologies, National Research Council , Milan , Italy
|
20
|
Kingsford C, Patro R. Reference-based compression of short-read sequences using path encoding. Bioinformatics 2015; 31:1920-8. [PMID: 25649622 PMCID: PMC4481695 DOI: 10.1093/bioinformatics/btv071] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Received: 07/23/2014] [Accepted: 01/29/2015] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed. RESULTS We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3-11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved. AVAILABILITY AND IMPLEMENTATION Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X.
Affiliation(s)
- Carl Kingsford
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA and Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA
| | - Rob Patro
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA and Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA
|
21
|
Zhou X, Rokas A. Prevention, diagnosis and treatment of high-throughput sequencing data pathologies. Mol Ecol 2014; 23:1679-700. [DOI: 10.1111/mec.12680] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 11/04/2013] [Revised: 01/17/2014] [Accepted: 01/22/2014] [Indexed: 12/17/2022]
Affiliation(s)
- Xiaofan Zhou
- Department of Biological Sciences; Vanderbilt University; Nashville TN 37235 USA
| | - Antonis Rokas
- Department of Biological Sciences; Vanderbilt University; Nashville TN 37235 USA
|
22
|
Zhou S, Liao R, Guan J. When cloud computing meets bioinformatics: a review. J Bioinform Comput Biol 2013; 11:1330002. [PMID: 24131049 DOI: 10.1142/s0219720013300025] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 12/15/2022]
Abstract
In the past decades, with the rapid development of high-throughput technologies, biology research has generated an unprecedented amount of data. In order to store and process such a great amount of data, cloud computing and MapReduce were applied to many fields of bioinformatics. In this paper, we first introduce the basic concepts of cloud computing and MapReduce, and their applications in bioinformatics. We then highlight some problems challenging the applications of cloud computing and MapReduce to bioinformatics. Finally, we give a brief guideline for using cloud computing in biology research.
Affiliation(s)
- Shuigeng Zhou
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, P. R. China
|
23
|
Xu H, Luo X, Qian J, Pang X, Song J, Qian G, Chen J, Chen S. FastUniq: a fast de novo duplicates removal tool for paired short reads. PLoS One 2012; 7:e52249. [PMID: 23284954 PMCID: PMC3527383 DOI: 10.1371/journal.pone.0052249] [Citation(s) in RCA: 388] [Impact Index Per Article: 29.8] [Received: 05/13/2012] [Accepted: 11/16/2012] [Indexed: 11/19/2022] Open
Abstract
The presence of duplicates introduced by PCR amplification is a major issue in paired short reads from next-generation sequencing platforms. These duplicates might have a serious impact on research applications, such as scaffolding in whole-genome sequencing and discovering large-scale genome variations, and are usually removed. We present FastUniq as a fast de novo tool for removal of duplicates in paired short reads. FastUniq identifies duplicates by comparing sequences between read pairs and does not require complete genome sequences as prerequisites. FastUniq is capable of simultaneously handling reads with different lengths and results in highly efficient running time, which increases linearly at an average speed of 87 million reads per 10 minutes. FastUniq is freely available at http://sourceforge.net/projects/fastuniq/.
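Reference-free removal of duplicated read pairs, as outlined in this abstract, can be sketched as follows. The sketch is an assumption-laden illustration: it treats a pair as a duplicate only when both mates are exactly identical to an earlier pair, and it uses a hash set for the comparison, which is a substitute chosen for brevity and is not claimed to be FastUniq's own algorithm.

```python
from typing import Iterable, List, Set, Tuple

ReadPair = Tuple[str, str]  # (mate 1 sequence, mate 2 sequence)


def dedup_read_pairs(pairs: Iterable[ReadPair]) -> List[ReadPair]:
    """Keep the first occurrence of each distinct (mate1, mate2) sequence pair."""
    seen: Set[ReadPair] = set()
    unique: List[ReadPair] = []
    for mate1, mate2 in pairs:
        key = (mate1, mate2)
        # Both mates must match an earlier pair for this one to be a duplicate;
        # no reference genome or alignment is involved.
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


pairs = [
    ("ACGT", "TTGA"),
    ("ACGT", "TTGA"),  # exact duplicate pair, removed
    ("ACGT", "TTGC"),  # same mate 1 but different mate 2, kept
    ("GGGG", "TTGA"),
]
print(dedup_read_pairs(pairs))
```

Requiring both mates to match is what makes paired-end data more robust to false duplicate calls than single reads, since two independent fragments rarely share both ends.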
Affiliation(s)
- Haibin Xu
- The National Engineering Laboratory for Breeding of Endangered Medicinal Materials, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, People’s Republic of China
| | - Xiang Luo
- The National Engineering Laboratory for Breeding of Endangered Medicinal Materials, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, People’s Republic of China
| | - Jun Qian
- The National Engineering Laboratory for Breeding of Endangered Medicinal Materials, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, People’s Republic of China
| | - Xiaohui Pang
- The National Engineering Laboratory for Breeding of Endangered Medicinal Materials, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, People’s Republic of China
| | - Jingyuan Song
- The National Engineering Laboratory for Breeding of Endangered Medicinal Materials, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, People’s Republic of China
| | - Guangrui Qian
- Department of Geosciences, Stony Brook University, Stony Brook, New York, United States of America
| | - Jinhui Chen
- Key Laboratory of Forest Genetics and Biotechnology, Ministry of Education of China, Nanjing Forestry University, Nanjing, Jiangsu Province, China
| | - Shilin Chen
- The National Engineering Laboratory for Breeding of Endangered Medicinal Materials, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, People’s Republic of China
|
24
|
Veeneman BA, Iyer MK, Chinnaiyan AM. Oculus: faster sequence alignment by streaming read compression. BMC Bioinformatics 2012; 13:297. [PMID: 23148484 PMCID: PMC3534618 DOI: 10.1186/1471-2105-13-297] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 04/10/2012] [Accepted: 11/01/2012] [Indexed: 01/17/2023] Open
Abstract
Background Despite significant advancement in alignment algorithms, the exponential growth of nucleotide sequencing throughput threatens to outpace bioinformatic analysis. Computation may become the bottleneck of genome analysis if growing alignment costs are not mitigated by further improvement in algorithms. Much gain has been gleaned from indexing and compressing alignment databases, but many widely used alignment tools process input reads sequentially and are oblivious to any underlying redundancy in the reads themselves. Results Here we present Oculus, a software package that attaches to standard aligners and exploits read redundancy by performing streaming compression, alignment, and decompression of input sequences. This nearly lossless process (> 99.9%) led to alignment speedups of up to 270% across a variety of data sets, while requiring a modest amount of memory. We expect that streaming read compressors such as Oculus could become a standard addition to existing RNA-Seq and ChIP-Seq alignment pipelines, and potentially other applications in the future as throughput increases. Conclusions Oculus efficiently condenses redundant input reads and wraps existing aligners to provide nearly identical SAM output in a fraction of the aligner runtime. It includes a number of useful features, such as tunable performance and fidelity options, compatibility with FASTA or FASTQ files, and adherence to the SAM format. The platform-independent C++ source code is freely available online, at http://code.google.com/p/oculus-bio.
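The compress, align, decompress idea in this abstract can be sketched as a wrapper around any aligner. The sketch below is hypothetical throughout: `align_fn` stands in for a real aligner call, the function name is invented for illustration, and the sketch works on an in-memory list rather than streaming as Oculus is described to do.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

Alignment = TypeVar("Alignment")


def align_with_collapse(reads: Sequence[str],
                        align_fn: Callable[[str], Alignment]) -> List[Alignment]:
    """Align each distinct read sequence once, then expand results to all reads."""
    # Compression: map each distinct sequence to a slot in the unique list.
    slot_of: Dict[str, int] = {}
    unique: List[str] = []
    for seq in reads:
        if seq not in slot_of:
            slot_of[seq] = len(unique)
            unique.append(seq)

    # Alignment: the (expensive) aligner only sees distinct sequences.
    results = [align_fn(seq) for seq in unique]

    # Decompression: every input read receives the result of its sequence,
    # in the original input order.
    return [results[slot_of[seq]] for seq in reads]
```

The saving grows with the redundancy of the input, which is why the abstract singles out RNA-Seq and ChIP-Seq data, where many reads are identical.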
Affiliation(s)
- Brendan A Veeneman
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI 48109, USA
|
25
|
Lehnert EM, Burriesci MS, Pringle JR. Developing the anemone Aiptasia as a tractable model for cnidarian-dinoflagellate symbiosis: the transcriptome of aposymbiotic A. pallida. BMC Genomics 2012; 13:271. [PMID: 22726260 PMCID: PMC3427133 DOI: 10.1186/1471-2164-13-271] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.8] [Received: 02/03/2012] [Accepted: 06/22/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Coral reefs are hotspots of oceanic biodiversity, forming the foundation of ecosystems that are important both ecologically and for their direct practical impacts on humans. Corals are declining globally due to a number of stressors, including rising sea-surface temperatures and pollution; such stresses can lead to a breakdown of the essential symbiotic relationship between the coral host and its endosymbiotic dinoflagellates, a process known as coral bleaching. Although the environmental stresses causing this breakdown are largely known, the cellular mechanisms of symbiosis establishment, maintenance, and breakdown are still largely obscure. Investigating the symbiosis using an experimentally tractable model organism, such as the small sea anemone Aiptasia, should improve our understanding of exactly how the environmental stressors affect coral survival and growth. RESULTS We assembled the transcriptome of a clonal population of adult, aposymbiotic (dinoflagellate-free) Aiptasia pallida from ~208 million reads, yielding 58,018 contigs. We demonstrated that many of these contigs represent full-length or near-full-length transcripts that encode proteins similar to those from a diverse array of pathways in other organisms, including various metabolic enzymes, cytoskeletal proteins, and neuropeptide precursors. The contigs were annotated by sequence similarity, assigned GO terms, and scanned for conserved protein domains. We analyzed the frequency and types of single-nucleotide variants and estimated the size of the Aiptasia genome to be ~421 Mb. The contigs and annotations are available through NCBI (Transcriptome Shotgun Assembly database, accession numbers JV077153-JV134524) and at http://pringlelab.stanford.edu/projects.html. CONCLUSIONS The availability of an extensive transcriptome assembly for A. pallida will facilitate analyses of gene-expression changes, identification of proteins of interest, and other studies in this important emerging model system.
Affiliation(s)
- Erik M Lehnert
- Department of Genetics, Stanford University School of Medicine, CA 94025, USA.
|