1. Warren RL, Abraham R, Calingo M, Garant JM, Jones SJM, Birol I. Establishing association between HLA-C*04:01 and severe COVID-19. HLA 2024; 103:e15355. [PMID: 38273454 DOI: 10.1111/tan.15355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Revised: 01/05/2024] [Accepted: 01/09/2024] [Indexed: 01/27/2024]
Affiliation(s)
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
- Rohan Abraham
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
- Marc Calingo
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
- Jean-Michel Garant
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
  - Canadian Centre for Computational Genomics, McGill University, Montréal, Québec, Canada
- Steven J M Jones
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
- Inanc Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
2. Lo T, Coombe L, Gagalova KK, Marr A, Warren RL, Kirk H, Pandoh P, Zhao Y, Moore RA, Mungall AJ, Ritland C, Pavy N, Jones SJM, Bohlmann J, Bousquet J, Birol I, Thomson A. Assembly and annotation of the black spruce genome provide insights on spruce phylogeny and evolution of stress response. G3 (Bethesda) 2023; 14:jkad247. [PMID: 37875130 PMCID: PMC10755193 DOI: 10.1093/g3journal/jkad247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Revised: 05/17/2023] [Accepted: 10/09/2023] [Indexed: 10/26/2023]
Abstract
Black spruce (Picea mariana [Mill.] B.S.P.) is a dominant conifer species in the North American boreal forest that plays important ecological and economic roles. Here, we present the first genome assembly of P. mariana with a reconstructed genome size of 18.3 Gbp and NG50 scaffold length of 36.0 kbp. A total of 66,332 protein-coding sequences were predicted in silico and annotated based on sequence homology. We analyzed the evolutionary relationships between P. mariana and 5 other spruces for which complete nuclear and organelle genome sequences were available. The phylogenetic tree estimated from mitochondrial genome sequences agrees with biogeography; specifically, P. mariana was strongly supported as a sister lineage to P. glauca and 3 other taxa found in western North America, followed by the European Picea abies. We obtained mixed topologies with weaker statistical support in phylogenetic trees estimated from nuclear and chloroplast genome sequences, indicative of ancient reticulate evolution affecting these 2 genomes. Clustering of protein-coding sequences from the 6 Picea taxa and 2 Pinus species resulted in 34,776 orthogroups, 560 of which appeared to be specific to P. mariana. Analysis of these specific orthogroups and dN/dS analysis of positive selection signatures for 497 single-copy orthogroups identified gene functions mostly related to plant development and stress response. The P. mariana genome assembly and annotation provides a valuable resource for forest genetics research and applications in this broadly distributed species, especially in relation to climate adaptation.
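The NG50 scaffold length quoted in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the estimated genome size. The function name `ng50` and the toy scaffold lengths are illustrative only and are not taken from the paper.

```python
def ng50(scaffold_lengths, genome_size):
    """Return NG50 under the assumed definition: walk scaffolds from
    longest to shortest and report the length at which the running
    total first reaches half of the estimated genome size."""
    covered = 0
    for length in sorted(scaffold_lengths, reverse=True):
        covered += length
        # Compare 2 * covered with genome_size to avoid fractional halves.
        if 2 * covered >= genome_size:
            return length
    # The assembly spans less than half of the estimated genome size.
    return 0


# Toy example: the two longest scaffolds (80 + 70) reach half of 300.
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_ng50 = ng50(toy_lengths, genome_size=300)
```

Unlike N50, which uses the total assembly length as the denominator, this statistic depends on an external genome size estimate, so two assemblies of the same genome can be compared on the same scale.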
Affiliation(s)
- Theodora Lo
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Lauren Coombe
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Kristina K Gagalova
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Alex Marr
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- René L Warren
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Heather Kirk
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Pawan Pandoh
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Yongjun Zhao
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Richard A Moore
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Andrew J Mungall
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Carol Ritland
  - Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
  - Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Nathalie Pavy
  - Canada Research Chair in Forest Genomics, Laval University, Quebec City, QC G1V 0A6, Canada
- Steven J M Jones
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Joerg Bohlmann
  - Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
  - Michael Smith Laboratories, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
  - Department of Botany, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Jean Bousquet
  - Canada Research Chair in Forest Genomics, Laval University, Quebec City, QC G1V 0A6, Canada
- Inanç Birol
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Ashley Thomson
  - Faculty of Natural Resources Management, Lakehead University, Thunder Bay, ON P7B 5E1, Canada
3. Wong J, Kazemi P, Coombe L, Warren RL, Birol I. aaHash: recursive amino acid sequence hashing. Bioinform Adv 2023; 3:vbad162. [PMID: 38023332 PMCID: PMC10660294 DOI: 10.1093/bioadv/vbad162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Revised: 10/13/2023] [Accepted: 11/08/2023] [Indexed: 12/01/2023]
Abstract
Motivation: K-mer hashing is a common operation in many foundational bioinformatics problems. However, generic string hashing algorithms are not optimized for this application. Strings in bioinformatics use specific alphabets, a trait leveraged for nucleic acid sequences in earlier work. We note that amino acid sequences, with complexities and context that cannot be captured by generic hashing algorithms, can also benefit from a domain-specific hashing algorithm. Such a hashing algorithm can accelerate and improve the sensitivity of bioinformatics applications developed for protein sequences.
Results: Here, we present aaHash, a recursive hashing algorithm tailored for amino acid sequences. This algorithm utilizes multiple hash levels to represent biochemical similarities between amino acids. aaHash performs ∼10× faster than generic string hashing algorithms in hashing adjacent k-mers.
Availability and implementation: aaHash is available online at https://github.com/bcgsc/btllib and is free for academic use.
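To illustrate why a recursive scheme helps when hashing adjacent k-mers, here is a minimal sketch of a generic polynomial rolling hash. It is not the aaHash algorithm and does not use btllib; the constants `BASE` and `MOD` and both function names are assumptions chosen for illustration. The point it shows is that each new k-mer hash is derived from the previous one in constant time instead of rehashing all k characters.

```python
BASE = 31              # illustrative multiplier (assumption)
MOD = (1 << 61) - 1    # illustrative large prime modulus (assumption)


def direct_kmer_hash(kmer):
    """Hash one k-mer from scratch in O(k) time."""
    h = 0
    for ch in kmer:
        h = (h * BASE + ord(ch)) % MOD
    return h


def rolling_kmer_hashes(seq, k):
    """Hash every k-mer of seq, updating each hash from the previous one."""
    if k <= 0 or len(seq) < k:
        return []
    top = pow(BASE, k - 1, MOD)      # weight of the character leaving the window
    h = direct_kmer_hash(seq[:k])    # only the first k-mer is hashed from scratch
    hashes = [h]
    for i in range(k, len(seq)):
        h = (h - ord(seq[i - k]) * top) % MOD   # remove the outgoing character
        h = (h * BASE + ord(seq[i])) % MOD      # shift and add the incoming character
        hashes.append(h)
    return hashes


protein = "MKVLAAGIVG"  # toy amino acid sequence
hashes = rolling_kmer_hashes(protein, 3)
```

The rolled values are identical to hashing each k-mer independently, which makes the direct function a convenient correctness check for the recursive update.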
Affiliation(s)
- Johnathan Wong
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Parham Kazemi
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Lauren Coombe
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- René L Warren
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Inanç Birol
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
4. Li JX, Fernandez KX, Ritland C, Jancsik S, Engelhardt DB, Coombe L, Warren RL, van Belkum MJ, Carroll AL, Vederas JC, Bohlmann J, Birol I. Genomic virulence features of Beauveria bassiana as a biocontrol agent for the mountain pine beetle population. BMC Genomics 2023; 24:390. [PMID: 37430186 DOI: 10.1186/s12864-023-09473-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 06/21/2023] [Indexed: 07/12/2023] Open
Abstract
BACKGROUND: The mountain pine beetle, Dendroctonus ponderosae, is an irruptive bark beetle that causes extensive mortality to many pine species within the forests of western North America. Driven by climate change and wildfire suppression, a recent mountain pine beetle (MPB) outbreak has spread across more than 18 million hectares, including areas to the east of the Rocky Mountains that comprise populations and species of pines not previously affected. Despite its impacts, there are few tactics available to control MPB populations. Beauveria bassiana is an entomopathogenic fungus used as a biological agent in agriculture and forestry and has potential as a management tactic for the mountain pine beetle population. This work investigates the phenotypic and genomic variation between B. bassiana strains to identify optimal strains against a specific insect.
RESULTS: Using comparative genome and transcriptome analyses of eight B. bassiana isolates, we have identified the genetic basis of virulence, which includes oosporein production. Genes unique to the more virulent strains included functions in biosynthesis of mycotoxins, membrane transporters, and transcription factors. Significant differential expression of genes related to virulence, transmembrane transport, and stress response was identified between the different strains, as well as up to nine-fold upregulation of genes involved in the biosynthesis of oosporein. Differential correlation analysis revealed transcription factors that may be involved in regulating oosporein production.
CONCLUSION: This study provides a foundation for the selection and/or engineering of the most effective strain of B. bassiana for the biological control of mountain pine beetle and other insect pest populations.
Affiliation(s)
- Janet X Li
  - Michael Smith Laboratories, University of British Columbia, 2185 East Mall, Vancouver, BC, V6T 1Z4, Canada
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 W 7th Ave #100, Vancouver, BC, V5Z 4S6, Canada
- Kleinberg X Fernandez
  - Department of Chemistry, University of Alberta, 11227 Saskatchewan Drive NW, Edmonton, AB, T6G 2G2, Canada
- Carol Ritland
  - Michael Smith Laboratories, University of British Columbia, 2185 East Mall, Vancouver, BC, V6T 1Z4, Canada
  - Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Sharon Jancsik
  - Michael Smith Laboratories, University of British Columbia, 2185 East Mall, Vancouver, BC, V6T 1Z4, Canada
- Daniel B Engelhardt
  - Department of Chemistry, University of Alberta, 11227 Saskatchewan Drive NW, Edmonton, AB, T6G 2G2, Canada
- Lauren Coombe
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 W 7th Ave #100, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 W 7th Ave #100, Vancouver, BC, V5Z 4S6, Canada
- Marco J van Belkum
  - Department of Chemistry, University of Alberta, 11227 Saskatchewan Drive NW, Edmonton, AB, T6G 2G2, Canada
- Allan L Carroll
  - Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- John C Vederas
  - Department of Chemistry, University of Alberta, 11227 Saskatchewan Drive NW, Edmonton, AB, T6G 2G2, Canada
- Joerg Bohlmann
  - Michael Smith Laboratories, University of British Columbia, 2185 East Mall, Vancouver, BC, V6T 1Z4, Canada
  - Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
  - Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Inanc Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 W 7th Ave #100, Vancouver, BC, V5Z 4S6, Canada
5. Wong J, Coombe L, Nikolić V, Zhang E, Nip KM, Sidhu P, Warren RL, Birol I. Linear time complexity de novo long read genome assembly with GoldRush. Nat Commun 2023; 14:2906. [PMID: 37217507 DOI: 10.1038/s41467-023-38716-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 11/10/2022] [Accepted: 05/11/2023] [Indexed: 05/24/2023] Open
Abstract
Current state-of-the-art de novo long read genome assemblers follow the Overlap-Layout-Consensus paradigm. While read-to-read overlap - its most costly step - was improved in modern long read genome assemblers, these tools still often require excessive RAM when assembling a typical human dataset. Our work departs from this paradigm, foregoing all-vs-all sequence alignments in favor of a dynamic data structure implemented in GoldRush, a de novo long read genome assembly algorithm with linear time complexity. We tested GoldRush on Oxford Nanopore Technologies long sequencing read datasets with different base error profiles sourced from three human cell lines, rice, and tomato. Here, we show that GoldRush achieves assembly scaffold NGA50 lengths of 18.3-22.2, 0.3 and 2.6 Mbp, for the genomes of human, rice, and tomato, respectively, and assembles each genome within a day, using at most 54.5 GB of random-access memory, demonstrating the scalability of our genome assembly paradigm and its implementation.
Affiliation(s)
- Johnathan Wong
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Lauren Coombe
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Vladimir Nikolić
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Emily Zhang
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Ka Ming Nip
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Puneet Sidhu
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Inanç Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
6. Nip KM, Hafezqorani S, Gagalova KK, Chiu R, Yang C, Warren RL, Birol I. Reference-free assembly of long-read transcriptome sequencing data with RNA-Bloom2. Nat Commun 2023; 14:2940. [PMID: 37217540 PMCID: PMC10202958 DOI: 10.1038/s41467-023-38553-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2022] [Accepted: 05/08/2023] [Indexed: 05/24/2023] Open
Abstract
Long-read sequencing technologies have improved significantly since their emergence. Their read lengths, potentially spanning entire transcripts, are advantageous for reconstructing transcriptomes. Existing long-read transcriptome assembly methods are primarily reference-based and to date, there is little focus on reference-free transcriptome assembly. We introduce RNA-Bloom2 (https://github.com/bcgsc/RNA-Bloom), a reference-free assembly method for long-read transcriptome sequencing data. Using simulated datasets and spike-in control data, we show that the transcriptome assembly quality of RNA-Bloom2 is competitive with that of reference-based methods. Furthermore, we find that RNA-Bloom2 requires 27.0 to 80.6% of the peak memory and 3.6 to 10.8% of the total wall-clock runtime of a competing reference-free method. Finally, we showcase RNA-Bloom2 in assembling a transcriptome sample of Picea sitchensis (Sitka spruce). Since our method does not rely on a reference, it further sets the groundwork for large-scale comparative transcriptomics where high-quality draft genome assemblies are not readily available.
Affiliation(s)
- Ka Ming Nip
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, V5Z 4S6, Canada
- Saber Hafezqorani
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, V5Z 4S6, Canada
- Kristina K Gagalova
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, V5Z 4S6, Canada
- Readman Chiu
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Chen Yang
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Inanc Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada
7. Wong J, Kazemi P, Coombe L, Warren RL, Birol I. aaHash: recursive amino acid sequence hashing. bioRxiv 2023:2023.05.08.539909. [PMID: 37214907 PMCID: PMC10197579 DOI: 10.1101/2023.05.08.539909] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2023]
Abstract
Motivation: K-mer hashing is a common operation in many foundational bioinformatics problems. However, generic string hashing algorithms are not optimized for this application. Strings in bioinformatics use specific alphabets, a trait leveraged for nucleic acid sequences in earlier work. We note that amino acid sequences, with complexities and context that cannot be captured by generic hashing algorithms, can also benefit from a domain-specific hashing algorithm. Such a hashing algorithm can accelerate and improve the sensitivity of bioinformatics applications developed for protein sequences.
Results: Here, we present aaHash, a recursive hashing algorithm tailored for amino acid sequences. This algorithm utilizes multiple hash levels to represent biochemical similarities between amino acids. aaHash performs ~10X faster than generic string hashing algorithms in hashing adjacent k-mers.
Availability and implementation: aaHash is available online at https://github.com/bcgsc/btllib and is free for academic use.
Affiliation(s)
- Johnathan Wong
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Parham Kazemi
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Lauren Coombe
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- René L. Warren
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Inanç Birol
  - Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
8. Yoo S, Garg E, Elliott LT, Hung RJ, Halevy AR, Brooks JD, Bull SB, Gagnon F, Greenwood C, Lawless JF, Paterson AD, Sun L, Zawati MH, Lerner-Ellis J, Abraham R, Birol I, Bourque G, Garant JM, Gosselin C, Li J, Whitney J, Thiruvahindrapuram B, Herbrick JA, Lorenti M, Reuter MS, Adeoye OO, Liu S, Allen U, Bernier FP, Biggs CM, Cheung AM, Cowan J, Herridge M, Maslove DM, Modi BP, Mooser V, Morris SK, Ostrowski M, Parekh RS, Pfeffer G, Suchowersky O, Taher J, Upton J, Warren RL, Yeung R, Aziz N, Turvey SE, Knoppers BM, Lathrop M, Jones S, Scherer SW, Strug LJ. HostSeq: a Canadian whole genome sequencing and clinical data resource. BMC Genom Data 2023; 24:26. [PMID: 37131148 PMCID: PMC10152008 DOI: 10.1186/s12863-023-01128-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Accepted: 02/22/2023] [Indexed: 05/04/2023] Open
Abstract
HostSeq was launched in April 2020 as a national initiative to integrate whole genome sequencing data from 10,000 Canadians infected with SARS-CoV-2 with clinical information related to their disease experience. The mandate of HostSeq is to support the Canadian and international research communities in their efforts to understand the risk factors for disease and associated health outcomes and support the development of interventions such as vaccines and therapeutics. HostSeq is a collaboration among 13 independent epidemiological studies of SARS-CoV-2 across five provinces in Canada. Aggregated data collected by HostSeq are made available to the public through two data portals: a phenotype portal showing summaries of major variables and their distributions, and a variant search portal enabling queries in a genomic region. Individual-level data is available to the global research community for health research through a Data Access Agreement and Data Access Compliance Office approval. Here we provide an overview of the collective project design along with summary level information for HostSeq. We highlight several statistical considerations for researchers using the HostSeq platform regarding data aggregation, sampling mechanism, covariate adjustment, and X chromosome analysis. In addition to serving as a rich data source, the diversity of study designs, sample sizes, and research objectives among the participating studies provides unique opportunities for the research community.
Affiliation(s)
- S Yoo
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Ottawa, Ottawa, ON, Canada
- E Garg
  - Simon Fraser University, Burnaby, BC, Canada
- L T Elliott
  - Simon Fraser University, Burnaby, BC, Canada
- R J Hung
  - University of Toronto, Toronto, ON, Canada
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
- A R Halevy
  - The Hospital for Sick Children, Toronto, ON, Canada
- J D Brooks
  - University of Toronto, Toronto, ON, Canada
- S B Bull
  - University of Toronto, Toronto, ON, Canada
  - Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, Canada
- F Gagnon
  - University of Toronto, Toronto, ON, Canada
- C M T Greenwood
  - McGill University, Montreal, QC, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- J F Lawless
  - University of Waterloo, Waterloo, ON, Canada
- A D Paterson
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- L Sun
  - University of Toronto, Toronto, ON, Canada
- J Lerner-Ellis
  - University of Toronto, Toronto, ON, Canada
  - Sinai Health System, Toronto, ON, Canada
- R J S Abraham
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- I Birol
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- G Bourque
  - McGill University, Montreal, QC, Canada
- J-M Garant
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- C Gosselin
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- J Li
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- J Whitney
  - The Hospital for Sick Children, Toronto, ON, Canada
- J-A Herbrick
  - The Hospital for Sick Children, Toronto, ON, Canada
- M Lorenti
  - The Hospital for Sick Children, Toronto, ON, Canada
- M S Reuter
  - The Hospital for Sick Children, Toronto, ON, Canada
- O O Adeoye
  - The Hospital for Sick Children, Toronto, ON, Canada
- S Liu
  - The Hospital for Sick Children, Toronto, ON, Canada
- U Allen
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- F P Bernier
  - University of Calgary, Calgary, AB, Canada
  - Alberta Children's Hospital, Calgary, AB, Canada
- C M Biggs
  - University of British Columbia, Vancouver, BC, Canada
  - BC Children's Hospital, Vancouver, BC, Canada
  - St. Paul's Hospital, Vancouver, BC, Canada
- A M Cheung
  - University Health Network, Toronto, ON, Canada
- J Cowan
  - University of Ottawa, Ottawa, ON, Canada
  - The Ottawa Hospital Research Institute, Ottawa, ON, Canada
- M Herridge
  - University Health Network, Toronto, ON, Canada
- B P Modi
  - BC Children's Hospital, Vancouver, BC, Canada
- V Mooser
  - McGill University, Montreal, QC, Canada
- S K Morris
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- M Ostrowski
  - University of Toronto, Toronto, ON, Canada
  - St. Michael's Hospital, Unity Health, Toronto, ON, Canada
- R S Parekh
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
  - Women's College Hospital, Toronto, ON, Canada
- G Pfeffer
  - University of Calgary, Calgary, AB, Canada
- J Taher
  - University of Toronto, Toronto, ON, Canada
  - Sinai Health System, Toronto, ON, Canada
- J Upton
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- R L Warren
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- R S M Yeung
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- N Aziz
  - The Hospital for Sick Children, Toronto, ON, Canada
- S E Turvey
  - University of British Columbia, Vancouver, BC, Canada
  - BC Children's Hospital, Vancouver, BC, Canada
- M Lathrop
  - McGill University, Montreal, QC, Canada
- S J M Jones
  - Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, Canada
- S W Scherer
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
- L J Strug
  - The Hospital for Sick Children, Toronto, ON, Canada
  - University of Toronto, Toronto, ON, Canada
9. Coombe L, Warren RL, Wong J, Nikolic V, Birol I. ntLink: A Toolkit for De Novo Genome Assembly Scaffolding and Mapping Using Long Reads. Curr Protoc 2023; 3:e733. [PMID: 37039735 PMCID: PMC10091225 DOI: 10.1002/cpz1.733] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/12/2023]
Abstract
With the increasing affordability and accessibility of genome sequencing data, de novo genome assembly is an important first step to a wide variety of downstream studies and analyses. Therefore, bioinformatics tools that enable the generation of high-quality genome assemblies in a computationally efficient manner are essential. Recent developments in long-read sequencing technologies have greatly benefited genome assembly work, including scaffolding, by providing long-range evidence that can aid in resolving the challenging repetitive regions of complex genomes. ntLink is a flexible and resource-efficient genome scaffolding tool that utilizes long-read sequencing data to improve upon draft genome assemblies built from any sequencing technologies, including the same long reads. Instead of using read alignments to identify candidate joins, ntLink utilizes minimizer-based mappings to infer how input sequences should be ordered and oriented into scaffolds. Recent improvements to ntLink have added important features such as overlap detection, gap-filling, and in-code scaffolding iterations. Here, we present three basic protocols demonstrating how to use each of these new features to yield highly contiguous genome assemblies, while still maintaining ntLink's proven computational efficiency. Further, as we illustrate in the alternate protocols, the lightweight minimizer-based mappings that enable ntLink scaffolding can also be utilized for other downstream applications, such as misassembly detection. With its modularity and multiple modes of execution, ntLink has broad benefit to the genomics community, from genome scaffolding and beyond. ntLink is an open-source project and is freely available from https://github.com/bcgsc/ntLink. © 2023 The Authors. Current Protocols published by Wiley Periodicals LLC.
- Basic Protocol 1: ntLink scaffolding using overlap detection
- Basic Protocol 2: ntLink scaffolding with gap-filling
- Basic Protocol 3: Running in-code iterations of ntLink scaffolding
- Alternate Protocol 1: Generating long-read to contig mappings with ntLink
- Alternate Protocol 2: Using ntLink mappings for genome assembly correction with Tigmint-long
- Support Protocol: Installing ntLink
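The minimizer-based mappings mentioned in the abstract above rest on a simple idea that can be sketched in a few lines. This is a generic window-minimizer illustration, not ntLink's implementation: the function name `minimizers`, the parameters `k` and `w`, and the use of lexicographic order in place of a hash-based order are all assumptions made to keep the example deterministic. Two sequences that share a long stretch will select the same minimizers from it, which is what lets a lightweight mapping stand in for a full alignment.

```python
def minimizers(seq, k, w):
    """Return (position, k-mer) pairs: for each window of w consecutive
    k-mers, keep the smallest k-mer (ties go to the leftmost position).
    Lexicographic order is an illustrative stand-in for a hash order."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Index of the smallest k-mer in this window, leftmost on ties.
        offset = min(range(w), key=lambda j: (window[j], j))
        pos = start + offset
        # Adjacent windows often agree; record each position only once.
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kmers[pos]))
    return picked


# Toy example on a short nucleotide string.
sketch = minimizers("ACGTACGT", k=3, w=3)
```

Because only a subset of k-mers is kept, the resulting sketch is much smaller than the full k-mer set while still marking shared regions between a long read and a contig.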
Affiliation(s)
- Lauren Coombe
  - Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7 Ave, Vancouver, BC V5Z 4S6, 604-707-5900
- René L. Warren
  - Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7 Ave, Vancouver, BC V5Z 4S6, 604-707-5900
- Johnathan Wong
  - Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7 Ave, Vancouver, BC V5Z 4S6, 604-707-5900
- Vladimir Nikolic
  - Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7 Ave, Vancouver, BC V5Z 4S6, 604-707-5900
- Inanc Birol
  - Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7 Ave, Vancouver, BC V5Z 4S6, 604-707-5900
10. Yang C, Lo T, Nip KM, Hafezqorani S, Warren RL, Birol I. Characterization and simulation of metagenomic nanopore sequencing data with Meta-NanoSim. Gigascience 2023; 12:giad013. [PMID: 36939007 PMCID: PMC10025935 DOI: 10.1093/gigascience/giad013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2022] [Revised: 01/19/2023] [Accepted: 02/17/2023] [Indexed: 03/21/2023] Open
Abstract
BACKGROUND: Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, sequencing platform-specific challenges, including high base-call error rate, nonuniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical algorithms. The use of simulated datasets with characteristics that are true to the sequencing platform under evaluation is a cost-effective way to assess the performance of bioinformatics tools with the ground truth in a controlled environment.
RESULTS: Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. It improves upon state-of-the-art methods on microbial abundance estimation through a base-level quantification algorithm. Meta-NanoSim can simulate complex microbial communities composed of both linear and circular genomes and can stream reference genomes from online servers directly. Simulated datasets showed high congruence with experimental data in terms of read length, error profiles, and abundance levels. We demonstrate that Meta-NanoSim simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenome assembly benchmarking task.
CONCLUSIONS: The Meta-NanoSim characterization module investigates read features, including chimeric information and abundance levels, while the simulation module simulates large and complex multisample microbial communities with different abundance profiles. All trained models and the software are freely accessible at GitHub: https://github.com/bcgsc/NanoSim.
Affiliation(s)
- Chen Yang
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
| | - Theodora Lo
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
| | - Ka Ming Nip
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
| | - Saber Hafezqorani
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Department of Medical Genetics, University of British Columbia, Life Sciences Centre Room 1364 – 2350 Health Science Mall Vancouver, BC V6T 1Z3, Canada
|
11
|
Li C, Warren RL, Birol I. Models and data of AMPlify: a deep learning tool for antimicrobial peptide prediction. BMC Res Notes 2023; 16:11. [PMID: 36732807 PMCID: PMC9896668 DOI: 10.1186/s13104-023-06279-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2022] [Accepted: 01/24/2023] [Indexed: 02/04/2023]
Abstract
OBJECTIVES: Antibiotic resistance is a rising global threat to human health and is prompting researchers to seek effective alternatives to conventional antibiotics, which include antimicrobial peptides (AMPs). Recently, we have reported AMPlify, an attentive deep learning model for predicting AMPs in databases of peptide sequences. In our tests, AMPlify outperformed the state-of-the-art. We have illustrated its use on data describing the American bullfrog (Rana [Lithobates] catesbeiana) genome. Here we present the model files and training/test data sets we used in that study. The original model (the balanced model) was trained on a balanced set of AMP and non-AMP sequences curated from public databases. In this data note, we additionally provide a model trained on an imbalanced set, in which non-AMP sequences far outnumber AMP sequences. We note that the balanced and imbalanced models would serve different use cases, and both would serve the research community, facilitating the discovery and development of novel AMPs. DATA DESCRIPTION: This data note provides two sets of models, as well as two AMP and four non-AMP sequence sets for training and testing the balanced and imbalanced models. Each model set includes five single sub-models that form an ensemble model. The first model set corresponds to the original model trained on a balanced training set that has been described in the original AMPlify manuscript, while the second model set was trained on an imbalanced training set.
Affiliation(s)
- Chenkai Li
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
- Public Health Laboratory, British Columbia Centre for Disease Control, Vancouver, BC, V5Z 4R4, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada
|
12
|
Shalev TJ, Gamal El-Dien O, Yuen MM, Shengqiang S, Jackman SD, Warren RL, Coombe L, van der Merwe L, Stewart A, Boston LB, Plott C, Jenkins J, He G, Yan J, Yan M, Guo J, Breinholt JW, Neves LG, Grimwood J, Rieseberg LH, Schmutz J, Birol I, Kirst M, Yanchuk AD, Ritland C, Russell JH, Bohlmann J. The western redcedar genome reveals low genetic diversity in a self-compatible conifer. Genome Res 2022; 32:1952-1964. [DOI: 10.1101/gr.276358.121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2021] [Accepted: 09/06/2022] [Indexed: 11/24/2022]
Abstract
We assembled the 9.8 Gbp genome of western redcedar (WRC, Thuja plicata), an ecologically and economically important conifer species of the Cupressaceae. The genome assembly, derived from a uniquely inbred tree produced through five generations of self-fertilization (selfing), was determined to be 86% complete by BUSCO analysis - one of the most complete genome assemblies for a conifer. Population genomic analysis revealed WRC to be one of the most genetically depauperate wild plant species, with an effective population size of approximately 300 and no significant genetic differentiation across its geographic range. Nucleotide diversity, π, is low for a continuous tree species, with many loci exhibiting zero diversity, and the ratio of π at zero- to four-fold degenerate sites is relatively high (~ 0.33), suggestive of weak purifying selection. Using an array of genetic lines derived from up to five generations of selfing, we explored the relationship between genetic diversity and mating system. While overall heterozygosity was found to decline faster than expected during selfing, heterozygosity persisted at many loci, and nearly 100 loci were found to deviate from expectations of genetic drift, suggestive of associative overdominance. Nonreference alleles at such loci often harbor deleterious mutations and are rare in natural populations, implying that balanced polymorphisms are maintained by linkage to dominant beneficial alleles. This may account for how WRC remains responsive to natural and artificial selection, despite low genetic diversity.
|
13
|
Gagalova KK, Warren RL, Coombe L, Wong J, Nip KM, Yuen MMS, Whitehill JGA, Celedon JM, Ritland C, Taylor GA, Cheng D, Plettner P, Hammond SA, Mohamadi H, Zhao Y, Moore RA, Mungall AJ, Boyle B, Laroche J, Cottrell J, Mackay JJ, Lamothe M, Gérardi S, Isabel N, Pavy N, Jones SJM, Bohlmann J, Bousquet J, Birol I. Spruce giga-genomes: structurally similar yet distinctive with differentially expanding gene families and rapidly evolving genes. Plant J 2022; 111:1469-1485. [PMID: 35789009 DOI: 10.1111/tpj.15889] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/11/2021] [Revised: 06/22/2022] [Accepted: 06/27/2022] [Indexed: 06/15/2023]
Abstract
Spruces (Picea spp.) are coniferous trees widespread in boreal and mountainous forests of the northern hemisphere, with large economic significance and enormous contributions to global carbon sequestration. Spruces harbor very large genomes with high repetitiveness, hampering their comparative analysis. Here, we present and compare the genomes of four different North American spruces: the genome assemblies for Engelmann spruce (Picea engelmannii) and Sitka spruce (Picea sitchensis) together with improved and more contiguous genome assemblies for white spruce (Picea glauca) and for a naturally occurring introgress of these three species known as interior spruce (P. engelmannii × glauca × sitchensis). The genomes were structurally similar, and a large part of scaffolds could be anchored to a genetic map. The composition of the interior spruce genome indicated asymmetric contributions from the three ancestral genomes. Phylogenetic analysis of the nuclear and organelle genomes revealed a topology indicative of ancient reticulation. Different patterns of expansion of gene families among genomes were observed and related with presumed diversifying ecological adaptations. We identified rapidly evolving genes that harbored high rates of non-synonymous polymorphisms relative to synonymous ones, indicative of positive selection and its hitchhiking effects. These gene sets were mostly distinct between the genomes of ecologically contrasted species, and signatures of convergent balancing selection were detected. Stress and stimulus response was identified as the most frequent function assigned to expanding gene families and rapidly evolving genes. These two aspects of genomic evolution were complementary in their contribution to divergent evolution of presumed adaptive nature. 
These more contiguous spruce giga-genome sequences should strengthen our understanding of conifer genome structure and evolution, as their comparison offers clues into the genetic basis of adaptation and ecology of conifers at the genomic level. They will also provide tools to better monitor natural genetic diversity and improve the management of conifer forests. The genomes of four closely related North American spruces indicate that their high similarity at the morphological level is paralleled by the high conservation of their physical genome structure. Yet, the evidence of divergent evolution is apparent in their rapidly evolving genomes, supported by differential expansion of key gene families and large sets of genes under positive selection, largely in relation to stimulus and environmental stress response.
Affiliation(s)
- Kristina K Gagalova
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Lauren Coombe
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Johnathan Wong
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Ka Ming Nip
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Macaire Man Saint Yuen
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Justin G A Whitehill
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Jose M Celedon
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Carol Ritland
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Greg A Taylor
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Dean Cheng
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Patrick Plettner
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - S Austin Hammond
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
- Next-Generation Sequencing Facility, University of Saskatchewan, Saskatoon, SK, S7N 5E5, Canada
| | - Hamid Mohamadi
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Yongjun Zhao
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Richard A Moore
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Andrew J Mungall
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Brian Boyle
- Institute for Systems and Integrative Biology, Université Laval, Québec, QC, G1V 0A6, Canada
| | - Jérôme Laroche
- Institute for Systems and Integrative Biology, Université Laval, Québec, QC, G1V 0A6, Canada
| | - Joan Cottrell
- Forest Research, U.K. Forestry Commission, Northern Research Station, Roslin, EH25 9SY, Midlothian, UK
| | - John J Mackay
- Department of Plant Sciences, University of Oxford, Oxford, OX1 3RB, UK
| | - Manuel Lamothe
- Natural Resources Canada, Canadian Forest Service, Laurentian Forestry Centre, Québec, QC, G1V 4C7, Canada
| | - Sébastien Gérardi
- Institute for Systems and Integrative Biology, Université Laval, Québec, QC, G1V 0A6, Canada
- Canada Research Chair in Forest Genomics, Forest Research Centre, Université Laval, Québec, QC, G1V 0A6, Canada
| | - Nathalie Isabel
- Natural Resources Canada, Canadian Forest Service, Laurentian Forestry Centre, Québec, QC, G1V 4C7, Canada
- Canada Research Chair in Forest Genomics, Forest Research Centre, Université Laval, Québec, QC, G1V 0A6, Canada
| | - Nathalie Pavy
- Institute for Systems and Integrative Biology, Université Laval, Québec, QC, G1V 0A6, Canada
- Canada Research Chair in Forest Genomics, Forest Research Centre, Université Laval, Québec, QC, G1V 0A6, Canada
| | - Steven J M Jones
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Joerg Bohlmann
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Jean Bousquet
- Institute for Systems and Integrative Biology, Université Laval, Québec, QC, G1V 0A6, Canada
- Canada Research Chair in Forest Genomics, Forest Research Centre, Université Laval, Québec, QC, G1V 0A6, Canada
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
|
14
|
Kazemi P, Wong J, Nikolić V, Mohamadi H, Warren RL, Birol I. ntHash2: recursive spaced seed hashing for nucleotide sequences. Bioinformatics 2022; 38:4812-4813. [PMID: 36000872 PMCID: PMC9563681 DOI: 10.1093/bioinformatics/btac564] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/11/2022] [Revised: 07/21/2022] [Indexed: 11/29/2022]
Abstract
Motivation: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research. Results: ntHash2 is up to 2.1× faster at hashing various spaced seeds than the previous version and 3.8× faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism. Availability and implementation: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license. Supplementary information: Supplementary data are available at Bioinformatics online.
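To make the idea of a spaced seed concrete, the following is a minimal Python sketch of the naive approach, not the recursive ntHash2 algorithm itself. It assumes a seed written as a string of '1' (position is hashed) and '0' (position is ignored); the function names and the use of BLAKE2b as the underlying hash are illustrative choices that do not come from the paper.

```python
import hashlib
from typing import List


def spaced_seed_hash(kmer: str, seed: str) -> int:
    """Hash only the bases of `kmer` at positions where `seed` has a '1'."""
    if len(kmer) != len(seed):
        raise ValueError("k-mer and seed must have the same length")
    # Drop the don't-care positions, then hash what is left.
    masked = "".join(base for base, care in zip(kmer, seed) if care == "1")
    digest = hashlib.blake2b(masked.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def hash_sequence(seq: str, seed: str) -> List[int]:
    """Naively hash every window of `seq` with the spaced seed (no rolling)."""
    k = len(seed)
    return [spaced_seed_hash(seq[i:i + k], seed) for i in range(len(seq) - k + 1)]


# Two 5-mers that differ only at the ignored middle position hash identically,
# which is why spaced seeds tolerate isolated base mismatches.
same = spaced_seed_hash("ACGTA", "11011") == spaced_seed_hash("ACTTA", "11011")
```

This naive version recomputes each window from scratch; a recursive scheme such as the one the abstract describes avoids that repeated work.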
Affiliation(s)
- Parham Kazemi
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
- Faculty of Science, University of British Columbia, Vancouver, Canada
| | - Johnathan Wong
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Vladimir Nikolić
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | | | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Inanç Birol
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, Canada
|
15
|
Lin D, Sutherland D, Aninta SI, Louie N, Nip KM, Li C, Yanai A, Coombe L, Warren RL, Helbing CC, Hoang LMN, Birol I. Mining Amphibian and Insect Transcriptomes for Antimicrobial Peptide Sequences with rAMPage. Antibiotics (Basel) 2022; 11:antibiotics11070952. [PMID: 35884206 PMCID: PMC9312091 DOI: 10.3390/antibiotics11070952] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 06/09/2022] [Revised: 07/12/2022] [Accepted: 07/13/2022] [Indexed: 02/01/2023]
Abstract
Antibiotic resistance is a global health crisis increasing in prevalence every day. To combat this crisis, alternative antimicrobial therapeutics are urgently needed. Antimicrobial peptides (AMPs), a family of short defense proteins, are produced naturally by all organisms and hold great potential as effective alternatives to small molecule antibiotics. Here, we present rAMPage, a scalable bioinformatics discovery platform for identifying AMP sequences from RNA sequencing (RNA-seq) datasets. In our study, we demonstrate the utility and scalability of rAMPage, running it on 84 publicly available RNA-seq datasets from 75 amphibian and insect species—species known to have rich AMP repertoires. Across these datasets, we identified 1137 putative AMPs, 1024 of which were deemed novel by a homology search in cataloged AMPs in public databases. We selected 21 peptide sequences from this set for antimicrobial susceptibility testing against Escherichia coli and Staphylococcus aureus and observed that seven of them have high antimicrobial activity. Our study illustrates how in silico methods such as rAMPage can enable the fast and efficient discovery of novel antimicrobial peptides as an effective first step in the strenuous process of antimicrobial drug development.
Affiliation(s)
- Diana Lin
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC V5Z 4S6, Canada
| | - Darcy Sutherland
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC V5Z 4S6, Canada
- British Columbia Centre for Disease Control, Public Health Laboratory, Vancouver, BC V5Z 4R4, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
| | - Sambina Islam Aninta
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC V5Z 4S6, Canada
| | - Nathan Louie
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC V5Z 4S6, Canada
| | - Ka Ming Nip
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
| | - Chenkai Li
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
| | - Anat Yanai
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC V5Z 4S6, Canada
| | - Lauren Coombe
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC V5Z 4S6, Canada
| | - René L. Warren
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC V5Z 4S6, Canada
| | - Caren C. Helbing
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC V8P 5C2, Canada
| | - Linda M. N. Hoang
- British Columbia Centre for Disease Control, Public Health Laboratory, Vancouver, BC V5Z 4R4, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
| | - Inanc Birol
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, Vancouver, BC V5Z 4S6, Canada
- British Columbia Centre for Disease Control, Public Health Laboratory, Vancouver, BC V5Z 4R4, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
|
16
|
Nikolić V, Afshinfard A, Chu J, Wong J, Coombe L, Nip KM, Warren RL, Birol I. RResolver: efficient short-read repeat resolution within ABySS. BMC Bioinformatics 2022; 23:246. [PMID: 35729491 PMCID: PMC9215042 DOI: 10.1186/s12859-022-04790-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2022] [Accepted: 06/09/2022] [Indexed: 11/26/2022]
Abstract
BACKGROUND: De novo genome assembly is essential to modern genomics studies. As it is not biased by a reference, it is also a useful method for studying genomes with high variation, such as cancer genomes. De novo short-read assemblers commonly use de Bruijn graphs, where nodes are sequences of equal length k, also known as k-mers. Edges in this graph are established between nodes that overlap by k − 1 bases, and nodes along unambiguous walks in the graph are subsequently merged. The selection of k is influenced by multiple factors, and optimizing this value results in a trade-off between graph connectivity and sequence contiguity. Ideally, multiple k sizes should be used, so lower values can provide good connectivity in lesser covered regions and higher values can increase contiguity in well-covered regions. However, current approaches that use multiple k values do not address the scalability issues inherent to the assembly of large genomes. RESULTS: Here we present RResolver, a scalable algorithm that takes a short-read de Bruijn graph assembly with a starting k as input and uses a k value closer to that of the read length to resolve repeats. RResolver builds a Bloom filter of sequencing reads which is used to evaluate the assembly graph path support at branching points and removes paths with insufficient support. RResolver runs efficiently, taking only 26 min on average for an ABySS human assembly with 48 threads and 60 GiB memory. Across all experiments, compared to a baseline assembly, RResolver improves scaffold contiguity (NGA50) by up to 15% and reduces misassemblies by up to 12%. CONCLUSIONS: RResolver adds a missing component to scalable de Bruijn graph genome assembly. By improving the initial and fundamental graph traversal outcome, all downstream ABySS algorithms greatly benefit by working with a more accurate and less complex representation of the genome. 
The RResolver code is integrated into ABySS and is available at https://github.com/bcgsc/abyss/tree/master/RResolver .
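The two ideas in this abstract, k-mer nodes joined by overlaps of k − 1 bases and a larger k used to test path support at a branch, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions and not the RResolver implementation: a plain Python set stands in for the Bloom filter, and the function names, reads, and k values are all hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all k-mers of `seq` in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_de_bruijn(reads, k):
    """Nodes are k-mers; an edge u -> v exists when u[1:] == v[:-1] (k - 1 overlap)."""
    nodes = {km for read in reads for km in kmers(read, k)}
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:-1]].add(node)
    return {u: sorted(by_prefix[u[1:]]) for u in nodes}


def path_supported(path_seq, read_kmers, big_k):
    """A path counts as supported if every big_k-mer it spells occurs in the reads."""
    return all(km in read_kmers for km in kmers(path_seq, big_k))


# Two reads share the repeat CGTTG, so the k = 3 graph branches after TTG.
reads = ["ACGTTGCA", "TCGTTGGA"]
graph = build_de_bruijn(reads, 3)
branch = graph["TTG"]  # two successors: the graph alone cannot pick one

# A larger k (7 here, long enough to span the repeat) separates the two paths.
read_kmers = {km for read in reads for km in kmers(read, 7)}
true_path_ok = path_supported("ACGTTGCA", read_kmers, 7)
chimeric_path_ok = path_supported("ACGTTGGA", read_kmers, 7)
```

In this toy case the chimeric path, which enters the repeat from the first read and leaves along the second, has no support at k = 7 and would be the one removed.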
Affiliation(s)
- Vladimir Nikolić
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada
- The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4, Canada
| | - Amirhossein Afshinfard
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada
- The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4, Canada
| | - Justin Chu
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada
- The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4, Canada
| | - Johnathan Wong
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada
| | - Lauren Coombe
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada
| | - Ka Ming Nip
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada
- The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4, Canada
| | - René L. Warren
- Canada’s Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada
| | - Inanç Birol
- Canada's Michael Smith Genome Sciences Centre at BC Cancer, 570 W 7th Ave, Vancouver, V5Z 4S6, Canada
- The University of British Columbia, 2329 West Mall, Vancouver, V6T 1Z4, Canada
|
17
|
Li JX, Coombe L, Wong J, Birol I, Warren RL. ntEdit+Sealer: Efficient Targeted Error Resolution and Automated Finishing of Long-Read Genome Assemblies. Curr Protoc 2022; 2:e442. [PMID: 35567771 PMCID: PMC9196995 DOI: 10.1002/cpz1.442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
Abstract
High‐quality genome assemblies are crucial to many biological studies, and utilizing long sequencing reads can help achieve higher assembly contiguity. While long reads can resolve complex and repetitive regions of a genome, their relatively high associated error rates are still a major limitation. Long reads generally produce draft genome assemblies with lower base quality, which must be corrected with a genome polishing step. Hybrid genome polishing solutions can greatly improve the quality of long‐read genome assemblies by utilizing more accurate short reads to validate bases and correct errors. Currently available hybrid polishing methods rely on read alignments, and are therefore memory‐intensive and do not scale well to large genomes. Here we describe ntEdit+Sealer, an alignment‐free, k‐mer‐based genome finishing protocol that employs memory‐efficient Bloom filters. The protocol includes ntEdit for correcting base errors and small indels, and for marking potentially problematic regions, then Sealer for filling both assembly gaps and problematic regions flagged by ntEdit. ntEdit+Sealer produces highly accurate, error‐corrected genome assemblies, and is available as a Makefile pipeline from https://github.com/bcgsc/ntedit_sealer_protocol. © 2022 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol: Automated long‐read genome finishing with short reads Support Protocol: Selecting optimal values for k‐mer lengths (k) and Bloom filter size (b)
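As a rough illustration of the alignment-free idea described above, the Python sketch below loads short-read k-mers into a small Bloom filter and reports draft-assembly k-mers that the filter does not contain. It is a simplified sketch under stated assumptions, not the ntEdit or Sealer code: the BloomFilter class, its default size and hash count, and the function name are hypothetical, and the actual tools use their own data structures and correction logic.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode("ascii"), digest_size=8,
                                     salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def unsupported_kmer_starts(draft, reads, k):
    """Start positions of draft k-mers that are absent from the read k-mer filter."""
    bf = BloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            bf.add(read[i:i + k])
    return [i for i in range(len(draft) - k + 1) if draft[i:i + k] not in bf]


# The draft differs from the accurate read at index 7 (G in the read, C in the draft).
reads = ["ACGTACGGTC"]
draft = "ACGTACGCTC"
flagged = unsupported_kmer_starts(draft, reads, 4)
```

Every k-mer overlapping the erroneous base is missing from the filter, so a run of consecutive unsupported k-mers points to the region that needs correction. The choice of k and filter size matters in practice, which is what the Support Protocol addresses.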
Affiliation(s)
- Janet X Li
- Canada's Michael Smith Genome Sciences Center, Vancouver, BC, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada
| | - Lauren Coombe
- Canada's Michael Smith Genome Sciences Center, Vancouver, BC, Canada
| | - Johnathan Wong
- Canada's Michael Smith Genome Sciences Center, Vancouver, BC, Canada
| | - Inanç Birol
- Canada's Michael Smith Genome Sciences Center, Vancouver, BC, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
| | - René L Warren
- Canada's Michael Smith Genome Sciences Center, Vancouver, BC, Canada
|
18
|
Li C, Sutherland D, Hammond SA, Yang C, Taho F, Bergman L, Houston S, Warren RL, Wong T, Hoang LMN, Cameron CE, Helbing CC, Birol I. AMPlify: attentive deep learning model for discovery of novel antimicrobial peptides effective against WHO priority pathogens. BMC Genomics 2022; 23:77. [PMID: 35078402 PMCID: PMC8788131 DOI: 10.1186/s12864-022-08310-4] [Citation(s) in RCA: 37] [Impact Index Per Article: 18.5] [Received: 08/16/2021] [Accepted: 01/12/2022] [Indexed: 01/25/2023]
Abstract
BACKGROUND: Antibiotic resistance is a growing global health concern prompting researchers to seek alternatives to conventional antibiotics. Antimicrobial peptides (AMPs) are attracting attention again as therapeutic agents with promising utility in this domain, and using in silico methods to discover novel AMPs is a strategy that is gaining interest. Such methods can sift through large volumes of candidate sequences and reduce lab screening costs. RESULTS: Here we introduce AMPlify, an attentive deep learning model for AMP prediction, and demonstrate its utility in prioritizing peptide sequences derived from the Rana [Lithobates] catesbeiana (bullfrog) genome. We tested the bioactivity of our predicted peptides against a panel of bacterial species, including representatives from the World Health Organization's priority pathogens list. Four of our novel AMPs were active against multiple species of bacteria, including a multi-drug resistant isolate of carbapenemase-producing Escherichia coli. CONCLUSIONS: We demonstrate the utility of deep learning based tools like AMPlify in our fight against antibiotic resistance. We expect such tools to play a significant role in discovering novel candidates of peptide-based alternatives to classical antibiotics.
Affiliation(s)
- Chenkai Li
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Darcy Sutherland
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
- Public Health Laboratory, British Columbia Centre for Disease Control, Vancouver, BC, V5Z 4R4, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - S Austin Hammond
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Chen Yang
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Figali Taho
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Lauren Bergman
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, V8P 5C3, Canada
| | - Simon Houston
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, V8P 5C3, Canada
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Titus Wong
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Medical Microbiology Laboratory, Vancouver General Hospital, Vancouver, BC, V5Z 1M9, Canada
| | - Linda M N Hoang
- Public Health Laboratory, British Columbia Centre for Disease Control, Vancouver, BC, V5Z 4R4, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Caroline E Cameron
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, V8P 5C3, Canada
- Division of Infectious Diseases, Department of Medicine, University of Washington, Seattle, WA, 98195, USA
| | - Caren C Helbing
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, V8P 5C3, Canada
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.
- Public Health Laboratory, British Columbia Centre for Disease Control, Vancouver, BC, V5Z 4R4, Canada.
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada.
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada.
|
19
|
Stephenson M, Nip KM, HafezQorani S, Gagalova KK, Yang C, Warren RL, Birol I. RNA-Scoop: interactive visualization of transcripts in single-cell transcriptomes. NAR Genom Bioinform 2021; 3:lqab105. [PMID: 34859209 PMCID: PMC8633890 DOI: 10.1093/nargab/lqab105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2021] [Revised: 08/21/2021] [Accepted: 11/26/2021] [Indexed: 11/12/2022] Open
Abstract
Recent advances in single-cell RNA sequencing technologies have made detection of transcripts in single cells possible. The level of resolution provided by these technologies can be used to study changes in transcript usage across cell populations and help investigate new biology. Here, we introduce RNA-Scoop, an interactive cell cluster and transcriptome visualization tool to analyze transcript usage across cell categories and clusters. The tool allows users to examine differential transcript expression across clusters and investigate how usage of specific transcript expression mechanisms varies across cell groups.
Affiliation(s)
- Maria Stephenson
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Computer Science Co-op Program, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
| | - Ka Ming Nip
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC V5Z 4S6, Canada
| | - Saber HafezQorani
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC V5Z 4S6, Canada
| | - Kristina K Gagalova
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC V5Z 4S6, Canada
| | - Chen Yang
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC V5Z 4S6, Canada
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC V6H 3N1, Canada
|
20
|
Coombe L, Li JX, Lo T, Wong J, Nikolic V, Warren RL, Birol I. LongStitch: high-quality genome assembly correction and scaffolding using long reads. BMC Bioinformatics 2021; 22:534. [PMID: 34717540 PMCID: PMC8557608 DOI: 10.1186/s12859-021-04451-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 06/17/2021] [Accepted: 10/19/2021] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. RESULTS LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which includes initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM. CONCLUSIONS Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. 
The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
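The abstract above reports contiguity gains as NGA50 lengths. As a rough illustration of the underlying idea, the sketch below computes the simpler N50 and NG50 statistics from a list of sequence lengths. It is not code from LongStitch: the function name `nx50` and the example lengths are made up for illustration, and NGA50 additionally requires aligning the assembly to a reference and splitting sequences at misassemblies, which is not attempted here.

```python
from typing import Optional, Sequence


def nx50(lengths: Sequence[int], target_size: Optional[int] = None) -> Optional[int]:
    """Return the length L such that sequences of length >= L cover at least
    half of target_size. With target_size=None the assembly's own total
    length is used (N50); with an estimated genome size it is NG50."""
    total = target_size if target_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    # The assembly covers less than half of the target size.
    return None


if __name__ == "__main__":
    # Hypothetical sequence lengths before and after scaffolding.
    before = [80, 70, 50, 40, 30, 20]
    after = [150, 90, 50]
    print("N50 before/after:", nx50(before), nx50(after))
    print("NG50 before/after:", nx50(before, 300), nx50(after, 300))
```

Joining sequences leaves the total length unchanged but raises the statistic, which is why scaffolders are commonly compared on this family of metrics.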
Affiliation(s)
- Lauren Coombe
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada.
| | - Janet X Li
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
| | - Theodora Lo
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
| | - Johnathan Wong
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
| | - Vladimir Nikolic
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Research, 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
|
21
|
Warren RL, Birol I. HLA alleles measured from COVID-19 patient transcriptomes reveal associations with disease prognosis in a New York cohort. PeerJ 2021; 9:e12368. [PMID: 34722002 PMCID: PMC8522641 DOI: 10.7717/peerj.12368] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 06/03/2021] [Accepted: 10/01/2021] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND The Human Leukocyte Antigen (HLA) gene locus plays a fundamental role in human immunity, and it is established that certain HLA alleles are disease determinants. Previously, we have identified prevalent HLA class I and class II alleles, including DPA1*02:02, in two small patient cohorts at the COVID-19 pandemic onset. METHODS We have since analyzed a larger public patient cohort (n = 126 patients) with controls and associated demographic and clinical data. By combining the predictive power of multiple in silico HLA predictors, we report on HLA-I and HLA-II alleles, along with their associated risk significance. RESULTS We observe HLA-II DPA1*02:02 at a higher frequency in the COVID-19 positive cohort (29%) when compared to the COVID-negative control group (Fisher's exact test [FET] p = 0.0174). Having this allele, however, does not appear to put this cohort's patients at an increased risk of hospitalization. Inspection of COVID-19 disease severity outcomes, including admission to intensive care, reveals nominally significant risk associations with A*11:01 (FET p = 0.0078) and C*04:01 (FET p = 0.0087). The association with severe disease outcome is especially evident for patients with C*04:01, where disease prognosis measured by mechanical ventilation-free days was statistically significant after multiple hypothesis correction (Bonferroni p = 0.0323). While prevalence of some of these alleles falls below statistical significance after Bonferroni correction, COVID-19 patients with HLA-I C*04:01 tend to fare worse overall. This HLA allele may hold potential clinical value.
Affiliation(s)
- René L. Warren
- Genome Sciences Centre, BC Cancer, Vancouver, CA-BC, Canada
| | - Inanc Birol
- Genome Sciences Centre, BC Cancer, Vancouver, CA-BC, Canada
|
22
|
Jackman SD, Coombe L, Warren RL, Kirk H, Trinh E, MacLeod T, Pleasance S, Pandoh P, Zhao Y, Coope RJ, Bousquet J, Bohlmann J, Jones SJM, Birol I. Complete Mitochondrial Genome of a Gymnosperm, Sitka Spruce (Picea sitchensis), Indicates a Complex Physical Structure. Genome Biol Evol 2021; 12:1174-1179. [PMID: 32449750 PMCID: PMC7486957 DOI: 10.1093/gbe/evaa108] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Accepted: 05/20/2020] [Indexed: 12/12/2022] Open
Abstract
Plant mitochondrial genomes vary widely in size. Although many plant mitochondrial genomes have been sequenced and assembled, the vast majority are of angiosperms, and few are of gymnosperms. Most plant mitochondrial genomes are smaller than a megabase, with a few notable exceptions. We have sequenced and assembled the complete 5.5-Mb mitochondrial genome of Sitka spruce (Picea sitchensis), to date, one of the largest mitochondrial genomes of a gymnosperm. We sequenced the whole genome using Oxford Nanopore MinION, and then identified contigs of mitochondrial origin assembled from these long reads based on sequence homology to the white spruce mitochondrial genome. The assembly graph shows a multipartite genome structure, composed of one smaller 168-kb circular segment of DNA, and a larger 5.4-Mb single component with a branching structure. The assembly graph gives insight into a putative complex physical genome structure, and its branching points may represent active sites of recombination.
Affiliation(s)
- Shaun D Jackman
- Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Lauren Coombe
- Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - René L Warren
- Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Heather Kirk
- Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Eva Trinh
- Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Tina MacLeod
- Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Stephen Pleasance
- Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Pawan Pandoh
- Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Yongjun Zhao
- Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Robin J Coope
- Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Jean Bousquet
- Forest Genomics, Institute for Systems and Integrative Biology, Université Laval, Quebec, Quebec, Canada
| | - Joerg Bohlmann
- Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia, Canada
| | - Steven J M Jones
- Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
| | - Inanc Birol
- Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada
|
23
|
Abstract
As the year 2020 came to a close, several new strains have been reported of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the agent responsible for the coronavirus disease 2019 (COVID-19) pandemic that has afflicted us all this past year. However, it is difficult to comprehend the scale, in sequence space, geographical location and time, at which SARS-CoV-2 mutates and evolves in its human hosts. To get an appreciation for the rapid evolution of the coronavirus, we built interactive scalable vector graphics maps that show daily nucleotide variations in genomes from the six most populated continents compared to that of the initial, ground-zero SARS-CoV-2 isolate sequenced at the beginning of the year. Availability: The tool used to perform the reported mutation analysis results, ntEdit, is available from GitHub. Genome mutation reports are available for download from BCGSC. Mutation time maps are available from https://bcgsc.github.io/SARS2/.
Affiliation(s)
- René L. Warren
- Genome Sciences Centre, BC Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
| | - Inanc Birol
- Genome Sciences Centre, BC Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
|
24
|
Abstract
As the year 2020 came to a close, several new strains have been reported of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the agent responsible for the coronavirus disease 2019 (COVID-19) pandemic that has afflicted us all this past year. However, it is difficult to comprehend the scale, in sequence space, geographical location and time, at which SARS-CoV-2 mutates and evolves in its human hosts. To get an appreciation for the rapid evolution of the coronavirus, we built interactive scalable vector graphics maps that show daily nucleotide variations in genomes from the six most populated continents compared to that of the initial, ground-zero SARS-CoV-2 isolate sequenced at the beginning of the pandemic. Availability: The tool used to perform the reported mutation analysis results, ntEdit, is available from GitHub. Genome mutation reports are available for download from BCGSC. Mutation time maps are available from https://bcgsc.github.io/SARS2/.
Affiliation(s)
- René L. Warren
- Genome Sciences Centre, BC Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
| | - Inanc Birol
- Genome Sciences Centre, BC Cancer Agency, Vancouver, British Columbia, V5Z 4S6, Canada
|
25
|
Warren RL, Birol I. HLA predictions from the bronchoalveolar lavage fluid and blood samples of eight COVID-19 patients at the pandemic onset. Bioinformatics 2021; 36:5271-5273. [PMID: 32853340 PMCID: PMC7540287 DOI: 10.1093/bioinformatics/btaa756] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 04/01/2020] [Revised: 08/18/2020] [Accepted: 08/20/2020] [Indexed: 12/16/2022] Open
Affiliation(s)
- René L Warren
- Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
| | - Inanç Birol
- Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
|
26
|
Warren RL, Birol I. Interactive SARS-CoV-2 mutation timemaps. ArXiv 2020:2012.15697. [PMID: 33398246 PMCID: PMC7781321] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
As the year 2020 draws to an end, several new strains have been reported for the SARS-CoV-2 coronavirus, the agent responsible for the COVID-19 pandemic that has afflicted us all this past year. However, it is difficult to comprehend the scale, in sequence space, geographical location and time, at which SARS-CoV-2 mutates and evolves in its human hosts. To get an appreciation for the rapid evolution of the coronavirus, we built interactive scalable vector graphics maps that show daily nucleotide variations in genomes from the six most populated continents compared to that of the initial, ground-zero SARS-CoV-2 isolate sequenced at the beginning of the year. Availability: Mutation time maps are available from https://bcgsc.github.io/SARS2/.
|
27
|
Abstract
BACKGROUND The Human Leukocyte Antigen (HLA) gene locus plays a fundamental role in human immunity, and it is established that certain HLA alleles are disease determinants. METHODS By combining the predictive power of multiple in silico HLA predictors, we have previously identified prevalent HLA class I and class II alleles, including DPA1*02:02, in two small cohorts at the COVID-19 pandemic onset. Since then, newer and larger patient cohorts with controls and associated demographic and clinical data have been deposited in public repositories. Here, we report on HLA-I and HLA-II alleles, along with their associated risk significance in one such cohort of 126 patients, including COVID-19 positive (n=100) and negative patients (n=26). RESULTS We recapitulate an enrichment of DPA1*02:02 in the COVID-19 positive cohort (29%) when compared to the COVID-negative control group (Fisher's exact test [FET] p=0.0174). Having this allele, however, does not appear to put this cohort's patients at an increased risk of hospitalization. Inspection of COVID-19 disease severity outcomes reveals nominally significant risk associations with A*11:01 (FET p=0.0078), C*04:01 (FET p=0.0087) and DQA1*01:02 (FET p=0.0121). CONCLUSIONS While enrichment of these alleles falls below statistical significance after Bonferroni correction, COVID-19 patients with the latter three alleles tend to fare worse overall. This is especially evident for patients with C*04:01, where disease prognosis measured by mechanical ventilation-free days was statistically significant after multiple hypothesis correction (Bonferroni p = 0.0023), and may hold potential clinical value.
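The statistics named in this abstract, Fisher's exact test on a 2×2 carrier table followed by Bonferroni correction, can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions and not the authors' analysis code. The 29 carriers among 100 COVID-19 positive patients follow from the abstract, but the control carrier count and the number of tests are placeholders, so the printed values are not expected to reproduce the reported p-values.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x: int) -> int:
        # Unnormalised probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / comb(row1 + row2, col1)


def bonferroni(p: float, num_tests: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * num_tests)


if __name__ == "__main__":
    pos_carriers, pos_noncarriers = 29, 71  # 29% of n=100, from the abstract
    neg_carriers, neg_noncarriers = 2, 24   # ASSUMPTION: placeholder split of n=26
    num_tests = 20                          # ASSUMPTION: number of alleles tested
    p = fisher_exact_two_sided(pos_carriers, pos_noncarriers,
                               neg_carriers, neg_noncarriers)
    print(f"nominal p = {p:.4f}, Bonferroni-adjusted p = {bonferroni(p, num_tests):.4f}")
```

Comparing integer weights rather than floating-point probabilities avoids tolerance issues when deciding which tables count as at least as extreme as the observed one.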
Affiliation(s)
- René L Warren
- Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
| | - Inanç Birol
- Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
|
28
|
Nip KM, Chiu R, Yang C, Chu J, Mohamadi H, Warren RL, Birol I. RNA-Bloom enables reference-free and reference-guided sequence assembly for single-cell transcriptomes. Genome Res 2020; 30:1191-1200. [PMID: 32817073 PMCID: PMC7462077 DOI: 10.1101/gr.260174.119] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 12/11/2019] [Accepted: 07/23/2020] [Indexed: 12/27/2022]
Abstract
Despite the rapid advance in single-cell RNA sequencing (scRNA-seq) technologies within the last decade, single-cell transcriptome analysis workflows have primarily used gene expression data while isoform sequence analysis at the single-cell level still remains fairly limited. Detection and discovery of isoforms in single cells is difficult because of the inherent technical shortcomings of scRNA-seq data, and existing transcriptome assembly methods are mainly designed for bulk RNA samples. To address this challenge, we developed RNA-Bloom, an assembly algorithm that leverages the rich information content aggregated from multiple single-cell transcriptomes to reconstruct cell-specific isoforms. Assembly with RNA-Bloom can be either reference-guided or reference-free, thus enabling unbiased discovery of novel isoforms or foreign transcripts. We compared both assembly strategies of RNA-Bloom against five state-of-the-art reference-free and reference-based transcriptome assembly methods. In our benchmarks on a simulated 384-cell data set, reference-free RNA-Bloom reconstructed 37.9%–38.3% more isoforms than the best reference-free assembler, whereas reference-guided RNA-Bloom reconstructed 4.1%–11.6% more isoforms than reference-based assemblers. When applied to a real 3840-cell data set consisting of more than 4 billion reads, RNA-Bloom reconstructed 9.7%–25.0% more isoforms than the best competing reference-based and reference-free approaches evaluated. We expect RNA-Bloom to boost the utility of scRNA-seq data beyond gene expression analysis, expanding what is informatically accessible now.
Affiliation(s)
- Ka Ming Nip
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada V5Z 4S6
| | - Readman Chiu
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada V5Z 4S6
| | - Chen Yang
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada V5Z 4S6
| | - Justin Chu
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada V5Z 4S6
| | - Hamid Mohamadi
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada V5Z 4S6
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada V5Z 4S6
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, British Columbia, Canada V5Z 4S6
- Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada V6H 3N1
|
29
|
Warren RL, Coombe L, Mohamadi H, Zhang J, Jaquish B, Isabel N, Jones SJM, Bousquet J, Bohlmann J, Birol I. ntEdit: scalable genome sequence polishing. Bioinformatics 2020; 35:4430-4432. [PMID: 31095290 PMCID: PMC6821332 DOI: 10.1093/bioinformatics/btz400] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 12/05/2018] [Revised: 03/04/2019] [Accepted: 05/07/2019] [Indexed: 02/05/2023] Open
Abstract
Motivation In the modern genomics era, genome sequence assemblies are routine practice. However, depending on the methodology, resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they work best with high read coverage and do not scale well. We developed ntEdit, a Bloom filter-based genome sequence editing utility that scales to large mammalian and conifer genomes. Results We first tested ntEdit and the state-of-the-art assembly improvement tools GATK, Pilon and Racon on controlled Escherichia coli and Caenorhabditis elegans sequence data. Generally, ntEdit performs well at low sequence depths (<20×), fixing the majority (>97%) of base substitutions and indels, and its performance is largely constant with increased coverage. In all experiments conducted using a single CPU, the ntEdit pipeline executed in <14 s and <3 m, on average, on E.coli and C.elegans, respectively. We performed similar benchmarks on a sub-20× coverage human genome sequence dataset, inspecting accuracy and resource usage in editing chromosomes 1 and 21, and whole genome. ntEdit scaled linearly, executing in 30–40 m on those sequences. We show how ntEdit ran in <2 h 20 m to improve upon long and linked read human genome assemblies of NA12878, using high-coverage (54×) Illumina sequence data from the same individual, fixing frame shifts in coding sequences. We also generated 17-fold coverage spruce sequence data from haploid sequence sources (seed megagametophyte), and used it to edit our pseudo haploid assemblies of the 20 Gb interior and white spruce genomes in <4 and <5 h, respectively, making roughly 50M edits at a (substitution+indel) rate of 0.0024. Availability and implementation https://github.com/bcgsc/ntedit Supplementary information Supplementary data are available at Bioinformatics online.
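The abstract describes ntEdit as a Bloom filter-based editing utility. The sketch below shows only the general idea, under loud assumptions: a toy Bloom filter is loaded with k-mers from trusted reads, and draft positions whose k-mer is absent from the filter are flagged as candidate errors. The class, the function names, the hashing scheme and the sequences are all hypothetical, and ntEdit's actual data structures and correction logic are considerably more involved.

```python
import hashlib
from typing import Iterable, Iterator, List


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several hash values by prefixing a seed (illustrative scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(f"{seed}:{item}".encode("ascii"),
                                     digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq: str, k: int) -> List[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def flag_unsupported(draft: str, reads: Iterable[str], k: int) -> List[int]:
    """Return start positions of draft k-mers not supported by any read."""
    bloom = BloomFilter()
    for read in reads:
        for kmer in kmers(read, k):
            bloom.add(kmer)
    return [i for i, kmer in enumerate(kmers(draft, k)) if kmer not in bloom]


if __name__ == "__main__":
    truth = "ACGTTGCATGCCGATAGCTAGGCTTACG"  # made-up sequence standing in for reads
    draft = "ACGTTGCATGCCGCTAGCTAGGCTTACG"  # same sequence with one substitution
    print(flag_unsupported(draft, [truth], k=7))
```

Because the filter stores only bits rather than the k-mers themselves, memory use stays fixed regardless of how many k-mers are inserted, at the price of occasional false positives.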
Affiliation(s)
- René L Warren
- Genome Sciences Centre, BC Cancer, Vancouver, Canada
| | - Lauren Coombe
- Genome Sciences Centre, BC Cancer, Vancouver, Canada
| | | | - Jessica Zhang
- Genome Sciences Centre, BC Cancer, Vancouver, Canada
| | - Barry Jaquish
- BC Ministry of Forests, Lands, and Natural Resource Operations, Victoria, Canada
| | - Nathalie Isabel
- Laurentian Forestry Centre, Natural Resources Canada, Québec, Canada
| | | | - Jean Bousquet
- Canada Research Chair in Forest Genomics, Université Laval, Québec, Canada
| | - Joerg Bohlmann
- Michael Smith Laboratories, University of British Columbia, Vancouver, Canada
| | - Inanç Birol
- Genome Sciences Centre, BC Cancer, Vancouver, Canada
|
30
|
Hafezqorani S, Yang C, Lo T, Nip KM, Warren RL, Birol I. Trans-NanoSim characterizes and simulates nanopore RNA-sequencing data. Gigascience 2020; 9:5855462. [PMID: 32520350 PMCID: PMC7285873 DOI: 10.1093/gigascience/giaa061] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 02/11/2020] [Revised: 04/14/2020] [Accepted: 05/12/2020] [Indexed: 01/08/2023] Open
Abstract
Background Compared with second-generation sequencing technologies, third-generation single-molecule RNA sequencing has unprecedented advantages; the long reads it generates facilitate isoform-level transcript characterization. In particular, the Oxford Nanopore Technology sequencing platforms have become more popular in recent years owing to their relatively high affordability and portability compared with other third-generation sequencing technologies. To aid the development of analytical tools that leverage the power of this technology, simulated data provide a cost-effective solution with ground truth. However, a nanopore sequence simulator targeting transcriptomic data is not available yet. Findings We introduce Trans-NanoSim, a tool that simulates reads with technical and transcriptome-specific features learnt from nanopore RNA-sequencing data. We comprehensively benchmarked Trans-NanoSim on direct RNA and complementary DNA datasets describing human and mouse transcriptomes. Through comparison against other nanopore read simulators, we show the unique advantage and robustness of Trans-NanoSim in capturing the characteristics of nanopore complementary DNA and direct RNA reads. Conclusions As a cost-effective alternative to sequencing real transcriptomes, Trans-NanoSim will facilitate the rapid development of analytical tools for nanopore RNA-sequencing data. Trans-NanoSim and its pre-trained models are freely accessible at https://github.com/bcgsc/NanoSim.
Affiliation(s)
- Saber Hafezqorani
- Canada's Michael Smith Genome Sciences Centre, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
- Bioinformatics Graduate Program, University of British Columbia, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
| | - Chen Yang
- Canada's Michael Smith Genome Sciences Centre, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
- Bioinformatics Graduate Program, University of British Columbia, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
| | - Theodora Lo
- Canada's Michael Smith Genome Sciences Centre, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
| | - Ka Ming Nip
- Canada's Michael Smith Genome Sciences Centre, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
- Bioinformatics Graduate Program, University of British Columbia, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, 100 - 570 W 7th Ave, Vancouver, BC Cancer, BC V5Z 4S6 Canada
- Department of Medical Genetics, University of British Columbia, 2350 Health Science Mall, Vancouver, BC V6T 1Z3, Canada
|
31
|
Coombe L, Nikolić V, Chu J, Birol I, Warren RL. ntJoin: Fast and lightweight assembly-guided scaffolding using minimizer graphs. Bioinformatics 2020; 36:3885-3887. [PMID: 32311025 PMCID: PMC7320612 DOI: 10.1093/bioinformatics/btaa253] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 01/13/2020] [Revised: 03/23/2020] [Accepted: 04/14/2020] [Indexed: 11/17/2022] Open
Abstract
SUMMARY The ability to generate high-quality genome sequences is a cornerstone of modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies still fall short of reference grade. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short-read assembly with a draft long-read assembly and a draft assembly with an assembly from a closely related species. When scaffolding a human short-read assembly using the reference human genome or a long-read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 m, using <11 GB of RAM. Compared to existing reference-guided scaffolders, ntJoin generates highly contiguous assemblies faster and using less memory. AVAILABILITY AND IMPLEMENTATION ntJoin is written in C++ and Python and is freely available at https://github.com/bcgsc/ntjoin. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
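The "ordered minimizer sketches" mentioned in this abstract can be illustrated with a generic windowed-minimizer routine: in every window of w consecutive k-mers the k-mer with the smallest hash is kept, and the selected k-mers are reported in sequence order. This is a minimal sketch of the general technique, not ntJoin's implementation. The hash function, the parameter values and the function names are assumptions chosen for illustration.

```python
import hashlib
from typing import List, Tuple


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_sketch(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Return (position, k-mer) minimizers in the order they occur in seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    hashes = [kmer_hash(kmer) for kmer in kmers]
    sketch: List[Tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        # Smallest hash in the window; ties go to the leftmost k-mer.
        best = min(range(start, start + w), key=lambda i: (hashes[i], i))
        if not sketch or sketch[-1][0] != best:
            sketch.append((best, kmers[best]))
    return sketch


if __name__ == "__main__":
    # Two made-up sequences that share a 20-base core.
    core = "GATTACAGGCTTCAGTCCAT"
    draft = "TTTTTT" + core + "ACACAC"
    reference = "CGCGCG" + core + "GTGTGT"
    draft_minimizers = {kmer for _, kmer in minimizer_sketch(draft, k=5, w=4)}
    shared = [(pos, kmer) for pos, kmer in minimizer_sketch(reference, k=5, w=4)
              if kmer in draft_minimizers]
    print(shared)
```

Any window that lies entirely inside the shared core selects the same k-mer in both sequences, so shared minimizers can act as lightweight anchors between a draft and a reference without computing alignments.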
Affiliation(s)
- Lauren Coombe
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
| | - Vladimir Nikolić
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
| | - Justin Chu
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
| | - Inanc Birol
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
| | - René L Warren
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada
|
32
|
Law WD, Warren RL, McCallion AS. Establishment of an eHAP1 human haploid cell line hybrid reference genome assembled from short and long reads. Genomics 2020; 112:2379-2384. [PMID: 31962144 DOI: 10.1016/j.ygeno.2020.01.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/19/2019] [Revised: 01/13/2020] [Accepted: 01/15/2020] [Indexed: 12/31/2022]
Abstract
Haploid cell lines are a valuable research tool with broad applicability for genetic assays. As such, the fully haploid human cell line, eHAP1, has been used in a wide array of studies. However, the absence of a corresponding reference genome sequence for this cell line has limited the potential for more widespread applications to experiments dependent on available sequence, like capture-clone methodologies. We generated ~15× coverage Nanopore long reads from ten GridION flowcells and used these data to assemble a de novo draft genome with minimap and miniasm, which we subsequently polished with Racon. This assembly was further polished using previously generated, low-coverage Illumina short reads with Pilon and ntEdit. This resulted in a hybrid eHAP1 assembly with >90% complete BUSCO scores. We further assessed the eHAP1 long-read data for structural variants using Sniffles and identified a variety of rearrangements, including a previously established Philadelphia translocation. Finally, we demonstrate how some of these variants overlap open chromatin regions, potentially impacting regulatory regions. By integrating both long and short reads, we generated a high-quality reference assembly for eHAP1 cells. The union of long and short reads demonstrates the utility of combining sequencing platforms to generate a high-quality reference genome de novo solely from low-coverage data. We expect the resulting eHAP1 genome assembly to provide a useful resource to enable novel experimental applications in this important model cell line.
Affiliation(s)
- William D Law
- McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.
| | - René L Warren
- Genome Sciences Centre, BC Cancer, Vancouver, BC V5Z 4S6, Canada.
| | - Andrew S McCallion
- McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA; Department of Molecular and Comparative Pathobiology, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA; Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA.
| |
|
33
|
Warren RL, Birol I. HLA predictions from the bronchoalveolar lavage fluid samples of five patients at the early stage of the wuhan seafood market COVID-19 outbreak. ArXiv 2020:arXiv:2004.07108v3. [PMID: 32550246 PMCID: PMC7280900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
Abstract
We are in the midst of a global viral pandemic, one with no cure and a high mortality rate. The Human Leukocyte Antigen (HLA) gene complex plays a critical role in host immunity. We predicted HLA class I and II alleles from the transcriptome sequencing data prepared from the bronchoalveolar lavage fluid samples of five patients at the early stage of the COVID-19 outbreak. We identified the HLA-I allele A*24:02 in four out of five patients, which is higher than the expected frequency (17.2%) in the South Han Chinese population. The difference is statistically significant with a p-value less than 10⁻⁴. Our analysis results may help provide future insights on disease susceptibility.
Affiliation(s)
- René L Warren
- Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
| | - Inanç Birol
- Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
| |
|
34
|
Helbing CC, Hammond SA, Jackman SH, Houston S, Warren RL, Cameron CE, Birol I. Antimicrobial peptides from Rana [Lithobates] catesbeiana: Gene structure and bioinformatic identification of novel forms from tadpoles. Sci Rep 2019; 9:1529. [PMID: 30728430 PMCID: PMC6365531 DOI: 10.1038/s41598-018-38442-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 09/17/2018] [Accepted: 12/28/2018] [Indexed: 01/21/2023] Open
Abstract
Antimicrobial peptides (AMPs) exhibit broad-spectrum antimicrobial activity, and have promise as new therapeutic agents. While the adult North American bullfrog (Rana [Lithobates] catesbeiana) is a prolific source of high-potency AMPs, the aquatic tadpole represents a relatively untapped source for new AMP discovery. The recent publication of the bullfrog genome and transcriptomic resources provides an opportune bridge between known AMPs and bioinformatics-based AMP discovery. The objective of the present study was to identify novel AMPs with therapeutic potential using a combined bioinformatics and wet lab-based approach. In the present study, we identified seven novel AMP precursor-encoding transcripts expressed in the tadpole. Comparison of their amino acid sequences with known AMPs revealed evidence of mature peptide sequence conservation with variation in the prepro sequence. Two mature peptide sequences were unique and demonstrated bacteriostatic and bactericidal activity against Mycobacteria but not Gram-negative or Gram-positive bacteria. Nine known and seven novel AMP-encoding transcripts were detected in premetamorphic tadpole back skin, olfactory epithelium, liver, and/or tail fin. Treatment of tadpoles with 10 nM 3,5,3'-triiodothyronine for 48 h did not affect transcript abundance in the back skin, and had limited impact on these transcripts in the other three tissues. Gene mapping revealed considerable diversity in size (1.6-15 kbp) and exon number (one to four) of AMP-encoding genes with clear evidence of alternative splicing leading to both prepro and mature amino acid sequence diversity. These findings verify the accuracy and utility of the bullfrog genome assembly, and set a firm foundation for bioinformatics-based AMP discovery.
Affiliation(s)
- Caren C Helbing
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia, V8P 5C2, Canada.
| | - S Austin Hammond
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Shireen H Jackman
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia, V8P 5C2, Canada
| | - Simon Houston
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia, V8P 5C2, Canada
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Caroline E Cameron
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, British Columbia, V8P 5C2, Canada
| | - Inanç Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| |
|
35
|
Abstract
Motivation Sequencing of human genomes is now routine, and assembly of shotgun reads is increasingly feasible. However, assemblies often fail to inform about chromosome-scale structure due to a lack of linkage information over long stretches of DNA, a shortcoming that is being addressed by new sequencing protocols, such as the GemCode and Chromium linked reads from 10× Genomics. Results Here, we present ARCS, an application that utilizes the barcoding information contained in linked reads to further organize draft genomes into highly contiguous assemblies. We show how the contiguity of an ABySS H. sapiens genome assembly can be increased over six-fold, using moderate coverage (25-fold) Chromium data. We expect ARCS to have broad utility in harnessing the barcoding information contained in linked read data for connecting high-quality sequences in genome assembly drafts. Availability and implementation https://github.com/bcgsc/ARCS/ Supplementary information Supplementary data are available at Bioinformatics online.
|
36
|
Xue Z, Warren RL, Gibb EA, MacMillan D, Wong J, Chiu R, Hammond SA, Yang C, Nip KM, Ennis CA, Hahn A, Reynolds S, Birol I. Recurrent tumor-specific regulation of alternative polyadenylation of cancer-related genes. BMC Genomics 2018; 19:536. [PMID: 30005633 PMCID: PMC6045855 DOI: 10.1186/s12864-018-4903-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 03/08/2018] [Accepted: 06/27/2018] [Indexed: 01/09/2023] Open
Abstract
Background Alternative polyadenylation (APA) results in messenger RNA molecules with different 3′ untranslated regions (3′ UTRs), affecting the molecules’ stability, localization, and translation. APA is pervasive and implicated in cancer. Earlier reports on APA focused on 3′ UTR length modifications and commonly characterized APA events as 3′ UTR shortening or lengthening. However, such characterization oversimplifies the processing of 3′ ends of transcripts and fails to adequately describe the various scenarios we observe. Results We built a cloud-based targeted de novo transcript assembly and analysis pipeline that incorporates our previously developed cleavage site prediction tool, KLEAT. We applied this pipeline to elucidate the APA profiles of 114 genes in 9939 tumor and 729 tissue normal samples from The Cancer Genome Atlas (TCGA). The full set of 10,668 RNA-Seq samples from 33 cancer types has not been utilized by previous APA studies. By comparing the frequencies of predicted cleavage sites between normal and tumor sample groups, we identified 77 events (i.e. gene-cancer type pairs) of tumor-specific APA regulation in 13 cancer types; for 15 genes, such regulation is recurrent across multiple cancers. Our results also support a previous report showing the 3′ UTR shortening of FGF2 in multiple cancers. However, over half of the events we identified display complex changes to 3′ UTR length that resist simple classification like shortening or lengthening. Conclusions Recurrent tumor-specific regulation of APA is widespread in cancer. However, the regulation pattern that we observed in TCGA RNA-seq data cannot be described as straightforward 3′ UTR shortening or lengthening. Continued investigation into this complex, nuanced regulatory landscape will provide further insight into its role in tumor formation and development.
Electronic supplementary material The online version of this article (10.1186/s12864-018-4903-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Zhuyi Xue
- BC Cancer Agency, Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - René L Warren
- BC Cancer Agency, Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Ewan A Gibb
- BC Cancer Agency, Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Daniel MacMillan
- BC Cancer Agency, Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Johnathan Wong
- BC Cancer Agency, Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Readman Chiu
- BC Cancer Agency, Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - S Austin Hammond
- BC Cancer Agency, Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Chen Yang
- BC Cancer Agency, Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Ka Ming Nip
- BC Cancer Agency, Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Catherine A Ennis
- BC Cancer Agency, Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada
| | - Abigail Hahn
- Institute for Systems Biology, Seattle, WA 98109, USA
| | | | - Inanc Birol
- BC Cancer Agency, Genome Sciences Centre, Vancouver, BC, V5Z 4S6, Canada; Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada.
| |
|
37
|
Coombe L, Zhang J, Vandervalk BP, Chu J, Jackman SD, Birol I, Warren RL. ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers. BMC Bioinformatics 2018; 19:234. [PMID: 29925315 PMCID: PMC6011487 DOI: 10.1186/s12859-018-2243-x] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Received: 05/02/2018] [Accepted: 06/13/2018] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND The long-range sequencing information captured by linked reads, such as those available from 10× Genomics (10xG), helps resolve genome sequence repeats, and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. The reliance on kmers instead of read alignments for pairing sequences relaxes the workflow requirements, and drastically reduces the run time. RESULTS Here, we show how linked reads, when used in conjunction with Hi-C data for scaffolding, improve a draft human genome assembly of PacBio long-read data five-fold (baseline vs. ARKS NG50 = 4.6 vs. 23.1 Mbp, respectively). We also demonstrate how the method provides further improvements of a megabase-scale Supernova human genome assembly (NG50 = 14.74 Mbp vs. 25.94 Mbp before and after ARKS), which itself exclusively uses linked read data for assembly, with an execution speed six to nine times faster than competitive linked read scaffolders (~ 10.5 h compared to 75.7 h, on average). Following ARKS scaffolding of a human genome 10xG Supernova assembly (of cell line NA12878), fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n = 13). CONCLUSIONS ARKS uses a kmer mapping strategy instead of linked read alignments to record and associate the barcode information needed to order and orient draft assembly sequences. 
The simplified workflow, when compared to that of our initial implementation, ARCS, markedly improves run-time performance on experimental human genome datasets. Furthermore, the novel distance estimator in ARKS utilizes barcoding information from linked reads to estimate gap sizes. It accomplishes this by modeling the relationship between known distances separating regions within contigs and the Jaccard indices calculated for those regions. ARKS has the potential to provide correct, chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping refine draft genomes.
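The Jaccard index mentioned in the abstract above is a set-similarity measure: the size of the intersection of two sets divided by the size of their union. The following minimal Python sketch computes it for two sets of linked-read barcodes. The function name `jaccard_index` and the barcode labels are hypothetical, and the sketch covers only the similarity measure, not the ARKS distance model that relates it to gap size.

```python
def jaccard_index(barcodes_a, barcodes_b):
    """Jaccard index |A & B| / |A | B| of two barcode collections.

    Returns 0.0 when both collections are empty (a convention chosen
    for this sketch to avoid division by zero).
    """
    a, b = set(barcodes_a), set(barcodes_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical barcodes seen at the ends of two draft sequences.
    end_of_contig_1 = {"bc1", "bc2", "bc3"}
    start_of_contig_2 = {"bc2", "bc3", "bc4"}
    print(jaccard_index(end_of_contig_1, start_of_contig_2))
```

Intuitively, two regions that lie close together on the genome tend to be covered by the same long molecules and therefore share more barcodes, so a higher index suggests a shorter distance between them.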
Affiliation(s)
- Lauren Coombe
- BC Cancer Genome Sciences Centre, Vancouver, BC V5Z 4S6 Canada
| | - Jessica Zhang
- BC Cancer Genome Sciences Centre, Vancouver, BC V5Z 4S6 Canada
| | | | - Justin Chu
- BC Cancer Genome Sciences Centre, Vancouver, BC V5Z 4S6 Canada
| | | | - Inanc Birol
- BC Cancer Genome Sciences Centre, Vancouver, BC V5Z 4S6 Canada
| | - René L. Warren
- BC Cancer Genome Sciences Centre, Vancouver, BC V5Z 4S6 Canada
| |
|
38
|
Kucuk E, Chu J, Vandervalk BP, Hammond SA, Warren RL, Birol I. Kollector: transcript-informed, targeted de novo assembly of gene loci. Bioinformatics 2018; 33:1782-1788. [PMID: 28186221 PMCID: PMC5572715 DOI: 10.1093/bioinformatics/btx078] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 09/16/2016] [Accepted: 02/07/2017] [Indexed: 11/13/2022] Open
Abstract
Motivation Despite considerable advancements in sequencing and computing technologies, de novo assembly of whole eukaryotic genomes is still a time-consuming task that requires a significant amount of computational resources and expertise. A targeted assembly approach to perform local assembly of sequences of interest remains a valuable option for some applications. This is especially true for gene-centric assemblies, whose resulting sequence can be readily utilized for more focused biological research. Here we describe Kollector, an alignment-free targeted assembly pipeline that uses thousands of transcript sequences concurrently to inform the localized assembly of corresponding gene loci. Kollector robustly reconstructs introns and novel sequences within these loci, and scales well to large genomes, properties that make it especially useful for researchers working on non-model eukaryotic organisms. Results We demonstrate the performance of Kollector for assembling complete or near-complete Caenorhabditis elegans and Homo sapiens gene loci from their respective input transcripts. In a time- and memory-efficient manner, the Kollector pipeline successfully reconstructs 99% and 80% (compared to 86% and 73% with standard de novo assembly techniques) of C. elegans and H. sapiens transcript targets, respectively, in their corresponding genomic space using whole genome shotgun sequencing reads. We also show that Kollector outperforms both established and recently released targeted assembly tools. Finally, we demonstrate three use cases for Kollector, including comparative and cancer genomics applications. Availability and Implementation Kollector is implemented as a bash script, and is available at https://github.com/bcgsc/kollector Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Erdi Kucuk
- University of British Columbia, Vancouver, BC, Canada; Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Justin Chu
- University of British Columbia, Vancouver, BC, Canada; Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Benjamin P Vandervalk
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - S Austin Hammond
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Inanc Birol
- University of British Columbia, Vancouver, BC, Canada; Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada; Simon Fraser University, Burnaby, BC, Canada
| |
|
39
|
Jones SJM, Taylor GA, Chan S, Warren RL, Hammond SA, Bilobram S, Mordecai G, Suttle CA, Miller KM, Schulze A, Chan AM, Jones SJ, Tse K, Li I, Cheung D, Mungall KL, Choo C, Ally A, Dhalla N, Tam AKY, Troussard A, Kirk H, Pandoh P, Paulino D, Coope RJN, Mungall AJ, Moore R, Zhao Y, Birol I, Ma Y, Marra M, Haulena M. The Genome of the Beluga Whale (Delphinapterus leucas). Genes (Basel) 2017; 8:genes8120378. [PMID: 29232881 PMCID: PMC5748696 DOI: 10.3390/genes8120378] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 09/12/2017] [Revised: 11/28/2017] [Accepted: 12/01/2017] [Indexed: 12/17/2022] Open
Abstract
The beluga whale is a cetacean that inhabits arctic and subarctic regions, and is the only living member of the genus Delphinapterus. The genome of the beluga whale was determined using DNA sequencing approaches that employed both microfluidic partitioning library and non-partitioned library construction. The former allowed for the construction of a highly contiguous assembly with a scaffold N50 length of over 19 Mbp and total reconstruction of 2.32 Gbp. To aid our understanding of the functional elements, transcriptome data was also derived from brain, duodenum, heart, lung, spleen, and liver tissue. Assembled sequence and all of the underlying sequence data are available at the National Center for Biotechnology Information (NCBI) under the Bioproject accession number PRJNA360851A.
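Several abstracts in this list report contiguity as N50 or NG50. As a reminder of what those statistics measure, here is a minimal Python sketch under the usual definitions: N50 is the length of the sequence at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size; NG50 uses half of an estimated genome size instead. The function name `n50` and the toy lengths are illustrative assumptions, not values from any of the papers.

```python
def n50(lengths, genome_size=None):
    """Return N50 of the given sequence lengths.

    If genome_size is provided, half of that value is used as the
    threshold instead of half the assembly size, which yields NG50.
    Returns 0 when the sequences never reach the threshold.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the target size
            return length
    return 0


if __name__ == "__main__":
    scaffold_lengths = [2, 3, 4, 5, 6]  # toy values, arbitrary units
    print("N50:", n50(scaffold_lengths))
    print("NG50:", n50(scaffold_lengths, genome_size=30))
```

Because NG50 is tied to the genome size rather than to the assembly size, it stays comparable between assemblies of the same genome that differ in total reconstructed length.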
Affiliation(s)
- Steven J M Jones
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC V5A 1S6, Canada.
- Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
| | - Gregory A Taylor
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Simon Chan
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - S Austin Hammond
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Steven Bilobram
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Gideon Mordecai
- Department of Earth, Ocean & Atmospheric Sciences, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
- Institute for the Oceans & Fisheries, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
| | - Curtis A Suttle
- Department of Earth, Ocean & Atmospheric Sciences, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
- Institute for the Oceans & Fisheries, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
- Department of Microbiology & Immunology, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
- Department of Botany, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
| | - Kristina M Miller
- Fisheries and Oceans Canada, Molecular Genetics Section, Pacific Biological Station, Nanaimo, BC V9R 5K6, Canada.
| | - Angela Schulze
- Fisheries and Oceans Canada, Molecular Genetics Section, Pacific Biological Station, Nanaimo, BC V9R 5K6, Canada.
| | - Amy M Chan
- Department of Earth, Ocean & Atmospheric Sciences, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
- Institute for the Oceans & Fisheries, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
| | - Samantha J Jones
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
| | - Kane Tse
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Irene Li
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Dorothy Cheung
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Karen L Mungall
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Caleb Choo
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Adrian Ally
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Noreen Dhalla
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Angela K Y Tam
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Armelle Troussard
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Heather Kirk
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Pawan Pandoh
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Daniel Paulino
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Robin J N Coope
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Andrew J Mungall
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Richard Moore
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Yongjun Zhao
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
| | - Yussanne Ma
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Marco Marra
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
| | | |
|
40
|
Jones SJ, Haulena M, Taylor GA, Chan S, Bilobram S, Warren RL, Hammond SA, Mungall KL, Choo C, Kirk H, Pandoh P, Ally A, Dhalla N, Tam AKY, Troussard A, Paulino D, Coope RJN, Mungall AJ, Moore R, Zhao Y, Birol I, Ma Y, Marra M, Jones SJM. The Genome of the Northern Sea Otter (Enhydra lutris kenyoni). Genes (Basel) 2017; 8:genes8120379. [PMID: 29232880 PMCID: PMC5748697 DOI: 10.3390/genes8120379] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 09/12/2017] [Revised: 11/28/2017] [Accepted: 12/01/2017] [Indexed: 11/21/2022] Open
Abstract
The northern sea otter inhabits coastal waters of the northern Pacific Ocean and is the largest member of the Mustelidae family. DNA sequencing methods that utilize microfluidic partitioned and non-partitioned library construction were used to establish the sea otter genome. The final assembly provided 2.426 Gbp of highly contiguous assembled genomic sequences with a scaffold N50 length of over 38 Mbp. We generated transcriptome data derived from a lymphoma to aid in the determination of functional elements. The assembled genome sequence and underlying sequence data are available at the National Center for Biotechnology Information (NCBI) under the BioProject accession number PRJNA388419.
Affiliation(s)
- Samantha J Jones
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
| | | | - Gregory A Taylor
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Simon Chan
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Steven Bilobram
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - S Austin Hammond
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Karen L Mungall
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Caleb Choo
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Heather Kirk
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Pawan Pandoh
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Adrian Ally
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Noreen Dhalla
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Angela K Y Tam
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Armelle Troussard
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Daniel Paulino
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Robin J N Coope
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Andrew J Mungall
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Richard Moore
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Yongjun Zhao
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
| | - Yussanne Ma
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
| | - Marco Marra
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
| | - Steven J M Jones
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC V5Z 4E6, Canada.
- Department of Medical Genetics, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC V5A 1S6, Canada.
| |
|
41
|
Hammond SA, Warren RL, Vandervalk BP, Kucuk E, Khan H, Gibb EA, Pandoh P, Kirk H, Zhao Y, Jones M, Mungall AJ, Coope R, Pleasance S, Moore RA, Holt RA, Round JM, Ohora S, Walle BV, Veldhoen N, Helbing CC, Birol I. The North American bullfrog draft genome provides insight into hormonal regulation of long noncoding RNA. Nat Commun 2017; 8:1433. [PMID: 29127278 PMCID: PMC5681567 DOI: 10.1038/s41467-017-01316-7] [Citation(s) in RCA: 72] [Impact Index Per Article: 10.3] [Received: 02/22/2017] [Accepted: 09/07/2017] [Indexed: 12/16/2022] Open
Abstract
Frogs play important ecological roles, and several species are important model organisms for scientific research. The globally distributed Ranidae (true frogs) are the largest frog family, and have substantial evolutionary distance from the model laboratory Xenopus frog species. Unfortunately, there are currently no genomic resources for the former, important group of amphibians. More widely applicable amphibian genomic data is urgently needed as more than two-thirds of known species are currently threatened or are undergoing population declines. We report a 5.8 Gbp (NG50 = 69 kbp) genome assembly of a representative North American bullfrog (Rana [Lithobates] catesbeiana). The genome contains over 22,000 predicted protein-coding genes and 6,223 candidate long noncoding RNAs (lncRNAs). RNA-Seq experiments show thyroid hormone causes widespread transcriptional change among protein-coding and putative lncRNA genes. This initial bullfrog draft genome will serve as a key resource with broad utility including amphibian research, developmental biology, and environmental research.
Affiliation(s)
- S Austin Hammond
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Benjamin P Vandervalk
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Erdi Kucuk
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Hamza Khan
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Ewan A Gibb
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Pawan Pandoh
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Heather Kirk
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Yongjun Zhao
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Martin Jones
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Andrew J Mungall
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Robin Coope
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Stephen Pleasance
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Richard A Moore
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Robert A Holt
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| | - Jessica M Round
- Department of Biochemistry and Microbiology, University of Victoria, Petch Bldg Room 207, 3800 Finnerty Road, Victoria, BC, Canada, V8P 5C2
| | - Sara Ohora
- Department of Biochemistry and Microbiology, University of Victoria, Petch Bldg Room 207, 3800 Finnerty Road, Victoria, BC, Canada, V8P 5C2
| | - Branden V Walle
- Department of Biochemistry and Microbiology, University of Victoria, Petch Bldg Room 207, 3800 Finnerty Road, Victoria, BC, Canada, V8P 5C2
| | - Nik Veldhoen
- Department of Biochemistry and Microbiology, University of Victoria, Petch Bldg Room 207, 3800 Finnerty Road, Victoria, BC, Canada, V8P 5C2
| | - Caren C Helbing
- Department of Biochemistry and Microbiology, University of Victoria, Petch Bldg Room 207, 3800 Finnerty Road, Victoria, BC, Canada, V8P 5C2
| | - Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer Agency, 570 West 7th Ave - Suite 100, Vancouver, BC, Canada, V5Z 4S6
| |
|
42
|
|
43
|
Chu J, Mohamadi H, Warren RL, Yang C, Birol I. Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art. Bioinformatics 2017; 33:1261-1270. [PMID: 28003261 PMCID: PMC5408847 DOI: 10.1093/bioinformatics/btw811] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 09/03/2016] [Accepted: 12/16/2016] [Indexed: 01/23/2023]
Abstract
Identifying overlaps between error-prone long reads, specifically those from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PB), is essential for certain downstream applications, including error correction and de novo assembly. Though akin to the read-to-reference alignment problem, read-to-read overlap detection is a distinct problem that can benefit from specialized algorithms that perform efficiently and robustly on high error rate long reads. Here, we review the current state-of-the-art read-to-read overlap tools for error-prone long reads, including BLASR, DALIGNER, MHAP, GraphMap and Minimap. These specialized bioinformatics tools differ not just in their algorithmic designs and methodology, but also in their robustness of performance on a variety of datasets, time and memory efficiency and scalability. We highlight the algorithmic features of these tools, as well as their potential issues and biases when utilizing any particular method. To supplement our review of the algorithms, we benchmarked these tools, tracking their resource needs and computational performance, and assessed the specificity and sensitivity of each. In the versions of the tools tested, we observed that Minimap is the most computationally efficient, specific and sensitive method on the ONT datasets tested; whereas GraphMap and DALIGNER are the most specific and sensitive methods on the tested PB datasets. The concepts surveyed may apply to future sequencing technologies, as scalability is becoming more relevant with increased sequencing throughput. Contact: cjustin@bcgsc.ca, ibirol@bcgsc.ca. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Justin Chu
- University of British Columbia, Vancouver, BC, Canada
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
- To whom correspondence should be addressed.
| | - Hamid Mohamadi
- University of British Columbia, Vancouver, BC, Canada
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - René L Warren
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Chen Yang
- University of British Columbia, Vancouver, BC, Canada
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Inanç Birol
- University of British Columbia, Vancouver, BC, Canada
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
- Simon Fraser University, Burnaby, BC, Canada
- To whom correspondence should be addressed.
| |
|
44
|
Yang C, Chu J, Warren RL, Birol I. NanoSim: nanopore sequence read simulator based on statistical characterization. Gigascience 2017; 6:1-6. [PMID: 28327957 PMCID: PMC5530317 DOI: 10.1093/gigascience/gix010] [Citation(s) in RCA: 106] [Impact Index Per Article: 15.1] [Received: 10/26/2016] [Revised: 01/12/2017] [Accepted: 02/21/2017] [Indexed: 01/19/2023]
Abstract
Background The MinION sequencing instrument from Oxford Nanopore Technologies (ONT) produces long read lengths from single-molecule sequencing - valuable features for detailed genome characterization. To realize the potential of this platform, a number of groups are developing bioinformatics tools tuned for the unique characteristics of its data. We note that these development efforts would benefit from a simulator software, the output of which could be used to benchmark analysis tools. Results Here, we introduce NanoSim, a fast and scalable read simulator that captures the technology-specific features of ONT data and allows for adjustments upon improvement of nanopore sequencing technology. The first step of NanoSim is read characterization, which provides a comprehensive alignment-based analysis and generates a set of read profiles serving as the input to the next step, the simulation stage. The simulation stage uses the model built in the previous step to produce in silico reads for a given reference genome. NanoSim is written in Python and R. The source files and manual are available at the Genome Sciences Centre website: http://www.bcgsc.ca/platform/bioinfo/software/nanosim. Conclusion In this work, we model the base-calling errors of ONT reads to inform the simulation of sequences with similar characteristics. We showcase the performance of NanoSim on publicly available datasets generated using the R7 and R7.3 chemistries and different sequencing kits and compare the resulting synthetic reads to those of other long-sequence simulators and experimental ONT reads. We expect NanoSim to have an enabling role in the field and benefit the development of scalable next-generation sequencing technologies for the long nanopore reads, including genome assembly, mutation detection, and even metagenomic analysis software.
Affiliation(s)
- Chen Yang
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, 570 W 7th Avenue, V5Z 4S6 Vancouver, Canada
- Faculty of Science, University of British Columbia, Vancouver, Canada
| | - Justin Chu
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, 570 W 7th Avenue, V5Z 4S6 Vancouver, Canada
- Faculty of Science, University of British Columbia, Vancouver, Canada
| | - René L Warren
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, 570 W 7th Avenue, V5Z 4S6 Vancouver, Canada
| | - Inanç Birol
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, 570 W 7th Avenue, V5Z 4S6 Vancouver, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, Canada
- School of Computer Science, Simon Fraser University, Burnaby, Canada
| |
|
45
|
Coombe L, Warren RL, Jackman SD, Yang C, Vandervalk BP, Moore RA, Pleasance S, Coope RJ, Bohlmann J, Holt RA, Jones SJM, Birol I. Assembly of the Complete Sitka Spruce Chloroplast Genome Using 10X Genomics' GemCode Sequencing Data. PLoS One 2016; 11:e0163059. [PMID: 27632164 PMCID: PMC5025161 DOI: 10.1371/journal.pone.0163059] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 03/15/2016] [Accepted: 09/01/2016] [Indexed: 11/19/2022]
Abstract
The linked read sequencing library preparation platform by 10X Genomics produces barcoded sequencing libraries, which are subsequently sequenced using the Illumina short read sequencing technology. In this new approach, long fragments of DNA are partitioned into separate micro-reactions, where the same index sequence is incorporated into each of the sequencing fragment inserts derived from a given long fragment. In this study, we exploited this property by using reads from index sequences associated with a large number of reads, to assemble the chloroplast genome of the Sitka spruce tree (Picea sitchensis). Here we report on the first Sitka spruce chloroplast genome assembled exclusively from P. sitchensis genomic libraries prepared using the 10X Genomics protocol. We show that the resulting 124,049 base pair long genome shares high sequence similarity with the related white spruce and Norway spruce chloroplast genomes, but diverges substantially from a previously published P. sitchensis- P. thunbergii chimeric genome. The use of reads from high-frequency indices enabled separation of the nuclear genome reads from that of the chloroplast, which resulted in the simplification of the de Bruijn graphs used at the various stages of assembly.
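The read-selection idea in this abstract (keeping reads whose index sequence is shared by many reads) might be sketched as follows. This is only an illustration under stated assumptions: the function name `select_high_frequency_reads`, the `(barcode, sequence)` tuple layout, and the threshold parameter are hypothetical and are not taken from the paper or from any 10X Genomics software.

```python
from collections import Counter


def select_high_frequency_reads(reads, min_reads_per_barcode):
    """Keep reads whose barcode (index sequence) occurs at least
    `min_reads_per_barcode` times in the input.

    `reads` is a list of (barcode, sequence) tuples; this layout is an
    assumption made for the example, not a real file format.
    """
    # Count how many reads carry each barcode.
    counts = Counter(barcode for barcode, _ in reads)
    # Retain only reads from barcodes at or above the threshold.
    return [(barcode, seq) for barcode, seq in reads
            if counts[barcode] >= min_reads_per_barcode]


if __name__ == "__main__":
    example = [("AAAC", "ACGT"), ("AAAC", "CGTA"), ("AAAC", "GTAC"),
               ("GGTT", "TTTT")]
    for barcode, seq in select_high_frequency_reads(example, 3):
        print(barcode, seq)
```

In practice the threshold would have to be chosen from the observed distribution of reads per index; the value used above is arbitrary.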
Affiliation(s)
- Lauren Coombe
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - René L. Warren
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
- * E-mail: (RW); (IB)
| | - Shaun D. Jackman
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Chen Yang
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Benjamin P. Vandervalk
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Richard A. Moore
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Stephen Pleasance
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Robin J. Coope
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Joerg Bohlmann
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
| | - Robert A. Holt
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Steven J. M. Jones
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Inanc Birol
- Canada’s Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
- * E-mail: (RW); (IB)
| |
|
46
|
Jackman SD, Warren RL, Gibb EA, Vandervalk BP, Mohamadi H, Chu J, Raymond A, Pleasance S, Coope R, Wildung MR, Ritland CE, Bousquet J, Jones SJM, Bohlmann J, Birol I. Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation. Genome Biol Evol 2015; 8:29-41. [PMID: 26645680 PMCID: PMC4758241 DOI: 10.1093/gbe/evv244] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Indexed: 11/12/2022]
Abstract
The genome sequences of the plastid and mitochondrion of white spruce (Picea glauca) were assembled from whole-genome shotgun sequencing data using ABySS. The sequencing data contained reads from both the nuclear and organellar genomes, and reads of the organellar genomes were abundant in the data as each cell harbors hundreds of mitochondria and plastids. Hence, assembly of the 123-kb plastid and 5.9-Mb mitochondrial genomes was accomplished by analyzing data sets primarily representing low coverage of the nuclear genome. The assembled organellar genomes were annotated for their coding genes, ribosomal RNA, and transfer RNA. Transcript abundances of the mitochondrial genes were quantified in three developmental tissues and five mature tissues using data from RNA-seq experiments. C-to-U RNA editing was observed in the majority of mitochondrial genes, and in four genes, editing events were noted to modify ACG codons to create cryptic AUG start codons. The informatics methodology presented in this study should prove useful to assemble organellar genomes of other plant species using whole-genome shotgun sequencing data.
Affiliation(s)
- Shaun D Jackman
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Ewan A Gibb
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Benjamin P Vandervalk
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Hamid Mohamadi
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Justin Chu
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Anthony Raymond
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Stephen Pleasance
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Robin Coope
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
| | - Mark R Wildung
- School of Molecular Biosciences, Washington State University
| | - Carol E Ritland
- Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC, Canada
| | - Jean Bousquet
- Department of Forest and Environmental Genomics, Université Laval, Québec, QC, Canada
| | - Steven J M Jones
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
- School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
| | - Joerg Bohlmann
- Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC, Canada
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, Canada
- Department of Botany, University of British Columbia, Vancouver, BC, Canada
| | - Inanç Birol
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
- School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
- Department of Computer Science, University of British Columbia, Vancouver, BC, Canada
| |
|
47
|
Vandervalk BP, Yang C, Xue Z, Raghavan K, Chu J, Mohamadi H, Jackman SD, Chiu R, Warren RL, Birol I. Konnector v2.0: pseudo-long reads from paired-end sequencing data. BMC Med Genomics 2015; 8 Suppl 3:S1. [PMID: 26399504 PMCID: PMC4582294 DOI: 10.1186/1755-8794-8-s3-s1] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 01/20/2023]
Abstract
Background Reading the nucleotides from two ends of a DNA fragment is called paired-end tag (PET) sequencing. When the fragment length is longer than the combined read length, there remains a gap of unsequenced nucleotides between read pairs. If the target in such experiments is sequenced at a level to provide redundant coverage, it may be possible to bridge these gaps using bioinformatics methods. Konnector is a local de novo assembly tool that addresses this problem. Here we report on version 2.0 of our tool. Results Konnector uses a probabilistic and memory-efficient data structure called Bloom filter to represent a k-mer spectrum - all possible sequences of length k in an input file, such as the collection of reads in a PET sequencing experiment. It performs look-ups to this data structure to construct an implicit de Bruijn graph, which describes (k-1) base pair overlaps between adjacent k-mers. It traverses this graph to bridge the gap between a given pair of flanking sequences. Conclusions Here we report the performance of Konnector v2.0 on simulated and experimental datasets, and compare it against other tools with similar functionality. We note that, representing k-mers with 1.5 bytes of memory on average, Konnector can scale to very large genomes. With our parallel implementation, it can also process over a billion bases on commodity hardware.
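The approach summarized in this abstract (a Bloom filter holding the k-mer spectrum, an implicit de Bruijn graph queried through membership look-ups, and a graph traversal between two flanking sequences) can be illustrated with a toy sketch. This is not Konnector's implementation: the class `BloomFilter`, the functions `load_kmers` and `bridge_gap`, the hash scheme, the breadth-first search, and all parameter values are assumptions chosen for the example.

```python
from collections import deque
import hashlib


class BloomFilter:
    """Bit array queried through several hashes; may give false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # One salted BLAKE2b digest per hash function (illustrative choice).
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode(), salt=bytes([seed])).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def load_kmers(reads, k, bloom):
    """Insert every k-mer of every read into the Bloom filter."""
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(read[i:i + k])


def bridge_gap(left, right, k, bloom, max_len=200):
    """Breadth-first walk of the implicit de Bruijn graph from the last
    k-mer of `left` to the first k-mer of `right`; returns the joined
    sequence, or None if no path is found within `max_len` bases."""
    start, goal = left[-k:], right[:k]
    queue = deque([(start, left)])
    seen = {start}
    while queue:
        kmer, path = queue.popleft()
        if kmer == goal:
            return path + right[k:]
        if len(path) >= max_len:
            continue
        # Neighbours are found by membership queries, not stored edges.
        for base in "ACGT":
            nxt = kmer[1:] + base
            if nxt in bloom and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + base))
    return None
```

A usage sketch: build the filter from overlapping reads with `load_kmers(reads, 5, bloom)`, then call `bridge_gap(left_flank, right_flank, 5, bloom)`. Because a Bloom filter can report k-mers that were never inserted, a real tool needs additional safeguards against false-positive branches, which this sketch omits.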
|
48
|
Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJM, Birol I. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. Gigascience 2015; 4:35. [PMID: 26244089 PMCID: PMC4524009 DOI: 10.1186/s13742-015-0076-3] [Citation(s) in RCA: 121] [Impact Index Per Article: 13.4] [Received: 05/28/2015] [Accepted: 07/29/2015] [Indexed: 12/05/2022]
Abstract
Background Owing to the complexity of the assembly problem, we do not yet have complete genome sequences. The difficulty in assembling reads into finished genomes is exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. In this regard, established and emerging long read technologies show great promise, but their current associated higher error rates typically require computational base correction and/or additional bioinformatics pre-processing before they can be of value. Results We present LINKS, the Long Interval Nucleotide K-mer Scaffolder algorithm, a method that makes use of the sequence properties of nanopore sequence data and other error-containing sequence data, to scaffold high-quality genome assemblies, without the need for read alignment or base correction. Here, we show how the contiguity of an ABySS Escherichia coli K-12 genome assembly can be increased greater than five-fold by the use of beta-released Oxford Nanopore Technologies Ltd. long reads and how LINKS leverages long-range information in Saccharomyces cerevisiae W303 nanopore reads to yield assemblies whose resulting contiguity and correctness are on par with or better than that of competing applications. We also present the re-scaffolding of the colossal white spruce (Picea glauca) draft assembly (PG29, 20 Gbp) and demonstrate how LINKS scales to larger genomes. Conclusions This study highlights the present utility of nanopore reads for genome scaffolding in spite of their current limitations, which are expected to diminish as the nanopore sequencing technology advances. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts. Electronic supplementary material The online version of this article (doi:10.1186/s13742-015-0076-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- René L Warren
- BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
| | - Chen Yang
- BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
| | - Benjamin P Vandervalk
- BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
| | - Bahar Behsaz
- BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
| | - Albert Lagman
- BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
| | - Steven J M Jones
- BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
| | - Inanç Birol
- BC Cancer Agency, Michael Smith Genome Sciences Centre, Vancouver, British Columbia V5Z 4S6 Canada
| |
|
49
|
Abstract
BACKGROUND While next-generation sequencing technologies have made sequencing genomes faster and more affordable, deciphering the complete genome sequence of an organism remains a significant bioinformatics challenge, especially for large genomes. Low sequence coverage, repetitive elements and short read length make de novo genome assembly difficult, often resulting in sequence and/or fragment "gaps" - uncharacterized nucleotide (N) stretches of unknown or estimated lengths. Some of these gaps can be closed by re-processing latent information in the raw reads. Even though there are several tools for closing gaps, they do not easily scale up to processing billion base pair genomes. RESULTS Here we describe Sealer, a tool designed to close gaps within assembly scaffolds by navigating de Bruijn graphs represented by space-efficient Bloom filter data structures. We demonstrate how it scales to successfully close 50.8% and 13.8% of gaps in human (3 Gbp) and white spruce (20 Gbp) draft assemblies in under 30 and 27 h, respectively - a feat that is not possible with other leading tools with the breadth of data used in our study. CONCLUSION Sealer is an automated finishing application that uses the succinct Bloom filter representation of a de Bruijn graph to close gaps in draft assemblies, including that of very large genomes. We expect Sealer to have broad utility for finishing genomes across the tree of life, from bacterial genomes to large plant genomes and beyond. Sealer is available for download at https://github.com/bcgsc/abyss/tree/sealer-release.
Affiliation(s)
- Daniel Paulino
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.
| | - René L Warren
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.
| | - Benjamin P Vandervalk
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.
| | - Anthony Raymond
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.
| | - Shaun D Jackman
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.
| | - Inanç Birol
- Canada's Michael Smith Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada.
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada.
| |
|
50
|
Warren RL, Keeling CI, Yuen MMS, Raymond A, Taylor GA, Vandervalk BP, Mohamadi H, Paulino D, Chiu R, Jackman SD, Robertson G, Yang C, Boyle B, Hoffmann M, Weigel D, Nelson DR, Ritland C, Isabel N, Jaquish B, Yanchuk A, Bousquet J, Jones SJM, MacKay J, Birol I, Bohlmann J. Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism. Plant J 2015; 83:189-212. [PMID: 26017574 DOI: 10.1111/tpj.12886] [Citation(s) in RCA: 120] [Impact Index Per Article: 13.3] [Received: 02/24/2015] [Accepted: 05/15/2015] [Indexed: 05/21/2023]
Abstract
White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree-breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation.
Affiliation(s)
- René L Warren
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Christopher I Keeling
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Macaire Man Saint Yuen
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Anthony Raymond
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Greg A Taylor
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Benjamin P Vandervalk
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Hamid Mohamadi
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Daniel Paulino
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Readman Chiu
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Shaun D Jackman
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Gordon Robertson
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Chen Yang
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
| | - Brian Boyle
- Department of Wood and Forest Sciences, Université Laval, Québec, QC, G1V 0A6, Canada
| | - Margarete Hoffmann
- Max Planck Institute for Developmental Biology, Spemannstrasse 35, 72076, Tübingen, Germany
| | - Detlef Weigel
- Max Planck Institute for Developmental Biology, Spemannstrasse 35, 72076, Tübingen, Germany
| | - David R Nelson
- Department of Microbiology, Immunology and Biochemistry, University of Tennessee Health Science Center, Memphis, TN, 38163, USA
| | - Carol Ritland
- Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Nathalie Isabel
- Natural Resources Canada, Laurentian Forestry Centre, Québec, QC, G1V 4C7, Canada
| | - Barry Jaquish
- British Columbia Ministry of Forests, Lands, and Natural Resource Operations, Victoria, BC, V8W 9C2, Canada
| | - Alvin Yanchuk
- British Columbia Ministry of Forests, Lands, and Natural Resource Operations, Victoria, BC, V8W 9C2, Canada
| | - Jean Bousquet
- Department of Wood and Forest Sciences, Université Laval, Québec, QC, G1V 0A6, Canada
| | - Steven J M Jones
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada
- School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
| | - John MacKay
- Department of Wood and Forest Sciences, Université Laval, Québec, QC, G1V 0A6, Canada
- Department of Plant Sciences, University of Oxford, South Parks Road, Oxford, OX1 3RB, UK
| | - Inanc Birol
- Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, BC, V5Z 4S6, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, V6H 3N1, Canada
- School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
| | - Joerg Bohlmann
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Department of Forest and Conservation Sciences, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| |
|