1
Bai D, Chen T, Xun J, Ma C, Luo H, Yang H, Cao C, Cao X, Cui J, Deng Y, Deng Z, Dong W, Dong W, Du J, Fang Q, Fang W, Fang Y, Fu F, Fu M, Fu Y, Gao H, Ge J, Gong Q, Gu L, Guo P, Guo Y, Hai T, Liu H, He J, He Z, Hou H, Huang C, Ji S, Jiang C, Jiang G, Jiang L, Jin LN, Kan Y, Kang D, Kou J, Lam K, Li C, Li C, Li F, Li L, Li M, Li X, Li Y, Li Z, Liang J, Lin Y, Liu C, Liu D, Liu F, Liu J, Liu T, Liu T, Liu X, Liu Y, Liu B, Liu M, Lou W, Luan Y, Luo Y, Lv H, Ma T, Mai Z, Mo J, Niu D, Pan Z, Qi H, Shi Z, Song C, Sun F, Sun Y, Tian S, Wan X, Wang G, Wang H, Wang H, Wang H, Wang J, Wang J, Wang K, Wang L, Wang S, Wang X, Wang Y, Xiao Z, Xing H, Xu Y, Yan S, Yang L, Yang S, Yang Y, Yao X, Yousuf S, Yu H, Lei Y, Yuan Z, Zeng M, Zhang C, Zhang C, Zhang H, Zhang J, Zhang N, Zhang T, Zhang Y, Zhang Y, Zhang Z, Zhou M, Zhou Y, Zhu C, Zhu L, Zhu Y, Zhu Z, Zou H, Zuo A, Dong W, Wen T, Chen S, Li G, Gao Y, Liu Y. EasyMetagenome: A user-friendly and flexible pipeline for shotgun metagenomic analysis in microbiome research. IMETA 2025; 4:e70001.
[PMID: 40027489 PMCID: PMC11865343 DOI: 10.1002/imt2.70001] [Received: 11/29/2024] [Accepted: 01/22/2025] [Indexed: 03/05/2025]
Abstract
Shotgun metagenomics has become a pivotal technology in microbiome research, enabling in-depth, high-resolution analysis of microbial communities at both the taxonomic and functional levels. This approach provides valuable insights into microbial diversity, interactions, and their roles in health and disease. However, the complexity of data processing and the need for reproducibility pose significant challenges to researchers. To address these challenges, we developed EasyMetagenome, a user-friendly pipeline that supports multiple analysis methods, including quality control and host removal, read-based analysis, assembly-based analysis, and binning, along with advanced genome analysis. The pipeline also features customizable settings, comprehensive data visualizations, and detailed parameter explanations, ensuring its adaptability across a wide range of data scenarios. Looking forward, we aim to refine the pipeline by addressing host contamination issues, optimizing workflows for third-generation sequencing data, and integrating emerging technologies like deep learning and network analysis, to further enhance microbiome insights and data accuracy. EasyMetagenome is freely available at https://github.com/YongxinLiu/EasyMetagenome.
Affiliation(s)
- Defeng Bai
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- Tong Chen
- State Key Laboratory for Quality Ensurance and Sustainable Use of Dao‐di Herbs, National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing, China
- Jiani Xun
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- Chuang Ma
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- School of Horticulture, Anhui Agricultural University, Hefei, China
- Hao Luo
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- Haifei Yang
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- College of Life Sciences, Qingdao Agricultural University, Qingdao, China
- Chen Cao
- Key Laboratory for Bio‐Electromagnetic Environment and Advanced Medical Theranostics, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing, Jiangsu, China
- Xiaofeng Cao
- Center for Water and Ecology, State Key Joint Laboratory of Environment Simulation and Pollution Control, School of Environment, Tsinghua University, Beijing, China
- Jianzhou Cui
- Immunology Translational Research Programme, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Yuan‐Ping Deng
- Research Center for Parasites and Vectors, College of Veterinary Medicine, Hunan Agricultural University, Changsha, Hunan, China
- Zhaochao Deng
- Institute of Marine Biology and Pharmacology, Ocean College, Zhejiang University, Zhoushan, Zhejiang, China
- Wenxin Dong
- Agro‐Environmental Protection Institute, Ministry of Agriculture and Rural Affairs, Tianjin, China
- Wenxue Dong
- Key Laboratory for Molecular Genetic Mechanisms and Intervention Research on High Altitude Disease of Tibet Autonomous Region, School of Medicine, Xizang Minzu University, Xianyang, China
- Juan Du
- Karolinska Institutet, Department of Microbiology, Tumor and Cell Biology, Stockholm, Sweden
- Qunkai Fang
- College of Environment, Zhejiang University of Technology, Hangzhou, China
- Wei Fang
- College of Environmental and Resource Sciences, Zhejiang Agriculture and Forestry University, Hangzhou, China
- Yue Fang
- The College of Forestry, Beijing Forestry University, Beijing, China
- Fangtian Fu
- Department of Bioinformatics, Hangzhou VicrobX Biotech Co., Ltd, Hangzhou, Zhejiang, China
- Min Fu
- Anhui Province Key Laboratory of Integrated Pest Management on Crops, College of Plant Protection, Anhui Agricultural University, Hefei, China
- Yi‐Tian Fu
- Xiangya School of Basic Medicine, Central South University, Changsha, Hunan, China
- He Gao
- Institute of Microbiology, Guangdong Academy of Sciences, Guangzhou, Guangdong, China
- Jingping Ge
- Engineering Research Center of Agricultural Microbiology Technology, Ministry of Education, School of Life Sciences, Heilongjiang University, Harbin, China
- Qinglong Gong
- College of Animal Science and Technology, Jilin Agricultural University, Changchun, Jilin, China
- Lunda Gu
- Sansure Biotech Incorporation, Changsha, Hunan, China
- Peng Guo
- School of Food Science and Biology, Hebei University of Science and Technology, Shijiazhuang, Hebei, China
- Yuhao Guo
- Engineering Research Center of Agricultural Microbiology Technology, Ministry of Education, School of Life Sciences, Heilongjiang University, Harbin, China
- Tang Hai
- School of Life Sciences, Shanxi Datong University, Datong, China
- Hao Liu
- Department of Health & Environmental Sciences, Xi'an Jiaotong‐Liverpool University, Suzhou, Jiangsu, China
- Jieqiang He
- College of Horticulture, Northwest A&F University, Yangling, Shaanxi, China
- Zi‐Yang He
- School of Agriculture, Food and Ecosystem Sciences, Faculty of Science, The University of Melbourne, VIC, Australia
- Huiyu Hou
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- Can Huang
- Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa‐shi, Chiba, Japan
- Shuai Ji
- Institute of Biotechnology, Helsinki Institute of Life Science, University of Helsinki, Helsinki, Finland
- Gui‐Lai Jiang
- Suzhou Medical College, Soochow University, Suzhou, Jiangsu, China
- Lingjuan Jiang
- Biomarker Discovery and Validation Facility, Institute of Clinical Medicine, Peking Union Medical College Hospital, Beijing, China
- Ling N. Jin
- Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University, Hong Kong, China
- Yuhe Kan
- College of Biology and Oceanography, Weifang University, Weifang, Shandong, China
- Da Kang
- College of Environmental Science and Engineering, Beijing University of Technology, Beijing, China
- Jin Kou
- College of Environmental and Municipal Engineering, Lanzhou Jiaotong University, Lanzhou, China
- Ka‐Lung Lam
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong, China
- Changchao Li
- Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University, Hong Kong, China
- Chong Li
- Department of Renewable Resources, University of Alberta, Edmonton, Alberta, Canada
- Fuyi Li
- School of Geographical Sciences, Northeast Normal University, Changchun, Jilin, China
- Liwei Li
- Department of Gastroenterology, The Second Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, China
- Miao Li
- Synaura Biotechnology (Shanghai) Co., Ltd., Shanghai, China
- Xin Li
- School of Public Health, University of Michigan, Ann Arbor, Michigan, USA
- Ye Li
- Institute of Soil Science, Chinese Academy of Sciences, Nanjing, Jiangsu, China
- Zheng‐Tao Li
- School of Art and Archaeology of Zhejiang University, Zhejiang, China
- Jing Liang
- College of Animal Science and Technology, Guangxi University, Nanning, China
- Yongxin Lin
- Fujian Provincial Key Laboratory for Subtropical Resources and Environment, Fujian Normal University, Fuzhou, China
- Changzhen Liu
- College of Energy and Environmental Engineering, Hebei University of Engineering, Handan, Hebei, China
- Fengqin Liu
- College of Life Sciences, Henan Agricultural University, Zhengzhou, China
- Jia Liu
- College of Life Science, Nankai University, Tianjin, China
- Tianrui Liu
- Jiangxi Province Key Laboratory of Sustainable Utilization of Traditional Chinese Medicine Resources, Institute of Traditional Chinese Medicine Health Industry, China Academy of Chinese Medical Sciences, Jiangxi, China
- Tingting Liu
- Beijing Key Laboratory of Emerging Infectious Diseases, Institute of Infectious Diseases, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Xinyuan Liu
- State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, Anhui, China
- Yaqun Liu
- School of Life Sciences and Food Technology, Hanshan Normal University, Chaozhou, China
- Minghao Liu
- State Key Laboratory of Microbial Resources, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- Wenbo Lou
- College of Animal Science and Technology, Jilin Agricultural University, Changchun, Jilin, China
- Yaning Luan
- The College of Forestry, Beijing Forestry University, Beijing, China
- Yuanyuan Luo
- State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, Anhui, China
- Hujie Lv
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- Department of Life Sciences, Imperial College of London, London, UK
- Tengfei Ma
- State Key Laboratory of Herbage Improvement and Grassland Agro‐Ecosystems, Centre for Grassland Microbiome, College of Pastoral Agriculture Science and Technology, Lanzhou University, Lanzhou, Gansu, China
- Zongjiong Mai
- Department of Oncology, The Fifth Affiliated Hospital of Sun Yat‐sen University, Zhuhai, Guangdong, China
- Jiayuan Mo
- College of Animal Science and Technology, Guangxi University, Nanning, China
- Dongze Niu
- National‐Local Joint Engineering Research Center of Biomass Refining and High‐Quality Utilization, Institute of Urban and Rural Mining, Changzhou University, Changzhou, Jiangsu, China
- Zhuo Pan
- Department of Pathology, Affiliated Cancer Hospital of Zhengzhou University, Zhengzhou, China
- Heyuan Qi
- Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- Zhanyao Shi
- College of Water Sciences, Beijing Normal University, Beijing, China
- Fuxiang Sun
- New Direction Biotechnology (Tianjin) Co., Ltd, Tianjin, China
- Yan Sun
- College of Energy and Environmental Engineering, Hebei Key Laboratory of Air Pollution Cause and Impact, Hebei University of Engineering, Handan, China
- Sihui Tian
- Institute of Botany, Chinese Academy of Sciences, Beijing, China
- Xiulin Wan
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- Guoliang Wang
- Institute of Biotechnology, Beijing Academy of Agriculture and Forestry Sciences, Beijing, China
- Hongyang Wang
- National Resource Center for Chinese Materia Medica, China Academy of Chinese Medical Sciences, Jiangsu, China
- Hongyu Wang
- College of Animal Science, Anhui Science and Technology University, Chuzhou, China
- Huanhuan Wang
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China
- Jing Wang
- State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, China
- Jun Wang
- China CDC Key Laboratory of Environment and Population Health, National Institute of Environmental Health, Chinese Center for Disease Control and Prevention, Beijing, China
- Kang Wang
- College of Animal Science and Technology, Yangzhou University, Yangzhou, Jiangsu, China
- Leli Wang
- Key Laboratory of Agro‐Ecological Processes in Subtropical Region, Institute of Subtropical Agriculture, Chinese Academy of Sciences, Changsha, China
- Shao‐kun Wang
- Institute of Ecological Conservation and Restoration, Chinese Academy of Forestry, Beijing, China
- Xinlong Wang
- Beijing Key Laboratory of Emerging Infectious Diseases, Institute of Infectious Diseases, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Yao Wang
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- Zufei Xiao
- State Key Laboratory for Ecological Security of Regions and Cities, Institute of Urban Environment, Chinese Academy of Sciences, Xiamen, China
- Huichun Xing
- Center of Liver Diseases Division 3, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Yifan Xu
- Center of Liver Diseases Division 3, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Shu‐yan Yan
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Key Laboratory of Invasive Alien Species Control of Ministry of Agriculture and Rural Affairs, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China
- Li Yang
- Sansure Biotech Incorporation, Changsha, Hunan, China
- Song Yang
- Center of Liver Diseases Division 3, Beijing Ditan Hospital, Capital Medical University, Beijing, China
- Yuanming Yang
- Guangzhou University of Chinese Medicine, Guangzhou, China
- Xiaofang Yao
- Key Laboratory of Agro‐Ecological Processes in Subtropical Region, Institute of Subtropical Agriculture, Chinese Academy of Sciences, Changsha, China
- Salsabeel Yousuf
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- Hao Yu
- Institute of Marine Biology and Pharmacology, Ocean College, Zhejiang University, Zhoushan, Zhejiang, China
- Yu Lei
- Key Laboratory of Livestock Biology, Northwest A&F University, Yangling, Shaanxi, China
- Zhengrong Yuan
- College of Biological Sciences and Technology, Beijing Forestry University, Beijing, China
- Meiyin Zeng
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- Chunfang Zhang
- Institute of Marine Biology and Pharmacology, Ocean College, Zhejiang University, Zhoushan, Zhejiang, China
- Chunge Zhang
- CAS Key Laboratory of Pathogenic Microbiology and Immunology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- Huimin Zhang
- School of Food Science and Technology, Shihezi University, Shihezi, Xinjiang, China
- Na Zhang
- College of Biochemical Engineering, Beijing Union University, Beijing, China
- Tianyuan Zhang
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- Yi‐Bo Zhang
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Key Laboratory of Invasive Alien Species Control of Ministry of Agriculture and Rural Affairs, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing, China
- Yupeng Zhang
- College of Resources and Environmental Sciences, Henan Agricultural University, Zhengzhou, China
- Zheng Zhang
- Tea Research Institute, Chinese Academy of Agricultural Sciences, Hangzhou, Zhejiang, China
- Mingda Zhou
- College of Environmental Science and Engineering, Tongji University, Shanghai, China
- Yuanping Zhou
- Zhanjiang Key Laboratory of Human Microecology and Clinical Translation Research, the Marine Biomedical Research Institute, College of Basic Medicine, Guangdong Medical University, Zhanjiang, Guangdong, China
- Chengshuai Zhu
- School of Art and Archaeology of Zhejiang University, Zhejiang, China
- Lin Zhu
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin, China
- Yue Zhu
- School of Ecology, Environment and Resources, Guangdong University of Technology, Guangzhou, Guangdong, China
- Zhihao Zhu
- Zhanjiang Key Laboratory of Human Microecology and Clinical Translation Research, the Marine Biomedical Research Institute, College of Basic Medicine, Guangdong Medical University, Zhanjiang, Guangdong, China
- Hongqin Zou
- Institute of Agricultural Resources and Regional Planning, Chinese Academy of Agricultural Sciences, Beijing, China
- Anna Zuo
- School of Traditional Chinese Medicine, Southern Medical University, Guangzhou, Guangdong, China
- Wenxuan Dong
- Department of Animal Sciences, Purdue University, West Lafayette, Indiana, USA
- Tao Wen
- College of Resource and Environmental Sciences, Nanjing Agricultural University, Nanjing, Jiangsu, China
- Shifu Chen
- HaploX Biotechnology, Shenzhen, China
- LifeX Institute, School of Medical Technology, Gannan Medical University, Ganzhou, China
- Faculty of Data Science, City University of Macau, Macau, China
- Guoliang Li
- Jiangxi Provincial Key Laboratory of Conservation Biology, College of Forestry, Jiangxi Agricultural University, Nanchang, Jiangxi, China
- Yunyun Gao
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
- Yong‐Xin Liu
- Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, China
2
Haile S, Corbett RD, O’Neill K, Xu J, Smailus DE, Pandoh PK, Bayega A, Bala M, Chuah E, Coope RJN, Moore RA, Mungall KL, Zhao Y, Ma Y, Marra MA, Jones SJM, Mungall AJ. Adaptable and comprehensive approaches for long-read nanopore sequencing of polyadenylated and non-polyadenylated RNAs. Front Genet 2024; 15:1466338. [PMID: 39687742 PMCID: PMC11647301 DOI: 10.3389/fgene.2024.1466338] [Received: 07/17/2024] [Accepted: 11/11/2024] [Indexed: 12/18/2024]
Abstract
The advent of long-read (LR) sequencing technologies has provided a direct opportunity to determine the structure of transcripts, with potential for end-to-end sequencing of full-length RNAs. LR methods that have been described to date include commercial offerings from Oxford Nanopore Technologies (ONT) and Pacific Biosciences. These kits are based on selection of polyadenylated (polyA+) RNAs and/or oligo-dT priming of reverse transcription. Thus, these approaches do not allow comprehensive interrogation of the transcriptome due to their exclusion of non-polyadenylated (polyA-) RNAs. In addition, polyA+ specificity also results in 3'-biased measurements of polyA+ RNAs, especially when the RNA input is partially degraded. To address these limitations of current LR protocols, we modified rRNA depletion protocols that have been used in short-read sequencing, using either a ligation-based method or a template-switch cDNA synthesis-based method to append ONT-specific adaptor sequences, and removing any deliberate fragmentation/shearing of RNA/cDNA. Here, we present comparisons with polyA+ RNA-specific versions of the two approaches, including the ONT PCR-cDNA Barcoding kit. The rRNA depletion protocols displayed higher proportions (30%-50%) of intronic content compared to that of the polyA-specific protocols (5%-8%). In addition, the rRNA depletion protocols enabled ~20%-50% higher detection of expressed genes. Other metrics that were favourable to the rRNA depletion protocols include better coverage of long transcripts, and higher accuracy and reproducibility of expression measurements. Overall, these results indicate that the rRNA depletion-based protocols described here allow the comprehensive characterization of polyadenylated and non-polyadenylated RNAs. While the resulting reads are long enough to help decipher transcript structures, future endeavors are warranted to improve the proportion of individual reads representing end-to-end spanning of transcripts.
Affiliation(s)
- Simon Haile
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Richard D. Corbett
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Kieran O’Neill
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Jing Xu
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Duane E. Smailus
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Pawan K. Pandoh
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Anthony Bayega
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Miruna Bala
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Eric Chuah
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Robin J. N. Coope
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Richard A. Moore
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Karen L. Mungall
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Yongjun Zhao
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Yussanne Ma
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Marco A. Marra
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
- Steven J. M. Jones
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
- Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
- Andrew J. Mungall
- Canada’s Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, Canada
3
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024]
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
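As a minimal illustration of the two concepts this abstract names, the sketch below counts overlapping k-mers in a DNA string and lists the k-mers that never occur in it (nullomers with respect to that single sequence). It is not taken from the review; the function names `count_kmers` and `nullomers`, the toy sequence, and the restriction to the `ACGT` alphabet are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def count_kmers(seq, k):
    """Count every overlapping substring of length k in seq."""
    seq = seq.upper()
    # A sequence of length n has n - k + 1 overlapping k-mers (none if n < k).
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def nullomers(seq, k, alphabet="ACGT"):
    """Return, in sorted order, all k-mers over the alphabet absent from seq."""
    present = count_kmers(seq, k)
    candidates = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(kmer for kmer in candidates if kmer not in present)


# Toy input (hypothetical): real analyses run over whole genomes or read sets.
counts = count_kmers("ACGTACGT", 2)
absent = nullomers("ACGTACGT", 2)
```

Enumerating all 4^k candidates is only practical for small k; for larger k or genome-scale input, dedicated k-mer counting tools are the usual choice.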
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
4
Pektas A, Panitz F, Thomsen B. TrAnnoScope: A Modular Snakemake Pipeline for Full-Length Transcriptome Analysis and Functional Annotation. Genes (Basel) 2024; 15:1547. [PMID: 39766814 PMCID: PMC11727683 DOI: 10.3390/genes15121547] [Received: 11/06/2024] [Revised: 11/28/2024] [Accepted: 11/28/2024] [Indexed: 01/15/2025]
Abstract
Background/Objectives: Transcriptome assembly and functional annotation are essential in understanding gene expression and biological function. Nevertheless, many existing pipelines lack the flexibility to integrate both short- and long-read sequencing data or fail to provide a complete, customizable workflow for transcriptome analysis, particularly for non-model organisms. Methods: We present TrAnnoScope, a transcriptome analysis pipeline designed to process Illumina short-read and PacBio long-read data. The pipeline provides a complete, customizable workflow to generate high-quality, full-length (FL) transcripts with broad functional annotation. Its modular design allows users to adapt specific analysis steps for other sequencing platforms or data types. The pipeline encompasses steps from quality control to functional annotation, employing tools and established databases such as SwissProt, Pfam, Gene Ontology (GO), the Kyoto Encyclopedia of Genes and Genomes (KEGG), and Eukaryotic Orthologous Groups (KOG). As a case study, TrAnnoScope was applied to RNA-Seq and Iso-Seq data from zebra finch brain, ovary, and testis tissue. Results: The zebra finch transcriptome generated by TrAnnoScope from the brain, ovary, and testis tissue demonstrated strong alignment with the reference genome (99.63%), and it was found that 93.95% of the matched protein sequences in the zebra finch proteome were captured as nearly complete. Functional annotation provided matches to known protein databases and assigned relevant functional terms to the majority of the transcripts. Conclusions: TrAnnoScope successfully integrates short and long sequencing technologies to generate transcriptomes with minimal user input. Its modularity and ease of use make it a valuable tool for researchers analyzing complex datasets, particularly for non-model organisms.
Affiliation(s)
- Aysevil Pektas
- Department of Molecular Biology and Genetics, Aarhus University, 8000 Aarhus, Denmark
- Frank Panitz
- Department of Molecular Biology and Genetics, Aarhus University, 8000 Aarhus, Denmark
- Applied Statistical Methods, Natural Resources Institute Finland (Luke), 20520 Turku, Finland
- Bo Thomsen
- Department of Molecular Biology and Genetics, Aarhus University, 8000 Aarhus, Denmark
5
Kang X, Zhang W, Li Y, Luo X, Schönhuth A. HyLight: Strain aware assembly of low coverage metagenomes. Nat Commun 2024; 15:8665. [PMID: 39375348 PMCID: PMC11458758 DOI: 10.1038/s41467-024-52907-0] [Received: 12/21/2023] [Accepted: 09/23/2024] [Indexed: 10/09/2024]
Abstract
Different strains of identical species can vary substantially in terms of their spectrum of biomedically relevant phenotypes. Reconstructing the genomes of microbial communities at the level of their strains poses significant challenges, because sequencing errors can obscure strain-specific variants. Next-generation sequencing (NGS) reads are too short to resolve complex genomic regions. Third-generation sequencing (TGS) reads, although longer, are prone to higher error rates or substantially more expensive. Limiting TGS coverage to reduce costs compromises the accuracy of the assemblies. This explains why prior approaches suffer from losses in strain awareness or accuracy, tend toward excessive costs, or show combinations thereof. We introduce HyLight, a metagenome assembly approach that addresses these challenges by exploiting the complementary strengths of TGS and NGS data. HyLight employs strain-resolved overlap graphs (OG) to accurately reconstruct individual strains within microbial communities. Our experiments demonstrate that HyLight produces strain-aware and contiguous assemblies at minimal error content, while significantly reducing costs by utilizing low-coverage TGS data. HyLight achieves an average improvement of 19.05% in preserving strain identity and demonstrates near-complete strain awareness across diverse datasets. In summary, HyLight offers considerable advances in metagenome assembly, insofar as it delivers significantly enhanced strain awareness, contiguity, and accuracy without the typical compromises observed in existing approaches.
Affiliation(s)
- Xiongbin Kang
- College of Biology, Hunan University, Changsha, China
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
| | - Wenhai Zhang
- College of Biology, Hunan University, Changsha, China
| | - Yichen Li
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
| | - Xiao Luo
- College of Biology, Hunan University, Changsha, China.
| | - Alexander Schönhuth
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany.
|
6
|
Kumari P, Kaur M, Dindhoria K, Ashford B, Amarasinghe SL, Thind AS. Advances in long-read single-cell transcriptomics. Hum Genet 2024; 143:1005-1020. [PMID: 38787419 PMCID: PMC11485027 DOI: 10.1007/s00439-024-02678-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Accepted: 05/07/2024] [Indexed: 05/25/2024]
Abstract
Long-read single-cell transcriptomics (scRNA-Seq) is revolutionizing the way we profile heterogeneity in disease. Traditional short-read scRNA-Seq methods are limited in their ability to provide complete transcript coverage, resolve isoforms, and identify novel transcripts. The scRNA-Seq protocols developed for long-read sequencing platforms overcome these limitations by enabling the characterization of full-length transcripts. Long-read scRNA-Seq techniques initially suffered from poor accuracy compared to short-read scRNA-Seq. However, with improvements in accuracy, accessibility, and cost efficiency, long reads are gaining popularity in the field of scRNA-Seq. This review details the advances in long-read scRNA-Seq, with an emphasis on library preparation protocols and downstream bioinformatics analysis tools.
Affiliation(s)
- Pallawi Kumari
- Institute of Microbial Technology, Council of Scientific and Industrial Research, Chandigarh, India
| | - Manmeet Kaur
- Institute of Microbial Technology, Council of Scientific and Industrial Research, Chandigarh, India
| | - Kiran Dindhoria
- Institute of Microbial Technology, Council of Scientific and Industrial Research, Chandigarh, India
| | - Bruce Ashford
- Illawarra Shoalhaven Local Health District (ISLHD), NSW Health, Wollongong, NSW, Australia
| | - Shanika L Amarasinghe
- Monash Biomedical Discovery Institute, Monash University, Clayton, VIC, 3800, Australia
- Walter and Eliza Hall Institute of Medical Research, 1G, Royal Parade, Parkville, VIC, 3025, Australia
| | - Amarinder Singh Thind
- Illawarra Shoalhaven Local Health District (ISLHD), NSW Health, Wollongong, NSW, Australia.
- The School of Chemistry and Molecular Bioscience (SCMB), University of Wollongong, Loftus St, Wollongong, NSW, 2500, Australia.
|
7
|
Liang Q, Yu T, Kofman E, Jagannatha P, Rhine K, Yee BA, Corbett KD, Yeo GW. High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues. Nat Commun 2024; 15:7067. [PMID: 39152130 PMCID: PMC11329496 DOI: 10.1038/s41467-024-50363-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Accepted: 07/09/2024] [Indexed: 08/19/2024] Open
Abstract
RNA-binding proteins (RBPs) have pivotal functions in RNA metabolism, but current methods are limited in retrieving RBP-RNA interactions within endogenous biological contexts. Here, we develop INSCRIBE (IN situ Sensitive Capture of RNA-protein Interactions in Biological Environments), circumventing the challenges through in situ RNA labeling by precisely directing a purified APOBEC1-nanobody fusion to the RBP of interest. This method enables highly specific RNA-binding site identification across a diverse range of fixed biological samples such as HEK293T cells and mouse brain tissue and accurately identifies the canonical binding motifs of RBFOX2 (UGCAUG) and TDP-43 (UGUGUG) in native cellular environments. Applicable to any RBP with available primary antibodies, INSCRIBE enables sensitive capture of RBP-RNA interactions from ultra-low input equivalent to ~5 cells. The robust, versatile, and sensitive INSCRIBE workflow is particularly beneficial for precious tissues such as clinical samples, empowering the exploration of genuine RBP-RNA interactions in RNA-related disease contexts.
Affiliation(s)
- Qishan Liang
- Department of Chemistry and Biochemistry, University of California San Diego, La Jolla, CA, USA
- Center for RNA Technologies and Therapeutics, University of California San Diego, La Jolla, CA, USA
| | - Tao Yu
- Center for RNA Technologies and Therapeutics, University of California San Diego, La Jolla, CA, USA
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Sanford Stem Cell Institute and Stem Cell Program, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
| | - Eric Kofman
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Sanford Stem Cell Institute and Stem Cell Program, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
| | - Pratibha Jagannatha
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Sanford Stem Cell Institute and Stem Cell Program, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
| | - Kevin Rhine
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Sanford Stem Cell Institute and Stem Cell Program, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
| | - Brian A Yee
- Center for RNA Technologies and Therapeutics, University of California San Diego, La Jolla, CA, USA
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Sanford Stem Cell Institute and Stem Cell Program, University of California San Diego, La Jolla, CA, USA
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA
| | - Kevin D Corbett
- Center for RNA Technologies and Therapeutics, University of California San Diego, La Jolla, CA, USA.
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA.
- Department of Molecular Biology, University of California San Diego, La Jolla, CA, USA.
| | - Gene W Yeo
- Center for RNA Technologies and Therapeutics, University of California San Diego, La Jolla, CA, USA.
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA.
- Sanford Stem Cell Institute and Stem Cell Program, University of California San Diego, La Jolla, CA, USA.
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA.
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA.
|
8
|
Bhowmik O, Rahman T, Kalyanaraman A. Maptcha: an efficient parallel workflow for hybrid genome scaffolding. BMC Bioinformatics 2024; 25:263. [PMID: 39118013 PMCID: PMC11313021 DOI: 10.1186/s12859-024-05878-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 07/22/2024] [Indexed: 08/10/2024] Open
Abstract
BACKGROUND Genome assembly, which involves reconstructing a target genome, relies on scaffolding methods to organize and link partially assembled fragments. The rapid evolution of long read sequencing technologies toward more accurate long reads, coupled with the continued use of short read technologies, has created a unique need for hybrid assembly workflows. The construction of accurate genomic scaffolds in hybrid workflows is complicated due to scale, sequencing technology diversity (e.g., short vs. long reads, contigs or partial assemblies), and repetitive regions within a target genome. RESULTS In this paper, we present a new parallel workflow for hybrid genome scaffolding that allows combining pre-constructed partial assemblies with newly sequenced long reads toward an improved assembly. More specifically, the workflow, called Maptcha, is aimed at generating long scaffolds of a target genome from two sets of input sequences: an already constructed partial assembly of contigs, and a set of newly sequenced long reads. Our scaffolding approach internally uses an alignment-free mapping step to build a ⟨contig, contig⟩ graph using long reads as linking information. Subsequently, this graph is used to generate scaffolds. We present and evaluate a graph-theoretic "wiring" heuristic to perform this scaffolding step. To enable efficient workload management in a parallel setting, we use a batching technique that partitions the scaffolding tasks so that the more expensive alignment-based assembly step at the end can be efficiently parallelized. This step also allows the use of any standalone assembler for generating the final scaffolds. CONCLUSIONS Our experiments with Maptcha on a variety of input genomes, and comparison against two state-of-the-art hybrid scaffolders, demonstrate that Maptcha is able to generate longer and more accurate scaffolds substantially faster.
In almost all cases, the scaffolds produced by Maptcha are at least an order of magnitude longer (in some cases two orders) than the scaffolds produced by state-of-the-art tools. Maptcha runs significantly faster too, reducing time-to-solution from hours to minutes for most input cases. We also performed a coverage experiment by varying the sequencing coverage depth for long reads, which demonstrated the potential of Maptcha to generate significantly longer scaffolds in low coverage settings (1× to 10×).
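The idea of a ⟨contig, contig⟩ graph linked by long reads can be illustrated with a minimal sketch. This is not Maptcha's implementation: the mapping step is assumed to have already produced, for each long read, the contigs it hits, and the names `build_contig_graph`, `read_hits`, and `min_support` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_contig_graph(read_hits, min_support=1):
    """Build a weighted contig-contig graph from read-to-contig hits.

    read_hits: dict mapping a read id to the contig ids that read maps to
               (assumed to come from some upstream mapping step).
    Returns a dict {(contig_a, contig_b): number_of_linking_reads},
    keeping only edges supported by at least `min_support` reads.
    """
    edge_support = Counter()
    for contigs in read_hits.values():
        # Each read that touches two contigs is evidence that they are adjacent.
        for a, b in combinations(sorted(set(contigs)), 2):
            edge_support[(a, b)] += 1
    return {edge: n for edge, n in edge_support.items() if n >= min_support}


# Toy input, invented for illustration only.
read_hits = {
    "read1": ["c1", "c2"],
    "read2": ["c2", "c1", "c3"],
    "read3": ["c4"],
}
graph = build_contig_graph(read_hits, min_support=2)
print(graph)
```

Requiring more than one supporting read per edge is one simple way to suppress spurious links; the abstract does not say how Maptcha weighs or filters edges.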
Affiliation(s)
- Oieswarya Bhowmik
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99164, USA.
| | - Tazin Rahman
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99164, USA
| | - Ananth Kalyanaraman
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99164, USA
|
9
|
Shen Y, Liu N, Wang Z. Recent advances in the culture-independent discovery of natural products using metagenomic approaches. Chin J Nat Med 2024; 22:100-111. [PMID: 38342563 DOI: 10.1016/s1875-5364(24)60585-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Indexed: 02/13/2024]
Abstract
Natural products derived from bacterial sources have long been pivotal in the discovery of drug leads. However, the cultivation of only about 1% of bacteria in laboratory settings has left a significant portion of biosynthetic diversity hidden within the genomes of uncultured bacteria. Advances in sequencing technologies now enable the exploration of genetic material from these metagenomes through culture-independent methods. This approach involves extracting genetic sequences from environmental DNA and applying a hybrid methodology that combines functional screening, sequence tag-based homology screening, and bioinformatic-assisted chemical synthesis. Through this process, numerous valuable natural products have been identified and synthesized from previously uncharted metagenomic territories. This paper provides an overview of the recent advancements in the utilization of culture-independent techniques for the discovery of novel biosynthetic gene clusters and bioactive small molecules within metagenomic libraries.
Affiliation(s)
- Yiping Shen
- Laboratory of Microbial Drug Discovery, China Pharmaceutical University, Nanjing 211198, China
| | - Nan Liu
- Laboratory of Microbial Drug Discovery, China Pharmaceutical University, Nanjing 211198, China
| | - Zongqiang Wang
- Laboratory of Microbial Drug Discovery, China Pharmaceutical University, Nanjing 211198, China.
|
10
|
Zong L, Zhu Y, Jiang Y, Xia Y, Liu Q, Wang J, Gao S, Luo B, Yuan Y, Zhou J, Jiang S. An optimized workflow of full-length transcriptome sequencing for accurate fusion transcript identification. RNA Biol 2024; 21:122-131. [PMID: 39540613 PMCID: PMC11572239 DOI: 10.1080/15476286.2024.2425527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Revised: 10/23/2024] [Accepted: 10/25/2024] [Indexed: 11/16/2024] Open
Abstract
Next-generation sequencing has revolutionized cancer genomics by enabling high-throughput mutation screening, yet detecting fusion genes reliably remains challenging. Long-read sequencing offers potential for accurate fusion transcript identification, though challenges persist. In this study, we present an optimized workflow using nanopore sequencing technology to precisely identify fusion transcripts. Our approach encompasses a tailored library preparation protocol, data processing, and a fusion gene analysis pipeline. We evaluated the performance using Universal Human Reference RNA and human adenocarcinoma cell lines. Our optimized nanopore sequencing workflow generated high-quality full-length transcriptome data characterized by an extended length distribution and comprehensive transcript coverage. Validation experiments confirmed novel fusion events with potential clinical relevance. Our protocol aims to mitigate biases and enhance accuracy, facilitating increased adoption in clinical diagnostics. Continued advancements in long-read sequencing promise deeper insights into fusion gene biology and improved cancer diagnostics.
Affiliation(s)
- Liang Zong
- Department of Biology and Genetics, College of Life Sciences and Health, Wuhan University of Science and Technology, Wuhan, China
- Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
| | - Yabing Zhu
- BGI Tech Solutions Co. Ltd., BGI-Shenzhen, Shenzhen, China
| | - Yuan Jiang
- Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
| | - Ying Xia
- Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
| | - Qun Liu
- Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
| | - Jing Wang
- Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
| | - Song Gao
- Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
| | - Bei Luo
- Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
| | - Yongxian Yuan
- BGI Tech Solutions Co. Ltd., BGI-Shenzhen, Shenzhen, China
| | - Jingjiao Zhou
- Department of Biology and Genetics, College of Life Sciences and Health, Wuhan University of Science and Technology, Wuhan, China
| | - Sanjie Jiang
- BGI Tech Solutions Co. Ltd., BGI-Shenzhen, Shenzhen, China
|
11
|
Kang X, Xu J, Luo X, Schönhuth A. Hybrid-hybrid correction of errors in long reads with HERO. Genome Biol 2023; 24:275. [PMID: 38041098 PMCID: PMC10690975 DOI: 10.1186/s13059-023-03112-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Accepted: 11/16/2023] [Indexed: 12/03/2023] Open
Abstract
Although generally superior, hybrid approaches for correcting errors in third-generation sequencing (TGS) reads, using next-generation sequencing (NGS) reads, mistake haplotype-specific variants for errors in polyploid and mixed samples. We suggest HERO, as the first "hybrid-hybrid" approach, to make use of both de Bruijn graphs and overlap graphs for optimal catering to the particular strengths of NGS and TGS reads. Extensive benchmarking experiments demonstrate that HERO improves indel and mismatch error rates by on average 65% (27–95%) and 20% (4–61%), respectively. Using HERO prior to genome assembly significantly improves the assemblies in the majority of the relevant categories.
Affiliation(s)
- Xiongbin Kang
- College of Biology, Hunan University, Changsha, China
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
| | - Jialu Xu
- College of Biology, Hunan University, Changsha, China
| | - Xiao Luo
- College of Biology, Hunan University, Changsha, China.
| | - Alexander Schönhuth
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany.
|
12
|
Chen J, Xu F. Application of Nanopore Sequencing in the Diagnosis and Treatment of Pulmonary Infections. Mol Diagn Ther 2023; 27:685-701. [PMID: 37563539 PMCID: PMC10590290 DOI: 10.1007/s40291-023-00669-8] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Accepted: 07/18/2023] [Indexed: 08/12/2023]
Abstract
This review provides an in-depth discussion of the development, principles and utility of nanopore sequencing technology and its diverse applications in the identification of various pulmonary pathogens. We examined the emergence and advancements of nanopore sequencing as a significant player in this field. We illustrate the challenges faced in diagnosing mixed infections and further scrutinize the use of nanopore sequencing in the identification of single pathogens, including viruses (with a focus on its use in epidemiology, outbreak investigation, and viral resistance), bacteria (emphasizing 16S targeted sequencing, rare bacterial lung infections, and antimicrobial resistance studies), fungi (employing internal transcribed spacer sequencing), tuberculosis, and atypical pathogens. Furthermore, we discuss the role of nanopore sequencing in metagenomics and its potential for unbiased detection of all pathogens in a clinical setting, emphasizing its advantages in sequencing genome repeat areas and structural variant regions. We discuss the limitations in dealing with host DNA removal, the inherent high error rate of nanopore sequencing technology, along with the complexity of operation and processing, while acknowledging the possibilities provided by recent technological improvements. We compared nanopore sequencing with the BioFire system, a rapid molecular diagnostic system based on polymerase chain reaction. Although the BioFire system serves well for the rapid screening of known and common pathogens, it falls short in the identification of unknown or rare pathogens and in providing comprehensive genome analysis. As technological advancements continue, it is anticipated that the role of nanopore sequencing technology in diagnosing and treating lung infections will become increasingly significant.
Affiliation(s)
- Jie Chen
- Department of Infectious Diseases, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 310009, Zhejiang, China
| | - Feng Xu
- Department of Infectious Diseases, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, 310009, Zhejiang, China.
|
13
|
Yu PL, Fulton JC, Hudson OH, Huguet-Tapia JC, Brawner JT. Next-generation fungal identification using target enrichment and Nanopore sequencing. BMC Genomics 2023; 24:581. [PMID: 37784013 PMCID: PMC10544392 DOI: 10.1186/s12864-023-09691-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Accepted: 09/21/2023] [Indexed: 10/04/2023] Open
Abstract
BACKGROUND Rapid and accurate pathogen identification is required for disease management. Compared to sequencing entire genomes, targeted sequencing may be used to direct sequencing resources to genes of interest for microbe identification and mitigate the low resolution that single-locus molecular identification provides. This work describes a broad-spectrum fungal identification tool developed to focus high-throughput Nanopore sequencing on genes commonly employed for disease diagnostics and phylogenetic inference. RESULTS Orthologs of targeted genes were extracted from 386 reference genomes of fungal species spanning six phyla to identify homologous regions that were used to design the baits used for enrichment. To reduce the cost of producing probes without diminishing the phylogenetic power, DNA sequences were first clustered, and then consensus sequences within each cluster were identified to produce 26,000 probes that targeted 114 genes. To test the efficacy of our probes, we applied the technique to three species representing Ascomycota and Basidiomycota fungi. The efficiency of enrichment, quantified as mean target coverage over the mean genome-wide coverage, ranged from 200 to 300. Furthermore, enrichment of long reads increased the depth of coverage across the targeted genes and into non-coding flanking sequence. The assemblies generated from enriched samples provided well-resolved phylogenetic trees for taxonomic assignment and molecular identification. CONCLUSIONS Our work provides data to support the utility of targeted Nanopore sequencing for fungal identification and provides a platform that may be extended for use with other phytopathogens.
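The enrichment efficiency defined in the abstract, mean target coverage divided by mean genome-wide coverage, can be written out as a short sketch. The function name and the depth values below are illustrative assumptions, not data from the study.

```python
def enrichment_efficiency(target_depths, genome_depths):
    """Return mean target coverage divided by mean genome-wide coverage.

    target_depths: per-base (or per-window) read depths over targeted genes.
    genome_depths: per-base (or per-window) read depths over the whole genome.
    """
    if not target_depths or not genome_depths:
        raise ValueError("both depth lists must be non-empty")
    mean_target = sum(target_depths) / len(target_depths)
    mean_genome = sum(genome_depths) / len(genome_depths)
    if mean_genome == 0:
        raise ValueError("genome-wide coverage is zero; ratio is undefined")
    return mean_target / mean_genome


# Invented depths: targets average 600x, the genome averages 2x.
efficiency = enrichment_efficiency([500, 700, 600], [2, 3, 1, 2, 2])
print(efficiency)  # 300.0
```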
Affiliation(s)
- Pei-Ling Yu
- Department of Plant Pathology, University of Florida, Gainesville, FL, 32611, USA
| | - James C Fulton
- Department of Plant Pathology, University of Florida, Gainesville, FL, 32611, USA
- Florida Department of Agriculture and Consumer Services, Division of Plant Industry, Gainesville, FL, 32608, USA
| | - Owen H Hudson
- Department of Plant Pathology, University of Florida, Gainesville, FL, 32611, USA
| | - Jose C Huguet-Tapia
- Department of Plant Pathology, University of Florida, Gainesville, FL, 32611, USA
| | - Jeremy T Brawner
- Department of Plant Pathology, University of Florida, Gainesville, FL, 32611, USA.
|
14
|
Lu N, Qiao Y, An P, Luo J, Bi C, Li M, Lu Z, Tu J. Exploration of whole genome amplification generated chimeric sequences in long-read sequencing data. Brief Bioinform 2023; 24:bbad275. [PMID: 37529913 DOI: 10.1093/bib/bbad275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Revised: 06/21/2023] [Accepted: 07/10/2023] [Indexed: 08/03/2023] Open
Abstract
MOTIVATION Multiple displacement amplification (MDA) has become the most commonly used method of whole genome amplification, generating a vast amount of DNA with higher molecular weight and greater genome coverage. Coupled with long-read sequencing, it is possible to sequence amplicons of over 20 kb in length. However, the formation of chimeric sequences (chimeras, expressed as structural errors in sequencing data) in MDA seriously interferes with bioinformatics analysis, but its influence on long-read sequencing data is unknown. RESULTS We sequenced the phi29 DNA polymerase-mediated MDA amplicons on the PacBio platform and analyzed chimeras within the generated data. The 3rd-ChimeraMiner has been constructed as a pipeline for recognizing and restoring chimeras into the original structures in long-read sequencing data, improving the efficiency of using TGS data. Five long-read datasets and one high-fidelity long-read dataset with various amplification folds were analyzed. The results reveal that mis-priming events in amplification occur more frequently than widely perceived, and the proportion gradually accumulates from 42% to over 78% as the amplification continues. In total, 99.92% of recognized chimeric sequences were demonstrated to be artifacts, whose structures were wrongly formed in MDA instead of existing in original genomes. By restoring chimeras to their original structures, the vast majority of supplementary alignments that introduce false-positive structural variants are recycled, removing 97% of inversions on average and contributing to the analysis of structural variation in MDA-amplified samples. The impact of chimeras in long-read sequencing data analysis should be emphasized, and the 3rd-ChimeraMiner can help to quantify and reduce the influence of chimeras. AVAILABILITY AND IMPLEMENTATION The 3rd-ChimeraMiner is available on GitHub, https://github.com/dulunar/3rdChimeraMiner.
Affiliation(s)
- Na Lu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
| | - Yi Qiao
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
| | - Pengfei An
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
- Monash University-Southeast University Joint Research Institute, Suzhou 215123, China
| | - Jiajian Luo
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
| | - Changwei Bi
- College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China
| | - Musheng Li
- Department of Physiology and Cell Biology, University of Nevada, Reno School of Medicine, Reno, NV 89511, USA
| | - Zuhong Lu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
| | - Jing Tu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
|
15
|
Liu Z, Du Y, Sun Z, Cheng B, Bi Z, Yao Z, Liang Y, Zhang H, Yao R, Kang S, Shi Y, Wan H, Qin D, Xiang L, Leng L, Chen S. Manual correction of genome annotation improved alternative splicing identification of Artemisia annua. Planta 2023; 258:83. [PMID: 37721598 DOI: 10.1007/s00425-023-04237-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/24/2023] [Accepted: 09/04/2023] [Indexed: 09/19/2023]
Abstract
Gene annotation is essential for genome-based studies. However, algorithm-based genome annotation struggles to reveal genomic information fully and correctly, especially for species with complex genomes. Artemisia annua L. is the only commercial source of artemisinin, though its artemisinin content still needs to be improved. Genome-based genetic modification and breeding are useful strategies to boost artemisinin content and thereby ensure the supply of artemisinin and reduce costs, but better gene annotation is urgently needed. In this study, we manually corrected the newly released genome annotation of A. annua using second- and third-generation transcriptome data. We found that incorrect gene information may lead to differences at the structural, functional, and expression levels compared to the original expectations. We also identified alternative splicing events and found that genome annotation information affected the identification of alternatively spliced genes. We further demonstrated that genome annotation information and alternative splicing could affect gene expression estimation and gene function prediction. Finally, we provided a valuable version of the A. annua genome annotation and demonstrated the importance of gene annotation in future research.
Affiliation(s)
- Zhaoyu Liu
- School of Chinese Materia Medica, Tianjin University of Traditional Chinese Medicine, Tianjin, 300193, China
- Institute of Herbgenomics, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China
| | - Yupeng Du
- College of Life Science, Northeast Forestry University, Harbin, 150040, China
| | - Zhihao Sun
- Institute of Herbgenomics, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China
| | - Bohan Cheng
- Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing, 100700, China
| | - Zenghao Bi
- Institute of Herbgenomics, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China
| | - Zhicheng Yao
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
| | - Yuting Liang
- Institute of Herbgenomics, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China
| | - Huiling Zhang
- College of Horticulture, Sichuan Agricultural University, Chengdu, 611130, China
| | - Run Yao
- Institute of Herbgenomics, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China
| | - Shen Kang
- Institute of Herbgenomics, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China
| | - Yuhua Shi
- Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing, 100700, China
| | - Huihua Wan
- Key Laboratory of Beijing for Identification and Safety Evaluation of Chinese Medicine, Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing, 100700, China
| | - Dou Qin
- Prescription Laboratory of Xinjiang Traditional Uyghur Medicine, Xinjiang Institute of Traditional Uyghur Medicine, Urmuqi, 830000, China
| | - Li Xiang
- Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing, 100700, China.
- Prescription Laboratory of Xinjiang Traditional Uyghur Medicine, Xinjiang Institute of Traditional Uyghur Medicine, Urmuqi, 830000, China.
| | - Liang Leng
- Institute of Herbgenomics, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China.
| | - Shilin Chen
- School of Chinese Materia Medica, Tianjin University of Traditional Chinese Medicine, Tianjin, 300193, China.
- Institute of Herbgenomics, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China.
| |
Collapse
|
16
|
Yang C, Lo T, Nip KM, Hafezqorani S, Warren RL, Birol I. Characterization and simulation of metagenomic nanopore sequencing data with Meta-NanoSim. Gigascience 2023; 12:giad013. [PMID: 36939007 PMCID: PMC10025935 DOI: 10.1093/gigascience/giad013] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/06/2022] [Revised: 01/19/2023] [Accepted: 02/17/2023] [Indexed: 03/21/2023] Open
Abstract
BACKGROUND Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, sequencing platform-specific challenges, including high base-call error rate, nonuniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical algorithms. The use of simulated datasets with characteristics that are true to the sequencing platform under evaluation is a cost-effective way to assess the performance of bioinformatics tools with the ground truth in a controlled environment. RESULTS Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. It improves upon state-of-the-art methods on microbial abundance estimation through a base-level quantification algorithm. Meta-NanoSim can simulate complex microbial communities composed of both linear and circular genomes and can stream reference genomes from online servers directly. Simulated datasets showed high congruence with experimental data in terms of read length, error profiles, and abundance levels. We demonstrate that Meta-NanoSim simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenome assembly benchmarking task. CONCLUSIONS The Meta-NanoSim characterization module investigates read features, including chimeric information and abundance levels, while the simulation module simulates large and complex multisample microbial communities with different abundance profiles. All trained models and the software are freely accessible at GitHub: https://github.com/bcgsc/NanoSim.
Affiliation(s)
- Chen Yang
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Theodora Lo
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Ka Ming Nip
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Saber Hafezqorani
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Inanc Birol
- Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Department of Medical Genetics, University of British Columbia, Life Sciences Centre Room 1364 – 2350 Health Science Mall Vancouver, BC V6T 1Z3, Canada
17
Benchmarking machine learning robustness in Covid-19 genome sequence classification. Sci Rep 2023; 13:4154. [PMID: 36914815 PMCID: PMC10010240 DOI: 10.1038/s41598-023-31368-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2022] [Accepted: 03/10/2023] [Indexed: 03/16/2023]
Abstract
The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome: millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches with different perturbation budgets are more robust (and accurate) than others for specific embedding methods to certain noise simulations on the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
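The idea of perturbing sequences under a fixed error budget can be illustrated with a minimal sketch. It assumes a substitution-only error model with a single uniform rate, which is a simplification: real Illumina and PacBio profiles also include insertions, deletions, and position-dependent rates, and the abstract does not describe the paper's actual perturbation methods. The function names and the toy sequence are hypothetical.

```python
import random

BASES = "ACGT"


def perturb(seq, error_rate, seed=0):
    """Substitute each base independently with probability error_rate.

    Assumption: substitution-only noise with one uniform rate.
    """
    rng = random.Random(seed)  # seeded so a benchmark run is reproducible
    out = []
    for base in seq:
        if rng.random() < error_rate:
            # Pick a different base so every simulated error is a real mismatch.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy fragment for illustration only, not real SARS-CoV-2 data.
genome = "ACGTACGTTAGCCGATAGGCTTAACGGATCCA"
noisy = perturb(genome, error_rate=0.1, seed=42)
print(mismatches(genome, noisy), "substitutions introduced")
```

Running a classifier on `genome` and on `noisy` at several values of `error_rate` would give one crude robustness curve per model.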
18
Mak QXC, Wick RR, Holt JM, Wang JR. Polishing De Novo Nanopore Assemblies of Bacteria and Eukaryotes With FMLRC2. Mol Biol Evol 2023; 40:7069220. [PMID: 36869750 PMCID: PMC10015616 DOI: 10.1093/molbev/msad048] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 10/14/2022] [Revised: 01/20/2023] [Accepted: 02/21/2023] [Indexed: 03/05/2023]
Abstract
As the accuracy and throughput of nanopore sequencing improve, it is increasingly common to perform long-read first de novo genome assemblies followed by polishing with accurate short reads. We briefly introduce FMLRC2, the successor to the original FM-index Long Read Corrector (FMLRC), and illustrate its performance as a fast and accurate de novo assembly polisher for both bacterial and eukaryotic genomes.
Affiliation(s)
- Q X Charles Mak
- Department of Biological Sciences, National University of Singapore, Singapore, Singapore
- Ryan R Wick
- Centre for Pathogen Genomics, University of Melbourne, Melbourne, Australia
- Jeremy R Wang
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
19
Srinivas M, O’Sullivan O, Cotter PD, van Sinderen D, Kenny JG. The Application of Metagenomics to Study Microbial Communities and Develop Desirable Traits in Fermented Foods. Foods 2022; 11:3297. [PMID: 37431045 PMCID: PMC9601669 DOI: 10.3390/foods11203297] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 09/04/2022] [Revised: 10/11/2022] [Accepted: 10/19/2022] [Indexed: 11/18/2022]
Abstract
The microbial communities present within fermented foods are diverse and dynamic, producing a variety of metabolites responsible for the fermentation processes, imparting characteristic organoleptic qualities and health-promoting traits, and maintaining microbiological safety of fermented foods. In this context, it is crucial to study these microbial communities to characterise fermented foods and the production processes involved. High Throughput Sequencing (HTS)-based methods such as metagenomics enable microbial community studies through amplicon and shotgun sequencing approaches. As the field constantly develops, sequencing technologies are becoming more accessible, affordable and accurate with a further shift from short read to long read sequencing being observed. Metagenomics is enjoying wide-spread application in fermented food studies and in recent years is also being employed in concert with synthetic biology techniques to help tackle problems with the large amounts of waste generated in the food sector. This review presents an introduction to current sequencing technologies and the benefits of their application in fermented foods.
Affiliation(s)
- Meghana Srinivas
- Food Biosciences Department, Teagasc Food Research Centre, Moorepark, P61 C996 Cork, Ireland
- APC Microbiome Ireland, University College Cork, T12 CY82 Cork, Ireland
- School of Microbiology, University College Cork, T12 CY82 Cork, Ireland
- Orla O’Sullivan
- Food Biosciences Department, Teagasc Food Research Centre, Moorepark, P61 C996 Cork, Ireland
- APC Microbiome Ireland, University College Cork, T12 CY82 Cork, Ireland
- VistaMilk SFI Research Centre, Fermoy, P61 C996 Cork, Ireland
- Paul D. Cotter
- Food Biosciences Department, Teagasc Food Research Centre, Moorepark, P61 C996 Cork, Ireland
- APC Microbiome Ireland, University College Cork, T12 CY82 Cork, Ireland
- VistaMilk SFI Research Centre, Fermoy, P61 C996 Cork, Ireland
- Douwe van Sinderen
- APC Microbiome Ireland, University College Cork, T12 CY82 Cork, Ireland
- School of Microbiology, University College Cork, T12 CY82 Cork, Ireland
- John G. Kenny
- Food Biosciences Department, Teagasc Food Research Centre, Moorepark, P61 C996 Cork, Ireland
- APC Microbiome Ireland, University College Cork, T12 CY82 Cork, Ireland
- VistaMilk SFI Research Centre, Fermoy, P61 C996 Cork, Ireland
20
Banchi E, Manna V, Fonti V, Fabbro C, Celussi M. Improving environmental monitoring of Vibrionaceae in coastal ecosystems through 16S rRNA gene amplicon sequencing. Environmental Science and Pollution Research International 2022; 29:67466-67482. [PMID: 36056283 PMCID: PMC9492620 DOI: 10.1007/s11356-022-22752-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/18/2022] [Accepted: 08/23/2022] [Indexed: 06/15/2023]
Abstract
The Vibrionaceae family groups genetically and metabolically diverse bacteria thriving in all marine environments. Despite often representing a minor fraction of bacterial assemblages, members of this family can exploit a wide variety of nutritional sources, which makes them important players in biogeochemical dynamics. Furthermore, several Vibrionaceae species are well-known pathogens, posing a threat to human and animal health. Here, we applied phylogenetic placement coupled with a consensus-based approach using 16S rRNA gene amplicon sequencing, aiming to reach a reliable and fine-level Vibrionaceae characterization and identify the dynamics of blooming, ecologically important, and potentially pathogenic species in different sites of the northern Adriatic Sea. Water samples were collected monthly at a Long-Term Ecological Research network site from 2018 to 2021, and in spring and summer of 2019 and 2020 at two sites affected by depurated sewage discharge. The 41 identified Vibrionaceae species generally represented below 1% of the sampled communities; blooms (up to ~11%) mainly formed by Vibrio chagasii and Vibrio owensii occurred in summer, linked to increasing temperature and particulate matter concentration. Pathogenic species such as Vibrio anguillarum, Vibrio tapetis, and Photobacterium damselae were found in low abundance. Depuration plant samples were characterized by a lower abundance and diversity of Vibrionaceae species compared to seawater, highlighting that Vibrionaceae dynamics at sea are unlikely to be related to wastewater inputs. Our work represents a further step to improve the molecular approach based on short reads, toward a shared, updated, and curated phylogeny of the Vibrionaceae family.
Affiliation(s)
- Elisa Banchi
- National Institute of Oceanography and Applied Geophysics - OGS, Via A. Piccard, 54, 34151, Trieste, Italy.
- Vincenzo Manna
- National Institute of Oceanography and Applied Geophysics - OGS, Via A. Piccard, 54, 34151, Trieste, Italy
- Viviana Fonti
- National Institute of Oceanography and Applied Geophysics - OGS, Via A. Piccard, 54, 34151, Trieste, Italy
- Cinzia Fabbro
- National Institute of Oceanography and Applied Geophysics - OGS, Via A. Piccard, 54, 34151, Trieste, Italy
- Mauro Celussi
- National Institute of Oceanography and Applied Geophysics - OGS, Via A. Piccard, 54, 34151, Trieste, Italy
21
Core Genome Multilocus Sequence Typing Scheme for Improved Characterization and Epidemiological Surveillance of Pathogenic Brucella. J Clin Microbiol 2022; 60:e0031122. [PMID: 35852343 PMCID: PMC9387271 DOI: 10.1128/jcm.00311-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Brucellosis poses a significant burden to human and animal health worldwide. Robust and harmonized molecular epidemiological approaches and population studies that include routine disease screening are needed to efficiently track the origin and spread of Brucella strains. Core genome multilocus sequence typing (cgMLST) is a powerful genotyping system commonly used to delineate pathogen transmission routes for disease surveillance and control. Except for Brucella melitensis, cgMLST schemes for Brucella species are currently not established. Here, we describe a novel cgMLST scheme that covers multiple Brucella species. We first determined the phylogenetic breadth of the genus using 612 Brucella genomes. We selected 1,764 genes that were particularly well conserved and typeable in at least 98% of these genomes. We tested the new scheme on 600 genomes and found high agreement with the whole-genome-based single nucleotide polymorphism (SNP) analysis. Next, we applied the scheme to reanalyze the genome of Brucella strains from epidemiologically linked outbreaks. We demonstrated the applicability of the new scheme for high-resolution typing required in outbreak investigations as previously reported with whole-genome SNP methods. We also used the novel scheme to define the global population structure of the genus using 1,322 Brucella genomes. Finally, we demonstrated the possibility of tracing distribution of Brucella strains by performing cluster analysis of cgMLST profiles and found nearly identical cgMLST profiles in different countries. Our results show that sequencing depth of more than 40-fold is optimal for allele calling with this scheme. In summary, this study describes a novel Brucella-wide cgMLST scheme that is applicable in Brucella molecular epidemiology and helps in accurately tracking and thus controlling the sources of infection. 
The scheme is publicly accessible and should represent a valuable resource for laboratories with limited computational resources and bioinformatics expertise.
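The two operations this abstract relies on, selecting loci that are typeable in at least 98% of genomes and comparing cgMLST profiles, can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: a profile is modelled as a dictionary from locus name to allele number, with `None` for a locus that could not be called, and all locus names and allele numbers are hypothetical.

```python
def typeable_loci(profiles, min_fraction=0.98):
    """Return loci that have an allele call in at least min_fraction of genomes."""
    loci = set().union(*(p.keys() for p in profiles))
    keep = []
    for locus in sorted(loci):
        called = sum(1 for p in profiles if p.get(locus) is not None)
        if called / len(profiles) >= min_fraction:
            keep.append(locus)
    return keep


def allele_distance(profile_a, profile_b):
    """Count loci whose allele numbers differ.

    Loci missing in either profile are skipped rather than counted as
    differences (an assumption; schemes differ on how they treat missing data).
    Returns (number_of_differences, number_of_loci_compared).
    """
    diff = 0
    compared = 0
    for locus, allele_a in profile_a.items():
        allele_b = profile_b.get(locus)
        if allele_a is None or allele_b is None:
            continue
        compared += 1
        if allele_a != allele_b:
            diff += 1
    return diff, compared


# Hypothetical profiles for two strains.
strain_1 = {"locus_1": 1, "locus_2": 4, "locus_3": 2}
strain_2 = {"locus_1": 1, "locus_2": 5, "locus_3": None}

print(typeable_loci([strain_1, strain_2]))
print(allele_distance(strain_1, strain_2))
```

Clustering strains by this pairwise distance is one common way to look for near-identical profiles across countries, as the abstract describes.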
22
Ye S, Yu X, Chen H, Zhang Y, Wu Q, Tan H, Song J, Saqib HSA, Farhadi A, Ikhwanuddin M, Ma H. Full-Length Transcriptome Reconstruction Reveals the Genetic Mechanisms of Eyestalk Displacement and Its Potential Implications on the Interspecific Hybrid Crab (Scylla serrata ♀ × S. paramamosain ♂). Biology 2022; 11:biology11071026. [PMID: 36101407 PMCID: PMC9312322 DOI: 10.3390/biology11071026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2022] [Revised: 06/26/2022] [Accepted: 06/27/2022] [Indexed: 11/30/2022]
Abstract
Simple Summary The eyestalk is a key organ in crustaceans that produces neurohormones and regulates a range of physiological functions. Eyestalk displacement was discovered in some first-generation (F1) offspring of the novel interspecific hybrid crab (Scylla serrata ♀ × S. paramamosain ♂). To uncover the genetic mechanism underlying eyestalk displacement and its potential implications, a high-quality transcriptome was reconstructed using single-molecule real-time (SMRT) sequencing. A total of 37 significantly differential alternative splicing (DAS) events (17 up-regulated and 20 down-regulated) and 1475 significantly differentially expressed transcripts (DETs) (492 up-regulated and 983 down-regulated) were detected in hybrid crabs with displaced eyestalks (DH). The most significant DAS events and DETs were annotated as being endoplasmic reticulum chaperone BiP and leucine-rich repeat protein lrrA-like isoform X2. In addition, the top ten significant gene ontology (GO) terms were related to the cuticle or chitin. Overall, this study highlights the underlying genetic mechanisms of eyestalk displacement and provides useful knowledge for mud crab (Scylla spp.) crossbreeding. Abstract The lack of high-quality juvenile crabs is the greatest impediment to the growth of the mud crab (Scylla paramamosain) industry. To obtain high-quality hybrid offspring, a novel hybrid mud crab (S. serrata ♀ × S. paramamosain ♂) was successfully produced in our previous study. Meanwhile, an interesting phenomenon was discovered: the eyestalks of some first-generation (F1) hybrid offspring were displaced during crablet stage I. To uncover the genetic mechanism underlying eyestalk displacement and its potential implications, both single-molecule real-time (SMRT) and Illumina RNA sequencing were implemented. Using a two-step collapsing strategy, three high-quality reconstructed transcriptomes were obtained from purebred mud crabs (S. paramamosain) with normal eyestalks (SPA), hybrid crabs with normal eyestalks (NH), and hybrid crabs with displaced eyestalks (DH). In total, 37 significantly differential alternative splicing (DAS) events (17 up-regulated and 20 down-regulated) and 1475 significantly differentially expressed transcripts (DETs) (492 up-regulated and 983 down-regulated) were detected in DH. The most significant DAS events and DETs were annotated as being endoplasmic reticulum chaperone BiP and leucine-rich repeat protein lrrA-like isoform X2. In addition, the top ten significant GO terms were related to the cuticle or chitin. Overall, high-quality reconstructed transcriptomes were obtained for the novel interspecific hybrid crab and provided valuable insights into the genetic mechanisms of eyestalk displacement in mud crab (Scylla spp.) crossbreeding.
Affiliation(s)
- Shaopan Ye
- Guangdong Provincial Key Laboratory of Marine Biotechnology, Shantou University, Shantou 515063, China; (S.Y.); (X.Y.); (H.C.); (Y.Z.); (Q.W.); (H.T.); (J.S.); (H.S.A.S.); (A.F.)
- STU-UMT Joint Shellfish Research Laboratory, Shantou University, Shantou 515063, China;
- Xiaoyan Yu
- Guangdong Provincial Key Laboratory of Marine Biotechnology, Shantou University, Shantou 515063, China; (S.Y.); (X.Y.); (H.C.); (Y.Z.); (Q.W.); (H.T.); (J.S.); (H.S.A.S.); (A.F.)
- STU-UMT Joint Shellfish Research Laboratory, Shantou University, Shantou 515063, China;
- Huiying Chen
- Guangdong Provincial Key Laboratory of Marine Biotechnology, Shantou University, Shantou 515063, China; (S.Y.); (X.Y.); (H.C.); (Y.Z.); (Q.W.); (H.T.); (J.S.); (H.S.A.S.); (A.F.)
- STU-UMT Joint Shellfish Research Laboratory, Shantou University, Shantou 515063, China;
- Yin Zhang
- Guangdong Provincial Key Laboratory of Marine Biotechnology, Shantou University, Shantou 515063, China; (S.Y.); (X.Y.); (H.C.); (Y.Z.); (Q.W.); (H.T.); (J.S.); (H.S.A.S.); (A.F.)
- STU-UMT Joint Shellfish Research Laboratory, Shantou University, Shantou 515063, China;
- Qingyang Wu
- Guangdong Provincial Key Laboratory of Marine Biotechnology, Shantou University, Shantou 515063, China; (S.Y.); (X.Y.); (H.C.); (Y.Z.); (Q.W.); (H.T.); (J.S.); (H.S.A.S.); (A.F.)
- STU-UMT Joint Shellfish Research Laboratory, Shantou University, Shantou 515063, China;
- Huaqiang Tan
- Guangdong Provincial Key Laboratory of Marine Biotechnology, Shantou University, Shantou 515063, China; (S.Y.); (X.Y.); (H.C.); (Y.Z.); (Q.W.); (H.T.); (J.S.); (H.S.A.S.); (A.F.)
- STU-UMT Joint Shellfish Research Laboratory, Shantou University, Shantou 515063, China;
- Jun Song
- Guangdong Provincial Key Laboratory of Marine Biotechnology, Shantou University, Shantou 515063, China; (S.Y.); (X.Y.); (H.C.); (Y.Z.); (Q.W.); (H.T.); (J.S.); (H.S.A.S.); (A.F.)
- STU-UMT Joint Shellfish Research Laboratory, Shantou University, Shantou 515063, China;
- Hafiz Sohaib Ahmed Saqib
- Guangdong Provincial Key Laboratory of Marine Biotechnology, Shantou University, Shantou 515063, China; (S.Y.); (X.Y.); (H.C.); (Y.Z.); (Q.W.); (H.T.); (J.S.); (H.S.A.S.); (A.F.)
- STU-UMT Joint Shellfish Research Laboratory, Shantou University, Shantou 515063, China;
- Ardavan Farhadi
- Guangdong Provincial Key Laboratory of Marine Biotechnology, Shantou University, Shantou 515063, China; (S.Y.); (X.Y.); (H.C.); (Y.Z.); (Q.W.); (H.T.); (J.S.); (H.S.A.S.); (A.F.)
- STU-UMT Joint Shellfish Research Laboratory, Shantou University, Shantou 515063, China;
- Mhd Ikhwanuddin
- STU-UMT Joint Shellfish Research Laboratory, Shantou University, Shantou 515063, China;
- Institute of Tropical Aquaculture and Fisheries, Universiti Malaysia Terengganu, Kuala Nerus, Terengganu 21030, Malaysia
- Hongyu Ma
- Guangdong Provincial Key Laboratory of Marine Biotechnology, Shantou University, Shantou 515063, China; (S.Y.); (X.Y.); (H.C.); (Y.Z.); (Q.W.); (H.T.); (J.S.); (H.S.A.S.); (A.F.)
- STU-UMT Joint Shellfish Research Laboratory, Shantou University, Shantou 515063, China;
- Institute of Tropical Aquaculture and Fisheries, Universiti Malaysia Terengganu, Kuala Nerus, Terengganu 21030, Malaysia
- Correspondence: ; Tel.: +86-754-86503471
23
Mc Cartney AM, Shafin K, Alonge M, Bzikadze AV, Formenti G, Fungtammasan A, Howe K, Jain C, Koren S, Logsdon GA, Miga KH, Mikheenko A, Paten B, Shumate A, Soto DC, Sović I, Wood JMD, Zook JM, Phillippy AM, Rhie A. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat Methods 2022; 19:687-695. [PMID: 35361931 PMCID: PMC9812399 DOI: 10.1038/s41592-022-01440-3] [Citation(s) in RCA: 63] [Impact Index Per Article: 21.0] [Received: 07/13/2021] [Accepted: 03/04/2022] [Indexed: 01/07/2023]
Abstract
Advances in long-read sequencing technologies and genome assembly methods have enabled the recent completion of the first telomere-to-telomere human genome assembly, which resolves complex segmental duplications and large tandem repeats, including centromeric satellite arrays in a complete hydatidiform mole (CHM13). Although derived from highly accurate sequences, evaluation revealed evidence of small errors and structural misassemblies in the initial draft assembly. To correct these errors, we designed a new repeat-aware polishing strategy that made accurate assembly corrections in large repeats without overcorrection, ultimately fixing 51% of the existing errors and improving the assembly quality value from 70.2 to 73.9 measured from PacBio high-fidelity and Illumina k-mers. By comparing our results to standard automated polishing tools, we outline common polishing errors and offer practical suggestions for genome projects with limited resources. We also show how sequencing biases in both high-fidelity and Oxford Nanopore Technologies reads cause signature assembly errors that can be corrected with a diverse panel of sequencing technologies.
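The change in assembly quality value from 70.2 to 73.9 is easier to read as an error rate. A minimal sketch of the conversion, assuming the quality value is Phred-scaled (QV = -10 log10 of the estimated per-base error probability), which is the usual convention for k-mer-based assembly QV but is not stated in the abstract:

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(p):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)


before = qv_to_error_rate(70.2)
after = qv_to_error_rate(73.9)

print(f"QV 70.2 -> about one error per {1 / before:,.0f} bases")
print(f"QV 73.9 -> about one error per {1 / after:,.0f} bases")
print(f"ratio after/before: {after / before:.2f}")
```

Because the scale is logarithmic, a gain of 3.7 points corresponds to somewhat more than halving the estimated error rate. This k-mer-based estimate is a different measurement from the count of corrected errors reported in the abstract, so the two figures need not match exactly.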
Affiliation(s)
- Ann M Mc Cartney
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA
- Kishwar Shafin
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Michael Alonge
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Andrey V Bzikadze
- Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, La Jolla, CA, USA
- Giulio Formenti
- Laboratory of Neurogenetics of Language and The Vertebrate Genome Lab, The Rockefeller University, New York, NY, USA
- Chirag Jain
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India
- Sergey Koren
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA
- Glennis A Logsdon
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Karen H Miga
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA
- Alla Mikheenko
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Saint Petersburg, Russia
- Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
- Alaina Shumate
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Daniela C Soto
- Genome Center, MIND Institute, Department of Biochemistry and Molecular Medicine, University of California, Davis, CA, USA
- Ivan Sović
- Pacific Biosciences, Menlo Park, CA, USA
- Digital BioLogic d.o.o., Ivanić-Grad, Croatia
- Justin M Zook
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Adam M Phillippy
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA.
- Arang Rhie
- Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI, NIH, Bethesda, MD, USA.
24
Intragenomic variation in nuclear ribosomal markers and its implication in species delimitation, identification and barcoding in fungi. Fungal Biol Rev 2022. [DOI: 10.1016/j.fbr.2022.04.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
25
Stefan CP, Hall AT, Graham AS, Minogue TD. Comparison of Illumina and Oxford Nanopore Sequencing Technologies for Pathogen Detection from Clinical Matrices Using Molecular Inversion Probes. J Mol Diagn 2022; 24:395-405. [PMID: 35085783 DOI: 10.1016/j.jmoldx.2021.12.005] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 08/24/2021] [Revised: 11/19/2021] [Accepted: 12/22/2021] [Indexed: 11/16/2022]
Abstract
Next-generation sequencing is rapidly finding footholds in numerous microbiological fields, including infectious disease diagnostics. Here, we describe a molecular inversion probe panel for the identification of bacterial, viral, and parasitic pathogens. We describe the ability of Illumina and Oxford Nanopore Technologies (ONT) to sequence small amplicons originating from this panel for the identification of pathogens in complex matrices. The panel correctly classified 31 bacterial pathogens directly from positive blood culture bottles with a genus-level concordance of 96.7% and 90.3% on the Illumina and ONT platforms, respectively. Both sequencing platforms detected 18 viral and parasitic organisms directly from mock clinical samples of plasma and whole blood at concentrations of 10^4 PFU/mL with few exceptions. In general, Illumina sequencing exhibited greater read counts with lower percent mapped reads; however, this resulted in no effect on limits of detection compared with ONT sequencing. Mock clinical evaluation of the probe panel on the Illumina and ONT platforms resulted in positive predictive values of 0.91 and 0.88 and negative predictive values of 1 and 1 from de-identified human chikungunya virus samples compared with gold standard quantitative RT-PCR. Overall, these data show that molecular inversion probes are an adaptable technology capable of pathogen detection from complex sample matrices on current next-generation sequencing platforms.
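The positive and negative predictive values quoted above follow directly from a confusion matrix against the reference assay. A small sketch of the calculation; the counts below are hypothetical and chosen only to show the arithmetic, they are not the study's data.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) from confusion-matrix counts.

    PPV = TP / (TP + FP): fraction of positive calls that are truly positive.
    NPV = TN / (TN + FN): fraction of negative calls that are truly negative.
    """
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# Hypothetical counts: sequencing calls scored against a reference assay.
ppv, npv = predictive_values(tp=45, fp=5, tn=48, fn=2)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

An NPV of 1, as reported for both platforms, means no false negatives were observed in the evaluated samples.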
Affiliation(s)
- Christopher P Stefan
- Diagnostic Systems Division, United States Army Medical Research Institute of Infectious Disease, Fort Detrick, Maryland
- Adrienne T Hall
- Diagnostic Systems Division, United States Army Medical Research Institute of Infectious Disease, Fort Detrick, Maryland
- Amanda S Graham
- Diagnostic Systems Division, United States Army Medical Research Institute of Infectious Disease, Fort Detrick, Maryland
- Timothy D Minogue
- Diagnostic Systems Division, United States Army Medical Research Institute of Infectious Disease, Fort Detrick, Maryland.
26
Factors Affecting the Quality of Bacterial Genomes Assemblies by Canu after Nanopore Sequencing. Applied Sciences-Basel 2022. [DOI: 10.3390/app12063110] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/05/2023]
Abstract
Long-read sequencing (LRS), like Oxford Nanopore Technologies, is usually associated with higher error rates compared to previous generations. Factors affecting the assembly quality are the integrity of DNA, the flowcell efficiency, and, not least of all, the raw data processing. Among LRS-intended de novo assemblers, Canu is highly flexible, with its dozens of adjustable parameters. Different Canu parameters were compared for assembling reads of Salmonella enterica ser. Bovismorbificans (genome size of 4.8 Mbp) from three runs on MinION (N50 651, 805, and 5573). Two of them, with low quality and highly fragmented DNA, were not usable alone for assembly, while they were successfully assembled when combining the reads from all experiments. The best results were obtained by modifying Canu parameters related to the error correction, such as corErrorRate (exclusion of overlaps above a set error rate, set to 0.40), corMhapSensitivity (the coarse sensitivity level, set to “high”), corMinCoverage (set to 0 to correct all reads, regardless of overlap length), and corOutCoverage (corrects the longest reads up to the imposed coverage, set to 100). This setting produced two contigs corresponding to the complete sequences of the chromosome and a plasmid. The overall results highlight the importance of a tailored bioinformatic analysis.
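The parameter set reported above can be written out as a Canu command line. The sketch below only builds and prints the command; it does not run Canu. The four `cor*` values and the genome size come from the abstract, while the prefix, output directory, and FASTQ file names are hypothetical. The read-type flag is an assumption that depends on the Canu release: `-nanopore` is used here, and older releases used `-nanopore-raw`.

```python
import shlex


def canu_command(prefix, out_dir, reads, genome_size="4.8m"):
    """Build a Canu invocation with the correction settings from the abstract."""
    return [
        "canu",
        "-p", prefix,                 # assembly name prefix (hypothetical)
        "-d", out_dir,                # output directory (hypothetical)
        f"genomeSize={genome_size}",  # 4.8 Mbp, as stated in the abstract
        "corErrorRate=0.40",          # drop overlaps above this error rate
        "corMhapSensitivity=high",    # coarse sensitivity level
        "corMinCoverage=0",           # correct all reads
        "corOutCoverage=100",         # correct longest reads up to 100x
        "-nanopore", *reads,          # assumption: flag name varies by release
    ]


# Reads from all three MinION runs are combined, as in the abstract.
cmd = canu_command(
    "salmonella", "canu_out", ["run1.fastq", "run2.fastq", "run3.fastq"]
)
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run` would launch the assembly on a machine where Canu is installed.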
27
Sharma A, Jain P, Mahgoub A, Zhou Z, Mahadik K, Chaterji S. Lerna: transformer architectures for configuring error correction tools for short- and long-read genome sequencing. BMC Bioinformatics 2022; 23:25. [PMID: 34991450 PMCID: PMC8734100 DOI: 10.1186/s12859-021-04547-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/07/2021] [Accepted: 12/20/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND Sequencing technologies are prone to errors, making error correction (EC) necessary for downstream applications. EC tools need to be manually configured for optimal performance. We find that the optimal parameters (e.g., k-mer size) are both tool- and dataset-dependent. Moreover, evaluating the performance (i.e., Alignment-rate or Gain) of a given tool usually relies on a reference genome, but quality reference genomes are not always available. We introduce Lerna for the automated configuration of k-mer-based EC tools. Lerna first creates a language model (LM) of the uncorrected genomic reads, and then, based on this LM, calculates a metric called the perplexity metric to evaluate the corrected reads for different parameter choices. Next, it finds the one that produces the highest alignment rate without using a reference genome. The fundamental intuition of our approach is that the perplexity metric is inversely correlated with the quality of the assembly after error correction. Therefore, Lerna leverages the perplexity metric for automated tuning of k-mer sizes without needing a reference genome. RESULTS First, we show that the best k-mer value can vary for different datasets, even for the same EC tool. This motivates our design that automates k-mer size selection without using a reference genome. Second, we show the gains of our LM using its component attention-based transformers. We show the model's estimation of the perplexity metric before and after error correction. The lower the perplexity after correction, the better the k-mer size. We also show that the alignment rate and assembly quality computed for the corrected reads are strongly negatively correlated with the perplexity, enabling the automated selection of k-mer values for better error correction, and hence, improved assembly quality. We validate our approach on both short and long reads. 
Additionally, we show that our attention-based models have significant runtime improvement for the entire pipeline, 18× faster than previous works, due to parallelizing the attention mechanism and the use of JIT compilation for GPU inferencing. CONCLUSION Lerna improves de novo genome assembly by optimizing EC tools. Our code is made available in a public repository at: https://github.com/icanforce/lerna-genomics.
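The selection rule described in this abstract (fit a language model on the uncorrected reads, score each corrected read set by perplexity, keep the setting with the lowest score) can be illustrated with a much simpler model. The sketch below swaps Lerna's attention-based transformer for a character-level n-gram model with add-one smoothing, so it shows the perplexity criterion only, not Lerna's implementation. The reads and the `k=17` and `k=21` labels are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def fit_ngram(reads, order=3):
    """Count (context, next base) pairs; stands in for Lerna's transformer LM."""
    context_counts = Counter()
    ngram_counts = Counter()
    for read in reads:
        for i in range(order, len(read)):
            ctx = read[i - order:i]
            context_counts[ctx] += 1
            ngram_counts[ctx + read[i]] += 1
    return context_counts, ngram_counts, order


def perplexity(reads, model):
    """exp of the mean negative log-probability per predicted base."""
    context_counts, ngram_counts, order = model
    log_prob, n = 0.0, 0
    for read in reads:
        for i in range(order, len(read)):
            ctx = read[i - order:i]
            # Add-one smoothing over the 4-letter alphabet.
            p = (ngram_counts[ctx + read[i]] + 1) / (
                context_counts[ctx] + len(ALPHABET)
            )
            log_prob += math.log(p)
            n += 1
    if n == 0:
        raise ValueError("reads are shorter than the model order")
    return math.exp(-log_prob / n)


def pick_best(candidates, model):
    """Return the parameter label whose corrected reads have the lowest perplexity."""
    return min(candidates, key=lambda label: perplexity(candidates[label], model))


# Hypothetical data: 20 clean reads plus one read carrying a substitution.
clean = "ACGTACGTACGTACGT"
noisy = "ACGTACTTACGTACGT"
uncorrected = [clean] * 20 + [noisy]
model = fit_ngram(uncorrected)

candidates = {
    "k=17": [clean] * 20 + [noisy],  # correction left the error in place
    "k=21": [clean] * 21,            # correction fixed the error
}
print(pick_best(candidates, model))
```

No reference genome is involved at any point, which is the property the abstract emphasizes.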
Affiliation(s)
- Pranjal Jain
- Indian Institute of Technology Bombay, Mumbai, India
28
Bartalucci N, Romagnoli S, Vannucchi AM. A blood drop through the pore: nanopore sequencing in hematology. Trends Genet 2021; 38:572-586. [PMID: 34906378 DOI: 10.1016/j.tig.2021.11.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/02/2021] [Revised: 11/09/2021] [Accepted: 11/15/2021] [Indexed: 10/19/2022]
Abstract
The development of new sequencing platforms, technologies, and bioinformatics tools in the past decade fostered key discoveries in human genomics. Among the most recent sequencing technologies, nanopore sequencing (NS) has caught the interest of researchers for its intriguing potential and flexibility. This up-to-date review highlights the recent application of NS in the hematology field, focusing on progress and challenges of the technological approaches employed for the identification of pathologic alterations. The molecular and analytic pipelines developed for the analysis of the whole-genome, target regions, and transcriptomics provide a proof of evidence of the unparalleled amount of information that could be retrieved by an innovative approach based on long-read sequencing.
Affiliation(s)
- Niccolò Bartalucci
- CRIMM, Center of Research and Innovation of Myeloproliferative Neoplasms, Careggi University Hospital and Department of Experimental and Clinical Medicine, University of Florence, DENOTHE Excellence Center, Florence, Italy
| | - Simone Romagnoli
- CRIMM, Center of Research and Innovation of Myeloproliferative Neoplasms, Careggi University Hospital and Department of Experimental and Clinical Medicine, University of Florence, DENOTHE Excellence Center, Florence, Italy
| | - Alessandro Maria Vannucchi
- CRIMM, Center of Research and Innovation of Myeloproliferative Neoplasms, Careggi University Hospital and Department of Experimental and Clinical Medicine, University of Florence, DENOTHE Excellence Center, Florence, Italy.
|
29
|
Chen Z, He X. Application of third-generation sequencing in cancer research. MEDICAL REVIEW (BERLIN, GERMANY) 2021; 1:150-171. [PMID: 37724303 PMCID: PMC10388785 DOI: 10.1515/mr-2021-0013] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/14/2021] [Accepted: 08/09/2021] [Indexed: 09/20/2023]
Abstract
In the past several years, nanopore sequencing technology from Oxford Nanopore Technologies (ONT) and single-molecule real-time (SMRT) sequencing technology from Pacific BioSciences (PacBio) have become available to researchers and are currently being tested for cancer research. These methods offer many advantages over the most widely used high-throughput short-read sequencing approaches and allow the comprehensive analysis of transcriptomes by identifying full-length splice isoforms and several other posttranscriptional events. In addition, these platforms enable structural variation characterization at a previously unparalleled resolution and direct detection of epigenetic marks in native DNA and RNA. Here, we present a comprehensive summary of important applications of these technologies in cancer research, including the identification of complex structural variants, alternatively spliced isoforms, fusion transcript events, and exogenous RNA. Furthermore, we discuss the impact of the newly developed nanopore direct RNA sequencing (RNA-Seq) approach in advancing epitranscriptome research in cancer. Although unique challenges remain for these new single-molecule long-read methods, they will unravel many aspects of cancer genome complexity in unprecedented ways and present an encouraging outlook for continued application in an increasing number of different cancer research settings.
Affiliation(s)
- Zhiao Chen
- Fudan University Shanghai Cancer Center and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, China
| | - Xianghuo He
- Fudan University Shanghai Cancer Center and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, China
- Key Laboratory of Breast Cancer in Shanghai, Fudan University Shanghai Cancer Center, Fudan University, Shanghai, China
|
30
|
Comprehensive characterization of copy number variation (CNV) called from array, long- and short-read data. BMC Genomics 2021; 22:826. [PMID: 34789167 PMCID: PMC8596897 DOI: 10.1186/s12864-021-08082-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 06/09/2021] [Accepted: 10/13/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND SNP arrays and short- and long-read genome sequencing are genome-wide high-throughput technologies that may be used to assay copy number variants (CNVs) in a personal genome. Each of these technologies comes with its own limitations and biases, many of which are well-known, but not all of them are thoroughly quantified. RESULTS We assembled an ensemble of public datasets of published CNV calls and raw data for the well-studied Genome in a Bottle individual NA12878. This assembly represents a variety of methods and pipelines used for CNV calling from array, short- and long-read technologies. We then performed cross-technology comparisons regarding their ability to call CNVs. Unlike other studies, we refrained from using a gold standard. Instead, we attempted to validate the CNV calls by the raw data of each technology. CONCLUSIONS Our study confirms that long-read platforms enable recalling CNVs in genomic regions inaccessible to arrays or short reads. We also found that the reproducibility of a CNV by different pipelines within each technology is strongly linked to other CNV evidence measures. Importantly, the three technologies show distinct public database frequency profiles, which differ depending on what technology the database was built on.
|
31
|
Wu C, Yin Y, Zhu L, Zhang Y, Li YZ. Metagenomic sequencing-driven multidisciplinary approaches to shed light on the untapped microbial natural products. Drug Discov Today 2021; 27:730-742. [PMID: 34775105 DOI: 10.1016/j.drudis.2021.11.008] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/03/2020] [Revised: 10/07/2021] [Accepted: 11/08/2021] [Indexed: 11/17/2022]
Abstract
The advantage of metagenomics over the culture-based natural product (NP) discovery pipeline is the ability to access the biosynthetic potential of uncultivable microbes. Advances in DNA sequencing are revolutionizing conventional metagenomics approaches for microbial NP discovery. The genomes of (in)cultivable bugs can be resolved straightforwardly from environmental samples, enabling in situ prediction of biosynthetic gene clusters (BGCs). The predicted chemical diversities could be realized not only by heterologous expression of gene clusters originating from DNA synthesis or direct cloning, but also potentially by bioinformatic-directed organic synthesis or chemoenzymatic total synthesis. In this review, we suggest that metagenomic sequencing in tandem with multidisciplinary approaches will form a versatile platform to shed light on a plethora of microbial 'dark matter'.
Affiliation(s)
- Changsheng Wu
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao 266237, China.
| | - Yizhen Yin
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao 266237, China
| | - Lele Zhu
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao 266237, China
| | - Youming Zhang
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao 266237, China
| | - Yue-Zhong Li
- State Key Laboratory of Microbial Technology, Institute of Microbial Technology, Shandong University, Qingdao 266237, China.
|
32
|
Sacristán-Horcajada E, González-de la Fuente S, Peiró-Pastor R, Carrasco-Ramiro F, Amils R, Requena JM, Berenguer J, Aguado B. ARAMIS: From systematic errors of NGS long reads to accurate assemblies. Brief Bioinform 2021; 22:bbab170. [PMID: 34013348 PMCID: PMC8574707 DOI: 10.1093/bib/bbab170] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 12/18/2020] [Revised: 03/31/2021] [Accepted: 04/11/2021] [Indexed: 01/23/2023] Open
Abstract
NGS long-reads sequencing technologies (or third generation) such as Pacific BioSciences (PacBio) have revolutionized the sequencing field over the last decade, improving multiple genomic applications like de novo genome assemblies. However, their error rate, mostly involving insertions and deletions (indels), is currently an important concern that requires special attention to be solved. Multiple algorithms are available to fix these sequencing errors using short reads (such as Illumina), although they require long processing times and some errors may persist. Here, we present Accurate long-Reads Assembly correction Method for Indel errorS (ARAMIS), the first NGS long-reads indels correction pipeline that combines several correction tools in just one step using accurate short reads. As a proof of concept, six organisms were selected based on their different GC content, size and genome complexity, and their PacBio-assembled genomes were corrected thoroughly by this pipeline. We found that the presence of systematic sequencing errors affecting homopolymeric regions in long-read PacBio sequences, and the type of indel error introduced during PacBio sequencing, are related to the GC content of the organism. Because this fact is not widely known, numerous published studies contain such errors, which should be resolved since they may convey incorrect biological information. ARAMIS yields better results with fewer computational resources than other correction tools and makes it possible to detect the nature of the indel errors found and their distribution along the genome. The source code of ARAMIS is available at https://github.com/genomics-ngsCBMSO/ARAMIS.git.
Affiliation(s)
| | | | - R Peiró-Pastor
- Centro de Biología Molecular Severo Ochoa (CBMSO) (CSIC-UAM), Madrid, Spain
| | - F Carrasco-Ramiro
- Centro de Biología Molecular Severo Ochoa (CBMSO) (CSIC-UAM), Madrid, Spain
| | - R Amils
- Centro de Biología Molecular Severo Ochoa (CBMSO) (CSIC-UAM), Madrid, Spain
| | - J M Requena
- Centro de Biología Molecular Severo Ochoa (CBMSO) (CSIC-UAM), Madrid, Spain
| | - J Berenguer
- Centro de Biología Molecular Severo Ochoa (CBMSO) (CSIC-UAM), Madrid, Spain
| | - B Aguado
- Centro de Biología Molecular Severo Ochoa (CBMSO) (CSIC-UAM), Madrid, Spain
|
33
|
Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol 2021; 39:1348-1365. [PMID: 34750572 PMCID: PMC8988251 DOI: 10.1038/s41587-021-01108-x] [Citation(s) in RCA: 804] [Impact Index Per Article: 201.0] [Received: 12/09/2019] [Accepted: 09/22/2021] [Indexed: 12/13/2022]
Abstract
Rapid advances in nanopore technologies for sequencing single long DNA and RNA molecules have led to substantial improvements in accuracy, read length and throughput. These breakthroughs have required extensive development of experimental and bioinformatics methods to fully exploit nanopore long reads for investigations of genomes, transcriptomes, epigenomes and epitranscriptomes. Nanopore sequencing is being applied in genome assembly, full-length transcript detection and base modification detection and in more specialized areas, such as rapid clinical diagnoses and outbreak surveillance. Many opportunities remain for improving data quality and analytical approaches through the development of new nanopores, base-calling methods and experimental protocols tailored to particular applications.
Affiliation(s)
- Yunhao Wang
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| | - Yue Zhao
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
- Biomedical Informatics Shared Resources, The Ohio State University, Columbus, OH, USA
| | - Audrey Bollas
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| | - Yuru Wang
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA
| | - Kin Fai Au
- Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA.
- Biomedical Informatics Shared Resources, The Ohio State University, Columbus, OH, USA.
|
34
|
Delahaye C, Nicolas J. Sequencing DNA with nanopores: Troubles and biases. PLoS One 2021; 16:e0257521. [PMID: 34597327 PMCID: PMC8486125 DOI: 10.1371/journal.pone.0257521] [Citation(s) in RCA: 253] [Impact Index Per Article: 63.3] [Received: 02/23/2021] [Accepted: 09/06/2021] [Indexed: 12/03/2022] Open
Abstract
Oxford Nanopore Technologies' (ONT) long read sequencers offer access to longer DNA fragments than previous sequencer generations, at the cost of a higher error rate. While many papers have studied read correction methods, few have addressed the detailed characterization of observed errors, a task complicated by frequent changes in chemistry and software in ONT technology. The MinION sequencer is now more stable and this paper proposes an up-to-date view of its error landscape, using the most mature flowcell and basecaller. We studied Nanopore sequencing error biases on both bacterial and human DNA reads. We found that, although Nanopore sequencing is expected not to suffer from GC bias, it is a crucial parameter with respect to errors. In particular, low-GC reads have fewer errors than high-GC reads (about 6% and 8% respectively). The error profile for homopolymeric regions or regions with short repeats, the source of about half of all sequencing errors, also depends on the GC rate and mainly shows deletions, although there are some reads with long insertions. Another interesting finding is that the quality measure, although over-estimated, offers valuable information to predict the error rate as well as the abundance of reads. We supplemented this study with an analysis of a rapeseed RNA read set and showed a higher level of errors with a higher level of deletion in these data. Finally, we have implemented an open source pipeline for long-term monitoring of the error profile, which enables users to easily compute the various analyses presented in this work, including for future developments of the sequencing device. Overall, we hope this work will provide a basis for the design of better error-correction methods.
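The per-read quantities this abstract relates to error rate (GC content, homopolymer runs, and the quality measure) are simple to compute. The sketch below is illustrative only and is not taken from the authors' pipeline; the function names and the `min_len` threshold are assumptions, and the quality conversion is the standard Phred relation p = 10^(-Q/10).

```python
import re


def gc_fraction(seq):
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def homopolymer_runs(seq, min_len=3):
    """Return (start, run) for every single-base run of at least min_len."""
    runs = []
    for match in re.finditer(r"A+|C+|G+|T+", seq.upper()):
        if len(match.group(0)) >= min_len:
            runs.append((match.start(), match.group(0)))
    return runs


def expected_error_rate(phred_scores):
    """Mean per-base error probability implied by Phred quality scores."""
    if not phred_scores:
        return 0.0
    return sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
```

Binning reads by `gc_fraction` and comparing `expected_error_rate` against an alignment-derived error rate is one way to check how far the reported qualities are over-estimated, as the abstract notes.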
|
35
|
Lima L, Marchet C, Caboche S, Da Silva C, Istace B, Aury JM, Touzet H, Chikhi R. Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data. Brief Bioinform 2021; 21:1164-1181. [PMID: 31232449 DOI: 10.1093/bib/bbz058] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 12/28/2018] [Revised: 04/05/2019] [Accepted: 04/22/2019] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION Nanopore long-read sequencing technology offers promising alternatives to high-throughput short read sequencing, especially in the context of RNA-sequencing. However, this technology is currently hindered by high error rates in the output data that affect analyses such as the identification of isoforms, exon boundaries, open reading frames and creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed and options for the error correction of Nanopore RNA-sequencing long reads remain limited. RESULTS In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that reports not only classical error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform and splice site detection. We find that long read error correction tools that were originally developed for DNA are also suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy. Yet investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work provides guidelines on which (or whether) error correction tools should be used, depending on the application type. BENCHMARKING SOFTWARE https://gitlab.com/leoisl/LR_EC_analyser.
Affiliation(s)
- Leandro Lima
- Univ Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Evolutive UMR Villeurbanne, France.,EPI ERABLE - Inria Grenoble, Rhône-Alpes, France.,Università di Roma 'Tor Vergata', Roma, Italy
| | | | - Ségolène Caboche
- Université de Lille, CNRS, Inserm, CHU Lille, Institut Pasteur de Lille, UMR, Center for Infection and Immunity of Lille, Lille, France
| | - Corinne Da Silva
- Genoscope, Institut de biologie Francois-Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France
| | - Benjamin Istace
- Genoscope, Institut de biologie Francois-Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France
| | - Jean-Marc Aury
- Genoscope, Institut de biologie Francois-Jacob, Commissariat à l'Energie Atomique (CEA), Université Paris-Saclay, Evry, France
| | - Hélène Touzet
- CNRS, Université de Lille, CRIStAL UMR, Lille, France
| | - Rayan Chikhi
- CNRS, Université de Lille, CRIStAL UMR, Lille, France.,Institut Pasteur, C3BI - USR 3756, 25-28 rue du Docteur Roux, Paris, France
|
36
|
Chang T, An B, Liang M, Duan X, Du L, Cai W, Zhu B, Gao X, Chen Y, Xu L, Zhang L, Gao H, Li J. PacBio Single-Molecule Long-Read Sequencing Provides New Light on the Complexity of Full-Length Transcripts in Cattle. Front Genet 2021; 12:664974. [PMID: 34527015 PMCID: PMC8437344 DOI: 10.3389/fgene.2021.664974] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/06/2021] [Accepted: 08/06/2021] [Indexed: 12/02/2022] Open
Abstract
Cattle (Bos taurus) is one of the most widely distributed livestock species in the world, and provides us with high-quality milk and meat which have a huge impact on the quality of human life. Therefore, accurate and complete transcriptome and genome annotation are of great value to the research of cattle breeding. In this study, we used error-corrected PacBio single-molecule real-time (SMRT) data to perform whole-transcriptome profiling in cattle. Then, 22.5 Gb of subreads was generated, including 381,423 circular consensus sequences (CCSs), among which 276,295 full-length non-chimeric (FLNC) sequences were identified. After correction by Illumina short reads, we obtained 22,353 error-corrected isoforms. A total of 305 alternative splicing (AS) events and 3,795 alternative polyadenylation (APA) sites were detected by transcriptome structural analysis. Furthermore, we identified 457 novel genes, 120 putative transcription factors (TFs), and 569 novel long non-coding RNAs (lncRNAs). Taken together, this research improves our understanding and provides new insights into the complexity of full-length transcripts in cattle.
Affiliation(s)
- Tianpeng Chang
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Bingxing An
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Mang Liang
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Xinghai Duan
- College of Animal Science and Technology, Southwest University, Chongqing, China
| | - Lili Du
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Wentao Cai
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Bo Zhu
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Xue Gao
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Yan Chen
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Lingyang Xu
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Lupei Zhang
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Huijiang Gao
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Junya Li
- Laboratory of Molecular Biology and Bovine Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
|
37
|
Paloi S, Mhuantong W, Luangsa-ard JJ, Kobmoo N. Using High-Throughput Amplicon Sequencing to Evaluate Intragenomic Variation and Accuracy in Species Identification of Cordyceps Species. J Fungi (Basel) 2021; 7:767. [PMID: 34575804 PMCID: PMC8467230 DOI: 10.3390/jof7090767] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/09/2021] [Revised: 09/06/2021] [Accepted: 09/10/2021] [Indexed: 12/30/2022] Open
Abstract
While recent sequencing technologies (third generation sequencing) can successfully sequence all copies of nuclear ribosomal DNA (rDNA) markers present within a genome and offer insights into the intragenomic variation of these markers, high intragenomic variation can be a source of confusion for high-throughput species identification using such technologies. High-throughput (HT) amplicon sequencing via PacBio SEQUEL I was used to evaluate the intragenomic variation of the ITS region and D1-D2 LSU domains in nine Cordyceps species, and the accuracy of such technology to identify these species based on molecular phylogenies was also assessed. PacBio sequences within strains showed variable levels of intragenomic variation among the studied Cordyceps species, with C. blackwelliae showing greater variation than the others. Some variants from a mix of species clustered together outside their respective species of origin, indicative of intragenomic variation that escaped concerted evolution shared between species. Proper selection of consensus sequences from HT amplicon sequencing is a challenge for interpretation of correct species identification. PacBio consensus sequences with the highest number of reads represent the major variants within a genome and gave the best results in terms of species identification.
Affiliation(s)
| | | | | | - Noppol Kobmoo
- National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Development Agency (NSTDA), 113 Thailand Science Park, Phahonuyothin Rd., Khlong Nueng, Khlong Luang, Pathum Thani 12120, Thailand; (S.P.); (W.M.); (J.J.L.)
|
38
|
Galanti L, Shasha D, Gunsalus KC. Pheniqs 2.0: accurate, high-performance Bayesian decoding and confidence estimation for combinatorial barcode indexing. BMC Bioinformatics 2021; 22:359. [PMID: 34215187 PMCID: PMC8254269 DOI: 10.1186/s12859-021-04267-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/04/2021] [Accepted: 06/14/2021] [Indexed: 11/24/2022] Open
Abstract
Background Systems biology increasingly relies on deep sequencing with combinatorial index tags to associate biological sequences with their sample, cell, or molecule of origin. Accurate data interpretation depends on the ability to classify sequences based on correct decoding of these combinatorial barcodes. The probability of correct decoding is influenced by both sequence quality and the number and arrangement of barcodes. The rising complexity of experimental designs calls for a probability model that accounts for both sequencing errors and random noise, generalizes to multiple combinatorial tags, and can handle any barcoding scheme. The needs for reproducibility and community benchmark standards demand a peer-reviewed tool that preserves decoding quality scores and provides tunable control over classification confidence that balances precision and recall. Moreover, continuous improvements in sequencing throughput require a fast, parallelized and scalable implementation. Results and discussion We developed a flexible, robustly engineered software that performs probabilistic decoding and supports arbitrarily complex barcoding designs. Pheniqs computes the full posterior decoding error probability of observed barcodes by consulting basecalling quality scores and prior distributions, and reports sequences and confidence scores in Sequence Alignment/Map (SAM) fields. The product of posteriors for multiple independent barcodes provides an overall confidence score for each read. Pheniqs achieves greater accuracy than minimum edit distance or simple maximum likelihood estimation, and it scales linearly with core count to enable the classification of > 11 billion reads in 1 h 15 m using < 50 megabytes of memory. Pheniqs has been in production use for seven years in our genomics core facility. 
Conclusion We introduce computationally efficient software that implements both probabilistic and minimum distance decoders and show that decoding barcodes using posterior probabilities is more accurate than available methods. Pheniqs allows fine-tuning of decoding sensitivity using intuitive confidence thresholds and is extensible with alternative decoders and new error models. Any arbitrary arrangement of barcodes is easily configured, enabling computation of combinatorial confidence scores for any barcoding strategy. An optimized multithreaded implementation assures that Pheniqs is faster and scales better with complex barcode sets than existing tools. Support for POSIX streams and multiple sequencing formats enables easy integration with automated analysis pipelines. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04267-5.
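The posterior decoding idea in this abstract can be sketched generically. The code below is not Pheniqs's implementation or API: it assumes a simple error model in which a miscalled base is equally likely to be any of the other three, a uniform prior over barcodes, and a hypothetical `noise_prior` for reads that match no barcode. All function names are illustrative.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def likelihood(observed, qualities, barcode):
    """P(observed | barcode) under an assumed uniform-substitution error model."""
    p = 1.0
    for obs, q, true in zip(observed, qualities, barcode):
        e = phred_to_error(q)
        p *= (1 - e) if obs == true else e / 3
    return p


def decode(observed, qualities, barcodes, noise_prior=0.01):
    """Return the most probable barcode and its posterior probability."""
    prior = (1 - noise_prior) / len(barcodes)  # uniform prior over barcodes
    # Random noise: every sequence of this length is assumed equally likely.
    noise = noise_prior * 0.25 ** len(observed)
    joint = {b: prior * likelihood(observed, qualities, b) for b in barcodes}
    total = sum(joint.values()) + noise
    best = max(joint, key=joint.get)
    return best, joint[best] / total


def combined_confidence(posteriors):
    """Product of posteriors for independently decoded barcode segments."""
    return math.prod(posteriors)
```

A read would then be accepted only if its combined confidence exceeds a chosen threshold, which is the precision and recall trade-off the abstract describes.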
Affiliation(s)
- Lior Galanti
- Department of Biology, Center for Genomics and System Biology, New York University, New York, USA.,NYU Abu Dhabi Center for Genomics and System Biology, New York University, Abu Dhabi, United Arab Emirates
| | - Dennis Shasha
- Department of Computer Science, Courant Institute, New York University, New York, USA
| | - Kristin C Gunsalus
- Department of Biology, Center for Genomics and System Biology, New York University, New York, USA. .,NYU Abu Dhabi Center for Genomics and System Biology, New York University, Abu Dhabi, United Arab Emirates.
|
39
|
Murigneux V, Roberts LW, Forde BM, Phan MD, Nhu NTK, Irwin AD, Harris PNA, Paterson DL, Schembri MA, Whiley DM, Beatson SA. MicroPIPE: validating an end-to-end workflow for high-quality complete bacterial genome construction. BMC Genomics 2021; 22:474. [PMID: 34172000 PMCID: PMC8235852 DOI: 10.1186/s12864-021-07767-z] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 03/09/2021] [Accepted: 06/03/2021] [Indexed: 11/23/2022] Open
Abstract
Background Oxford Nanopore Technology (ONT) long-read sequencing has become a popular platform for microbial researchers due to the accessibility and affordability of its devices. However, easy and automated construction of high-quality bacterial genomes using nanopore reads remains challenging. Here we aimed to create a reproducible end-to-end bacterial genome assembly pipeline using ONT in combination with Illumina sequencing. Results We evaluated the performance of several popular tools used during genome reconstruction, including base-calling, filtering, assembly, and polishing. We also assessed overall genome accuracy using ONT both natively and with Illumina. All steps were validated using the high-quality complete reference genome for the Escherichia coli sequence type (ST)131 strain EC958. Software chosen at each stage were incorporated into our final pipeline, MicroPIPE. Further validation of MicroPIPE was carried out using 11 additional ST131 E. coli isolates, which demonstrated that complete circularised chromosomes and plasmids could be achieved without manual intervention. Twelve publicly available Gram-negative and Gram-positive bacterial genomes (with available raw ONT data and matched complete genomes) were also assembled using MicroPIPE. We found that revised basecalling and updated assembly of the majority of these genomes resulted in improved accuracy compared to the current publicly available complete genomes. Conclusions MicroPIPE is built in modules using Singularity container images and the bioinformatics workflow manager Nextflow, allowing changes and adjustments to be made in response to future tool development. Overall, MicroPIPE provides an easy-access, end-to-end solution for attaining high-quality bacterial genomes. MicroPIPE is available at https://github.com/BeatsonLab-MicrobialGenomics/micropipe. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07767-z.
Affiliation(s)
- Valentine Murigneux
- QCIF Facility for Advanced Bioinformatics, Institute for Molecular Bioscience, The University of Queensland, Brisbane, Queensland, Australia
| | - Leah W Roberts
- University of Queensland Centre for Clinical Research, Brisbane, Queensland, Australia. .,Queensland Children's Hospital, Brisbane, Queensland, Australia. .,European Bioinformatics Institute, European Molecular Biology Laboratory (EMBL), Hinxton, Cambridge, UK.
| | - Brian M Forde
- University of Queensland Centre for Clinical Research, Brisbane, Queensland, Australia
| | - Minh-Duy Phan
- School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia
| | - Nguyen Thi Khanh Nhu
- School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia
| | - Adam D Irwin
- University of Queensland Centre for Clinical Research, Brisbane, Queensland, Australia.,Queensland Children's Hospital, Brisbane, Queensland, Australia
| | - Patrick N A Harris
- University of Queensland Centre for Clinical Research, Brisbane, Queensland, Australia.,Central Microbiology, Pathology Queensland, Royal Brisbane & Women's Hospital, Brisbane, Queensland, Australia
| | - David L Paterson
- University of Queensland Centre for Clinical Research, Brisbane, Queensland, Australia
| | - Mark A Schembri
- School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia
| | - David M Whiley
- University of Queensland Centre for Clinical Research, Brisbane, Queensland, Australia.,Queensland Children's Hospital, Brisbane, Queensland, Australia
| | - Scott A Beatson
- School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia. .,Australian Centre for Ecogenomics, The University of Queensland, Brisbane, Queensland, Australia.
| |
Collapse
|
40
|
Tvedte ES, Gasser M, Sparklin BC, Michalski J, Hjelmen CE, Johnston JS, Zhao X, Bromley R, Tallon LJ, Sadzewicz L, Rasko DA, Dunning Hotopp JC. Comparison of long-read sequencing technologies in interrogating bacteria and fly genomes. G3 (BETHESDA, MD.) 2021; 11:jkab083. [PMID: 33768248 PMCID: PMC8495745 DOI: 10.1093/g3journal/jkab083] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 02/09/2021] [Accepted: 03/07/2021] [Indexed: 12/14/2022]
Abstract
The newest generation of DNA sequencing technology is highlighted by the ability to generate sequence reads hundreds of kilobases in length. Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have pioneered competitive long read platforms, with more recent work focused on improving sequencing throughput and per-base accuracy. We used whole-genome sequencing data produced by three PacBio protocols (Sequel II CLR, Sequel II HiFi, RS II) and two ONT protocols (Rapid Sequencing and Ligation Sequencing) to compare assemblies of the bacteria Escherichia coli and the fruit fly Drosophila ananassae. In both organisms tested, Sequel II assemblies had the highest consensus accuracy, even after accounting for differences in sequencing throughput. ONT and PacBio CLR had the longest reads sequenced compared to PacBio RS II and HiFi, and genome contiguity was highest when assembling these datasets. ONT Rapid Sequencing libraries had the fewest chimeric reads in addition to superior quantification of E. coli plasmids versus ligation-based libraries. The quality of assemblies can be enhanced by adopting hybrid approaches using Illumina libraries for bacterial genome assembly or polishing eukaryotic genome assemblies, and an ONT-Illumina hybrid approach would be more cost-effective for many users. Genome-wide DNA methylation could be detected using both technologies, however ONT libraries enabled the identification of a broader range of known E. coli methyltransferase recognition motifs in addition to undocumented D. ananassae motifs. The ideal choice of long read technology may depend on several factors including the question or hypothesis under examination. No single technology outperformed others in all metrics examined.
Affiliation(s)
- Eric S Tvedte
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Mark Gasser
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Benjamin C Sparklin
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Jane Michalski
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Carl E Hjelmen
  - Department of Biology, Texas A&M University, College Station, TX 77843, USA
- J Spencer Johnston
  - Department of Entomology, Texas A&M University, College Station, TX 77843, USA
- Xuechu Zhao
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Robin Bromley
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Luke J Tallon
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Lisa Sadzewicz
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- David A Rasko
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Julie C Dunning Hotopp
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
  - Greenebaum Cancer Center, University of Maryland School of Medicine, Baltimore, MD 21201, USA
|
41
|
Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes. Nat Methods 2021; 18:507-519. [PMID: 33963355 PMCID: PMC8148648 DOI: 10.1038/s41592-021-01128-0] [Citation(s) in RCA: 89] [Impact Index Per Article: 22.3] [Received: 10/02/2020] [Accepted: 03/26/2021] [Indexed: 02/03/2023]
Abstract
RNA-binding proteins (RBPs) are critical regulators of gene expression and RNA processing that are required for gene function. Yet the dynamics of RBP regulation in single cells is unknown. To address this gap in understanding, we developed STAMP (Surveying Targets by APOBEC-Mediated Profiling), which efficiently detects RBP-RNA interactions. STAMP does not rely on ultraviolet cross-linking or immunoprecipitation and, when coupled with single-cell capture, can identify RBP-specific and cell-type-specific RNA-protein interactions for multiple RBPs and cell types in single, pooled experiments. Pairing STAMP with long-read sequencing yields RBP target sites in an isoform-specific manner. Finally, Ribo-STAMP leverages small ribosomal subunits to measure transcriptome-wide ribosome association in single cells. STAMP enables the study of RBP-RNA interactomes and translational landscapes with unprecedented cellular resolution.
|
42
|
Du N, Shang J, Sun Y. Improving protein domain classification for third-generation sequencing reads using deep learning. BMC Genomics 2021; 22:251. [PMID: 33836667 PMCID: PMC8033682 DOI: 10.1186/s12864-021-07468-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/28/2020] [Accepted: 02/19/2021] [Indexed: 12/21/2022]
Abstract
BACKGROUND With the development of third-generation sequencing (TGS) technologies, people are able to obtain DNA sequences with lengths from 10s to 100s of kb. These long reads allow protein domain annotation without assembly, thus can produce important insights into the biological functions of the underlying data. However, the high error rate in TGS data raises a new challenge to established domain analysis pipelines. The state-of-the-art methods are not optimized for noisy reads and have shown unsatisfactory accuracy of domain classification in TGS data. New computational methods are still needed to improve the performance of domain prediction in long noisy reads. RESULTS In this work, we introduce ProDOMA, a deep learning model that conducts domain classification for TGS reads. It uses deep neural networks with 3-frame translation encoding to learn conserved features from partially correct translations. In addition, we formulate our problem as an open-set problem and thus our model can reject reads not containing the targeted domains. In the experiments on simulated long reads of protein coding sequences and real TGS reads from the human genome, our model outperforms HMMER and DeepFam on protein domain classification. CONCLUSIONS In summary, ProDOMA is a useful end-to-end protein domain analysis tool for long noisy reads without relying on error correction.
Affiliation(s)
- Nan Du
  - Computer Science and Engineering, Michigan State University, East Lansing, 48824 USA
- Jiayu Shang
  - Electrical Engineering, City University of Hong Kong, Hong Kong, People’s Republic of China
- Yanni Sun
  - Electrical Engineering, City University of Hong Kong, Hong Kong, People’s Republic of China
|
43
|
Li Y, Ma L, Wu D, Chen G. Advances in bulk and single-cell multi-omics approaches for systems biology and precision medicine. Brief Bioinform 2021; 22:6189773. [PMID: 33778867 DOI: 10.1093/bib/bbab024] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 07/11/2020] [Revised: 12/31/2020] [Accepted: 01/20/2021] [Indexed: 12/13/2022]
Abstract
Multi-omics allows a systematic understanding of the information flow across different omics layers, whereas single-omics approaches mainly reflect one aspect of the biological system. The advancement of bulk and single-cell sequencing technologies and related computational methods for multi-omics has greatly facilitated the development of systems biology and precision medicine. Single-cell approaches have the advantage of dissecting cellular dynamics and heterogeneity, whereas traditional bulk technologies are limited to individual/population-level investigation. In this review, we first summarize the technologies for producing bulk and single-cell multi-omics data. Then, we survey the computational approaches for integrative analysis of bulk and single-cell multimodal data, respectively. Moreover, the databases and data storage for multi-omics, as well as the tools for visualizing multimodal data, are summarized. We also outline the integration between bulk and single-cell data, and discuss the applications of multi-omics in precision medicine. Finally, we present the challenges and perspectives for multi-omics development.
Affiliation(s)
- Lu Ma
  - China Normal University, China
|
44
|
Broseus L, Thomas A, Oldfield AJ, Severac D, Dubois E, Ritchie W. TALC: Transcript-level Aware Long-read Correction. Bioinformatics 2021; 36:5000-5006. [PMID: 32910174 DOI: 10.1093/bioinformatics/btaa634] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 03/03/2020] [Revised: 05/08/2020] [Accepted: 07/09/2020] [Indexed: 02/06/2023]
Abstract
MOTIVATION Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous 'hybrid correction' algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data. RESULTS We have created a novel reference-free algorithm called Transcript-level Aware Long-Read Correction (TALC) which models changes in RNA expression and isoform representation in a weighted De Bruijn graph to correct long reads from transcriptome studies. We show that transcript-level aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long read technology. AVAILABILITY AND IMPLEMENTATION TALC is implemented in C++ and available at https://github.com/lbroseus/TALC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lucile Broseus
  - Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
- Aubin Thomas
  - Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
- Andrew J Oldfield
  - Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
- Dany Severac
  - MGX-Montpellier GenomiX, c/o Institut de Génomique Fonctionnelle, Montpellier Cedex 5 34094, France
- Emeric Dubois
  - MGX-Montpellier GenomiX, c/o Institut de Génomique Fonctionnelle, Montpellier Cedex 5 34094, France
- William Ritchie
  - Department of Genome Dynamics, Institut de Génétique Humaine, Centre National de la Recherche Scientifique (CNRS), Université de Montpellier, Montpellier 34396, France
|
45
|
Ciuffreda L, Rodríguez-Pérez H, Flores C. Nanopore sequencing and its application to the study of microbial communities. Comput Struct Biotechnol J 2021; 19:1497-1511. [PMID: 33815688 PMCID: PMC7985215 DOI: 10.1016/j.csbj.2021.02.020] [Citation(s) in RCA: 105] [Impact Index Per Article: 26.3] [Received: 11/18/2020] [Revised: 02/24/2021] [Accepted: 02/27/2021] [Indexed: 12/14/2022]
Abstract
Since its introduction, nanopore sequencing has enhanced our ability to study complex microbial samples by making it possible to sequence long reads in real time using inexpensive and portable technologies. The use of long reads has made it possible to address several previously unsolved issues in the field, such as the resolution of complex genomic structures, and has facilitated access to metagenome-assembled genomes (MAGs). Furthermore, the low cost and portability of the platforms, together with the development of rapid protocols and analysis pipelines, have made nanopore technology an attractive and ever-growing tool for real-time in-field sequencing for environmental microbial analysis. This review provides an up-to-date summary of the experimental protocols and bioinformatic tools for the study of microbial communities using nanopore sequencing, highlighting the most important and recent research in the field with a major focus on infectious diseases. An overview of the main approaches, including targeted and shotgun approaches, metatranscriptomics, epigenomics, and epitranscriptomics, is provided, together with an outlook on the major challenges and perspectives for the use of this technology in microbial studies.
Affiliation(s)
- Laura Ciuffreda
  - Research Unit, Hospital Universitario N.S. de Candelaria, Universidad de La Laguna, 38010 Santa Cruz de Tenerife, Spain
- Héctor Rodríguez-Pérez
  - Research Unit, Hospital Universitario N.S. de Candelaria, Universidad de La Laguna, 38010 Santa Cruz de Tenerife, Spain
- Carlos Flores
  - Research Unit, Hospital Universitario N.S. de Candelaria, Universidad de La Laguna, 38010 Santa Cruz de Tenerife, Spain
  - CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, 28029 Madrid, Spain
  - Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), 38600 Santa Cruz de Tenerife, Spain
  - Instituto de Tecnologías Biomédicas (ITB), Universidad de La Laguna, 38200 Santa Cruz de Tenerife, Spain
|
46
|
van Belzen IAEM, Schönhuth A, Kemmeren P, Hehir-Kwa JY. Structural variant detection in cancer genomes: computational challenges and perspectives for precision oncology. NPJ Precis Oncol 2021; 5:15. [PMID: 33654267 PMCID: PMC7925608 DOI: 10.1038/s41698-021-00155-6] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Received: 08/17/2020] [Accepted: 01/12/2021] [Indexed: 01/31/2023]
Abstract
Cancer is generally characterized by acquired genomic aberrations in a broad spectrum of types and sizes, ranging from single nucleotide variants to structural variants (SVs). At least 30% of cancers have a known pathogenic SV used in diagnosis or treatment stratification. However, research into the role of SVs in cancer has been limited due to difficulties in detection. Biological and computational challenges confound SV detection in cancer samples, including intratumor heterogeneity, polyploidy, and distinguishing tumor-specific SVs from germline and somatic variants present in healthy cells. Classification of tumor-specific SVs is challenging due to inconsistencies in detected breakpoints, derived variant types, and the biological complexity of some rearrangements. Full-spectrum SV detection with high recall and precision requires integration of multiple algorithms and sequencing technologies to rescue variants that are difficult to resolve through individual methods. Here, we explore current strategies for integrating SV callsets and for enabling the use of tumor-specific SVs in precision oncology.
Affiliation(s)
- Alexander Schönhuth
  - Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Patrick Kemmeren
  - Princess Máxima Center for Pediatric Oncology, Utrecht, The Netherlands
- Jayne Y Hehir-Kwa
  - Princess Máxima Center for Pediatric Oncology, Utrecht, The Netherlands
|
47
|
Hu K, Huang N, Zou Y, Liao X, Wang J. MultiNanopolish: Refined grouping method for reducing redundant calculations in nanopolish. Bioinformatics 2021; 37:2757-2760. [PMID: 33532819 DOI: 10.1093/bioinformatics/btab078] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/02/2020] [Revised: 01/19/2021] [Accepted: 01/29/2021] [Indexed: 01/07/2023]
Abstract
MOTIVATION Compared with second-generation sequencing technologies, third-generation sequencing technologies yield longer reads (average ∼10 kbp, maximum 900 kbp) but have a higher error rate (∼15%). Nanopolish is a variant and methylation detection tool based on a hidden Markov model (HMM), which uses Oxford Nanopore sequencing data for signal-level analysis. Nanopolish can greatly improve the accuracy of an assembly, but it is limited by long running times because most of its execution is serial and computationally expensive. RESULTS In this paper, we present an effective polishing tool, Multithreading Nanopolish (MultiNanopolish), which decomposes the whole process of iterative calculation in Nanopolish into small independent calculation tasks, making it possible to run this process in parallel. Experimental results show that, compared to the original Nanopolish and running with 40 threads, MultiNanopolish reduces running time by 50% with a read-uncorrected assembler (Miniasm) and by 20% with read-corrected assemblers (Canu and Flye). AVAILABILITY MultiNanopolish is available at GitHub: https://github.com/BioinformaticsCSU/MultiNanopolish. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kang Hu
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Neng Huang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- You Zou
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Xingyu Liao
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Jianxin Wang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, 410083, China
|
48
|
Saud Z, Kortsinoglou AM, Kouvelis VN, Butt TM. Telomere length de novo assembly of all 7 chromosomes and mitogenome sequencing of the model entomopathogenic fungus, Metarhizium brunneum, by means of a novel assembly pipeline. BMC Genomics 2021; 22:87. [PMID: 33509090 PMCID: PMC7842015 DOI: 10.1186/s12864-021-07390-y] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 08/13/2020] [Accepted: 01/13/2021] [Indexed: 12/31/2022]
Abstract
Background More accurate and complete reference genomes have improved understanding of gene function, biology, and evolutionary mechanisms. Hybrid genome assembly approaches leverage benefits of both long, relatively error-prone reads from third-generation sequencing technologies and short, accurate reads from second-generation sequencing technologies, to produce more accurate and contiguous de novo genome assemblies in comparison to using either technology independently. In this study, we present a novel hybrid assembly pipeline that allowed for both mitogenome de novo assembly and telomere length de novo assembly of all 7 chromosomes of the model entomopathogenic fungus, Metarhizium brunneum. Results The improved assembly allowed for better ab initio gene prediction and a more BUSCO complete proteome set has been generated in comparison to the eight current NCBI reference Metarhizium spp. genomes. Remarkably, we note that including the mitogenome in ab initio gene prediction training improved overall gene prediction. The assembly was further validated by comparing contig assembly agreement across various assemblers, assessing the assembly performance of each tool. Genomic synteny and orthologous protein clusters were compared between Metarhizium brunneum and three other Hypocreales species with complete genomes, identifying core proteins, and listing orthologous protein clusters shared uniquely between the two entomopathogenic fungal species, so as to further facilitate the understanding of molecular mechanisms underpinning fungal-insect pathogenesis. Conclusions The novel assembly pipeline may be used for other haploid fungal species, facilitating the need to produce high-quality reference fungal genomes, leading to better understanding of fungal genomic evolution, chromosome structuring and gene regulation. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-021-07390-y.
Affiliation(s)
- Zack Saud
  - Department of Biosciences, College of Science, Swansea University, Singleton Park, Swansea, Wales, SA2 8PP, UK
- Alexandra M Kortsinoglou
  - Department of Genetics and Biotechnology, Faculty of Biology, National and Kapodistrian University of Athens, Panepistimiopolis, 15701, Athens, Greece
- Vassili N Kouvelis
  - Department of Genetics and Biotechnology, Faculty of Biology, National and Kapodistrian University of Athens, Panepistimiopolis, 15701, Athens, Greece
- Tariq M Butt
  - Department of Biosciences, College of Science, Swansea University, Singleton Park, Swansea, Wales, SA2 8PP, UK
|
49
|
Holley G, Beyter D, Ingimundardottir H, Møller PL, Kristmundsdottir S, Eggertsson HP, Halldorsson BV. Ratatosk: hybrid error correction of long reads enables accurate variant calling and assembly. Genome Biol 2021; 22:28. [PMID: 33419473 PMCID: PMC7792008 DOI: 10.1186/s13059-020-02244-4] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 07/17/2020] [Accepted: 12/15/2020] [Indexed: 12/20/2022]
Abstract
A major challenge with long-read sequencing data is its high error rate of up to 15%. We present Ratatosk, a method to correct long reads with short-read data. We demonstrate on 5 human genome trios that Ratatosk reduces the error rate of long reads 6-fold on average, with a median error rate as low as 0.22%. SNP calls in Ratatosk-corrected reads are nearly 99% accurate, and indel calling accuracy is increased by up to 37%. An assembly of Ratatosk-corrected reads from an Ashkenazi individual yields a contig N50 of 45 Mbp and fewer misassemblies than an assembly of PacBio HiFi reads.
Affiliation(s)
- Peter L Møller
  - Department of Biomedicine, Aarhus University, Aarhus, Denmark
- Snædis Kristmundsdottir
  - deCODE genetics/Amgen Inc., Reykjavík, Iceland
  - School of Technology, Reykjavik University, Reykjavík, Iceland
- Bjarni V Halldorsson
  - deCODE genetics/Amgen Inc., Reykjavík, Iceland
  - School of Technology, Reykjavik University, Reykjavík, Iceland
|
50
|
Hayrabedyan S, Kostova P, Zlatkov V, Todorova K. Single-cell transcriptomics in the context of long-read nanopore sequencing. BIOTECHNOL BIOTEC EQ 2021. [DOI: 10.1080/13102818.2021.1988868] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Soren Hayrabedyan
  - Laboratory of Reproductive OMICs Technologies, Institute of Biology and Immunology of Reproduction, Bulgarian Academy of Sciences, Sofia, Bulgaria
- Petya Kostova
  - Gynecology Clinic, National Oncology Hospital, Sofia, Bulgaria
- Viktor Zlatkov
  - Department of Obstetrics and Gynecology, Faculty of Medicine, Medical University of Sofia, Sofia, Bulgaria
- Krassimira Todorova
  - Laboratory of Reproductive OMICs Technologies, Institute of Biology and Immunology of Reproduction, Bulgarian Academy of Sciences, Sofia, Bulgaria
|