1
Luo J, Guan T, Chen G, Yu Z, Zhai H, Yan C, Luo H. SLHSD: hybrid scaffolding method based on short and long reads. Brief Bioinform 2023; 24:7152317. [PMID: 37141142 DOI: 10.1093/bib/bbad169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2022] [Revised: 01/08/2023] [Accepted: 04/12/2023] [Indexed: 05/05/2023]
Abstract
In genome assembly, scaffolding yields more complete and contiguous scaffolds. Current scaffolding methods usually use a single type of read to construct a scaffold graph and then orient and order contigs. However, combining the strengths of two or more types of reads appears to be a better way to handle some difficult cases. Here, a hybrid scaffolding method (SLHSD) is presented that simultaneously leverages the precision of short reads and the length advantage of long reads. Building an optimal scaffold graph is an important foundation for obtaining scaffolds. SLHSD uses a new algorithm that combines long and short read alignment information to determine whether to add an edge and how to calculate the edge weight in a scaffold graph. In addition, SLHSD uses a strategy to ensure that high-confidence edges are added to the graph first. Then, a linear programming model is used to detect and remove the remaining false edges in the graph. We compared SLHSD with other scaffolding methods on five datasets. Experimental results show that SLHSD outperforms other methods. The open-source code of SLHSD is available at https://github.com/luojunwei/SLHSD.
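The idea of adding high-confidence edges to a scaffold graph first can be illustrated with a small generic sketch. This is not SLHSD's actual algorithm: the `Link` record, the additive weight (short-read support plus long-read support), and the `min_weight` threshold are all assumptions made for illustration, and the linear programming step mentioned in the abstract is not shown.

```python
from collections import namedtuple

# Hypothetical record: one candidate link between two contig ends, with the
# number of supporting short-read pairs and supporting long reads.
Link = namedtuple("Link", "end_a end_b short_support long_support")


def build_scaffold_graph(links, min_weight=2):
    """Greedily add edges in order of decreasing weight (assumed scoring)."""

    def weight(link):
        # Assumption: a plain sum of the two kinds of evidence.
        return link.short_support + link.long_support

    used_ends = set()
    edges = []
    for link in sorted(links, key=weight, reverse=True):
        w = weight(link)
        if w < min_weight:
            break  # all remaining candidates are weaker still
        if link.end_a in used_ends or link.end_b in used_ends:
            continue  # each contig end may join at most one edge
        used_ends.update((link.end_a, link.end_b))
        edges.append((link.end_a, link.end_b, w))
    return edges


links = [
    Link("c1:tail", "c2:head", 5, 3),
    Link("c1:tail", "c3:head", 1, 1),
    Link("c2:tail", "c3:head", 0, 4),
    Link("c3:tail", "c4:head", 1, 0),
]
print(build_scaffold_graph(links))
# [('c1:tail', 'c2:head', 8), ('c2:tail', 'c3:head', 4)]
```

In this toy input the weaker link from `c1:tail` to `c3:head` is skipped because `c1:tail` is already used by a stronger edge, which is the effect a priority ordering is meant to have.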
Affiliation(s)
- Junwei Luo
  - School of Software, Henan Polytechnic University, Jiaozuo 454003, China
- Ting Guan
  - School of Software, Henan Polytechnic University, Jiaozuo 454003, China
- Guolin Chen
  - School of Software, Henan Polytechnic University, Jiaozuo 454003, China
- Zhonghua Yu
  - School of Software, Henan Polytechnic University, Jiaozuo 454003, China
- Haixia Zhai
  - School of Software, Henan Polytechnic University, Jiaozuo 454003, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng 475001, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng 475001, China
2
Luo J, Wei Y, Lyu M, Wu Z, Liu X, Luo H, Yan C. A comprehensive review of scaffolding methods in genome assembly. Brief Bioinform 2021; 22:6149347. [PMID: 33634311 DOI: 10.1093/bib/bbab033] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 09/14/2020] [Revised: 01/21/2021] [Accepted: 01/22/2021] [Indexed: 12/20/2022]
Abstract
In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.
Affiliation(s)
- Junwei Luo
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Yawei Wei
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Mengna Lyu
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Zhengjiang Wu
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Xiaoyan Liu
  - College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, China
- Huimin Luo
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
- Chaokun Yan
  - School of Computer and Information Engineering, Henan University, Kaifeng, China
3
4
El-Khishin DA, Ageez A, Saad ME, Ibrahim A, Shokrof M, Hassan LR, Abouelhoda MI. Sequencing and assembly of the Egyptian buffalo genome. PLoS One 2020; 15:e0237087. [PMID: 32813723 PMCID: PMC7437910 DOI: 10.1371/journal.pone.0237087] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/21/2020] [Accepted: 07/20/2020] [Indexed: 01/09/2023]
Abstract
Water buffalo (Bubalus bubalis) is an important source of meat and milk in countries with relatively warm weather. Compared with the cattle genome, little has been done to reveal its genome structure and genomic traits. This is due to the complications stemming from the large genome size, the complexity of the genome, and the high repetitive content. In this paper, we introduce a high-quality draft assembly of the Egyptian water buffalo genome. The Egyptian breed is used as a dual-purpose animal (milk/meat). It is distinguished by its adaptability to the local environment and to changes in feed quality, as well as by its high resistance to diseases. The genome assembly of the Egyptian water buffalo was produced using a reference-based assembly workflow. Our workflow significantly reduced the computational complexity of the assembly process and improved the assembly quality by integrating different public resources. We also compared our assembly to the currently available draft assemblies of water buffalo breeds. A total of 21,128 genes were identified in the produced assembly. Lists of milk virgin-related, milk pregnancy-related, milk lactation-related, milk involution-related, and milk mastitis-related genes were identified in the assembly. Our results will significantly contribute to a better understanding of the genetics of the Egyptian water buffalo, which will eventually support the ongoing breeding efforts and facilitate the future discovery of genes responsible for complex processes of dairy and meat production and disease resistance, among other significant traits.
Affiliation(s)
- Dina A. El-Khishin
  - Agricultural Genetic Engineering Research Institute (AGERI), Agricultural Research Center (ARC), Giza, Egypt
- Amr Ageez
  - Agricultural Genetic Engineering Research Institute (AGERI), Agricultural Research Center (ARC), Giza, Egypt
  - Faculty of Biotechnology, MSA University, October City, Egypt
- Mohamed E. Saad
  - Agricultural Genetic Engineering Research Institute (AGERI), Agricultural Research Center (ARC), Giza, Egypt
  - Department of Biology, Taibah University, Almadinah Almonawarah, KSA
- Amr Ibrahim
  - Agricultural Genetic Engineering Research Institute (AGERI), Agricultural Research Center (ARC), Giza, Egypt
- Moustafa Shokrof
  - Centre for Informatics Sciences, Nile University Giza, October city, Egypt
  - Department of Computer Science, University of California at Davis, Davis, CA, United States of America
- Laila R. Hassan
  - Animal Production Research Institute, Agricultural Research Center (ARC), Ministry of Agriculture and Land Reclamation, Giza, Egypt
- Mohamed I. Abouelhoda
  - Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, Giza, Egypt
5
Affiliation(s)
- Weihua Pan
  - Department of Computer Science and Engineering, University of California, Riverside, California
- Tao Jiang
  - Department of Computer Science and Engineering, University of California, Riverside, California
- Stefano Lonardi
  - Department of Computer Science and Engineering, University of California, Riverside, California
6
Andonov R, Djidjev H, François S, Lavenier D. Complete assembly of circular and chloroplast genomes based on global optimization. J Bioinform Comput Biol 2019; 17:1950014. [PMID: 31288643 DOI: 10.1142/s0219720019500148] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
This paper focuses on the last two stages of genome assembly, namely, scaffolding and gap-filling, and shows that they can be solved as part of a single optimization problem. Our approach is based on modeling genome assembly as a problem of finding a simple path in a specific graph that satisfies as many distance constraints as possible encoding the insert-size information. We formulate it as a mixed-integer linear programming (MILP) problem and apply an optimization solver to find the exact solutions on a benchmark of chloroplasts. We show that the presence of repetitions in the set of unitigs is the main reason for the existence of multiple equivalent solutions that are associated to alternative subpaths. We also describe two sufficient conditions and we design efficient algorithms for identifying these subpaths. Comparisons of the results achieved by our tool with the ones obtained with recent assemblers are presented.
Affiliation(s)
- Rumen Andonov
  - Univ Rennes, Inria, CNRS, IRISA, F-35000 Rennes, France
- Hristo Djidjev
  - Los Alamos National Laboratory, Los Alamos, NM 87545, USA
7
Sutton TDS, Clooney AG, Ryan FJ, Ross RP, Hill C. Choice of assembly software has a critical impact on virome characterisation. Microbiome 2019; 7:12. [PMID: 30691529 PMCID: PMC6350398 DOI: 10.1186/s40168-019-0626-5] [Citation(s) in RCA: 98] [Impact Index Per Article: 16.3] [Received: 10/16/2018] [Accepted: 01/14/2019] [Indexed: 05/19/2023]
Abstract
BACKGROUND The viral component of microbial communities plays a vital role in driving bacterial diversity, facilitating nutrient turnover and shaping community composition. Despite their importance, the vast majority of viral sequences are poorly annotated and share little or no homology to reference databases. As a result, investigation of the viral metagenome (virome) relies heavily on de novo assembly of short sequencing reads to recover compositional and functional information. Metagenomic assembly is particularly challenging for virome data, often resulting in fragmented assemblies and poor recovery of viral community members. Despite the essential role of assembly in virome analysis and difficulties posed by these data, current assembly comparisons have been limited to subsections of virome studies or bacterial datasets. DESIGN This study presents the most comprehensive virome assembly comparison to date, featuring 16 metagenomic assembly approaches which have featured in human virome studies. Assemblers were assessed using four independent virome datasets, namely, simulated reads, two mock communities, viromes spiked with a known phage and human gut viromes. RESULTS Assembly performance varied significantly across all test datasets, with SPAdes (meta) performing consistently well. Performance of MIRA and VICUNA varied, highlighting the importance of using a range of datasets when comparing assembly programs. It was also found that while some assemblers addressed the challenges of virome data better than others, all assemblers had limitations. Low read coverage and genomic repeats resulted in assemblies with poor genome recovery, high degrees of fragmentation and low-accuracy contigs across all assemblers. These limitations must be considered when setting thresholds for downstream analysis and when drawing conclusions from virome data.
Affiliation(s)
- Thomas D S Sutton
  - APC Microbiome Ireland, Cork, Ireland
  - School for Microbiology, University College Cork, Cork, Ireland
- Adam G Clooney
  - APC Microbiome Ireland, Cork, Ireland
  - School for Microbiology, University College Cork, Cork, Ireland
- Feargal J Ryan
  - APC Microbiome Ireland, Cork, Ireland
  - School for Microbiology, University College Cork, Cork, Ireland
  - Present Address: South Australian Health and Medical Research Institute, Adelaide, Australia
- R Paul Ross
  - APC Microbiome Ireland, Cork, Ireland
  - School for Microbiology, University College Cork, Cork, Ireland
  - Teagasc Food Research Centre, Fermoy, Cork, Ireland
- Colin Hill
  - APC Microbiome Ireland, Cork, Ireland
  - School for Microbiology, University College Cork, Cork, Ireland
8
SCOP: a novel scaffolding algorithm based on contig classification and optimization. Bioinformatics 2018; 35:1142-1150. [DOI: 10.1093/bioinformatics/bty773] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 12/20/2017] [Revised: 08/10/2018] [Accepted: 09/01/2018] [Indexed: 12/20/2022]
9
Li M, Tang L, Liao Z, Luo J, Wu F, Pan Y, Wang J. A novel scaffolding algorithm based on contig error correction and path extension. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 16:764-773. [PMID: 30040649 DOI: 10.1109/tcbb.2018.2858267] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 06/08/2023]
Abstract
The sequence assembly process can be divided into three stages: contig extension, scaffolding, and gap filling. Scaffolding is an essential step in this process that infers the orientation and order relationships between the contigs. However, scaffolding still faces the challenges of uneven sequencing depth, repetitive genome regions, and sequencing errors, which often lead to many false relationships between contigs. The performance of scaffolding can be improved by removing potential false connections between contigs. In this study, a novel scaffolding algorithm called iLSLS is presented, based on a Loose-Strict-Loose path extension strategy and contig error correction. iLSLS helps reduce the false relationships between contigs and improves the accuracy of subsequent steps. iLSLS uses a scoring function that estimates the correctness of candidate paths from the distribution of paired reads, and tries to extend along the highest-scoring path. In addition, iLSLS can precisely estimate the gap size. We conduct experiments on two real datasets, and the results show that the LSLS strategy is effective at increasing the correctness of scaffolds and that iLSLS performs better than other scaffolding methods.
10
Aganezov SS, Alekseyev MA. CAMSA: a tool for comparative analysis and merging of scaffold assemblies. BMC Bioinformatics 2017; 18:496. [PMID: 29244014 PMCID: PMC5731503 DOI: 10.1186/s12859-017-1919-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 01/20/2023]
Abstract
BACKGROUND Despite the recent progress in genome sequencing and assembly, many of the currently available assembled genomes come in a draft form. Such draft genomes consist of a large number of genomic fragments (scaffolds), whose positions and orientations along the genome are unknown. While there exist a number of methods for reconstruction of the genome from its scaffolds, utilizing various computational and wet-lab techniques, they often can produce only partial, error-prone scaffold assemblies. It therefore becomes important to compare and merge scaffold assemblies produced by different methods, thus combining their advantages and highlighting present conflicts for further investigation. These tasks may be labor intensive if performed manually. RESULTS We present CAMSA, a tool for comparative analysis and merging of two or more given scaffold assemblies. The tool (i) creates an extensive report with several comparative quality metrics; (ii) constructs the most confident merged scaffold assembly; and (iii) provides an interactive framework for a visual comparative analysis of the given assemblies. Among the CAMSA features, only scaffold merging can be evaluated in comparison to existing methods. Namely, it resembles the functionality of assembly reconciliation tools, although their primary targets are somewhat different. Our evaluations show that CAMSA produces merged assemblies of comparable or better quality than existing assembly reconciliation tools while being the fastest in terms of the total running time. CONCLUSIONS CAMSA addresses the current deficiency of tools for automated comparison and analysis of multiple assemblies of the same set of scaffolds. Since there exist numerous methods and techniques for scaffold assembly, identifying similarities and dissimilarities across assemblies produced by different methods is beneficial both for the developers of scaffold assembly algorithms and for the researchers focused on improving draft assemblies of specific organisms.
Affiliation(s)
- Sergey S Aganezov
  - Princeton University, 35 Olden St., Princeton, 08450, NJ, USA
  - ITMO University, 49 Kronverksky Pr., St. Petersburg, 197101, Russia
- Max A Alekseyev
  - The George Washington University, 45085 University Dr., Suite 305, Ashburn, 20147, VA, USA
11
Martin G, Baurens FC, Droc G, Rouard M, Cenci A, Kilian A, Hastie A, Doležel J, Aury JM, Alberti A, Carreel F, D'Hont A. Improvement of the banana "Musa acuminata" reference sequence using NGS data and semi-automated bioinformatics methods. BMC Genomics 2016; 17:243. [PMID: 26984673 PMCID: PMC4793746 DOI: 10.1186/s12864-016-2579-4] [Citation(s) in RCA: 79] [Impact Index Per Article: 8.8] [Received: 08/04/2015] [Accepted: 03/08/2016] [Indexed: 12/04/2022]
Abstract
Background Recent advances in genomics indicate functional significance of a majority of genome sequences and their long range interactions. As a detailed examination of genome organization and function requires very high quality genome sequence, the objective of this study was to improve reference genome assembly of banana (Musa acuminata). Results We have developed a modular bioinformatics pipeline to improve genome sequence assemblies, which can handle various types of data. The pipeline comprises several semi-automated tools. However, unlike classical automated tools that are based on global parameters, the semi-automated tools proposed an expert mode for a user who can decide on suggested improvements through local compromises. The pipeline was used to improve the draft genome sequence of Musa acuminata. Genotyping by sequencing (GBS) of a segregating population and paired-end sequencing were used to detect and correct scaffold misassemblies. Long insert size paired-end reads identified scaffold junctions and fusions missed by automated assembly methods. GBS markers were used to anchor scaffolds to pseudo-molecules with a new bioinformatics approach that avoids the tedious step of marker ordering during genetic map construction. Furthermore, a genome map was constructed and used to assemble scaffolds into super scaffolds. Finally, a consensus gene annotation was projected on the new assembly from two pre-existing annotations. This approach reduced the total Musa scaffold number from 7513 to 1532 (i.e. by 80 %), with an N50 that increased from 1.3 Mb (65 scaffolds) to 3.0 Mb (26 scaffolds). 89.5 % of the assembly was anchored to the 11 Musa chromosomes compared to the previous 70 %. Unknown sites (N) were reduced from 17.3 to 10.0 %. Conclusion The release of the Musa acuminata reference genome version 2 provides a platform for detailed analysis of banana genome variation, function and evolution. Bioinformatics tools developed in this work can be used to improve genome sequence assemblies in other species.
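For readers unfamiliar with the N50 statistic quoted in this abstract, a minimal sketch of the usual definition follows: N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The scaffold counts given in parentheses in the abstract appear to correspond to the size of that set, often called L50. The function name and toy lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return (N50, L50) for a collection of scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # `length` is the N50; `count` scaffolds were needed to reach it.
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy assembly of 300 bp in total: 80 + 70 reaches the 150 bp halfway point.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # (70, 2)
```

A higher N50 reached with fewer scaffolds, as reported above, indicates that more of the assembly sits in long scaffolds.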
Affiliation(s)
- Guillaume Martin
  - CIRAD (Centre de coopération Internationale en Recherche Agronomique pour le Développement), UMR AGAP, TA A-108/03, Avenue Agropolis, F-34398, Montpellier, cedex 5, France
- Franc-Christophe Baurens
  - CIRAD (Centre de coopération Internationale en Recherche Agronomique pour le Développement), UMR AGAP, TA A-108/03, Avenue Agropolis, F-34398, Montpellier, cedex 5, France
- Gaëtan Droc
  - CIRAD (Centre de coopération Internationale en Recherche Agronomique pour le Développement), UMR AGAP, TA A-108/03, Avenue Agropolis, F-34398, Montpellier, cedex 5, France
- Mathieu Rouard
  - Bioversity International, Parc Scientifique Agropolis II, 34397, Montpellier, Cedex 5, France
- Alberto Cenci
  - Bioversity International, Parc Scientifique Agropolis II, 34397, Montpellier, Cedex 5, France
- Andrzej Kilian
  - Diversity Arrays Technology, Yarralumla, Australian Capital Territory, 2600, Australia
- Alex Hastie
  - BioNano Genomics, 9640 Towne Centre Drive, San Diego, CA, 92121, USA
- Jaroslav Doležel
  - Institute of Experimental Botany, Centre of the Region Hana for Biotechnological and Agricultural Research, Šlechtitelů 31, CZ-78371, Olomouc, Czech Republic
- Jean-Marc Aury
  - Commissariat à l'Energie Atomique (CEA), Institut de Genomique (IG), Genoscope, 2 rue Gaston Cremieux, BP5706, 91057, Evry, France
- Adriana Alberti
  - Commissariat à l'Energie Atomique (CEA), Institut de Genomique (IG), Genoscope, 2 rue Gaston Cremieux, BP5706, 91057, Evry, France
- Françoise Carreel
  - CIRAD (Centre de coopération Internationale en Recherche Agronomique pour le Développement), UMR AGAP, TA A-108/03, Avenue Agropolis, F-34398, Montpellier, cedex 5, France
- Angélique D'Hont
  - CIRAD (Centre de coopération Internationale en Recherche Agronomique pour le Développement), UMR AGAP, TA A-108/03, Avenue Agropolis, F-34398, Montpellier, cedex 5, France
12
Huang KW, Chen JL, Yang CS, Tsai CW. A memetic gravitation search algorithm for solving DNA fragment assembly problems. Journal of Intelligent & Fuzzy Systems 2016. [DOI: 10.3233/ifs-151994] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 11/15/2022]
Affiliation(s)
- Ko-Wei Huang
  - Institute of Computer and Communication Engineering, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
  - Department of Psychology, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Jui-Le Chen
  - Institute of Computer and Communication Engineering, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
  - Department of Computer Science and Entertainment Technology, Tajen University, Pingtung, Taiwan, R.O.C
- Chu-Sing Yang
  - Institute of Computer and Communication Engineering, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C
- Chun-Wei Tsai
  - Department of Computer Science and Information Engineering, National Ilan University, Yilan, Taiwan, R.O.C
13
Abstract
This paper presents new structural and algorithmic results around the scaffolding problem, which occurs prominently in next generation sequencing. The problem can be formalized as an optimization problem on a special graph, the "scaffold graph". We prove that the problem is polynomial if this graph is a tree by providing a dynamic programming algorithm for this case. This algorithm serves as a basis to deduce an exact algorithm for general graphs using a tree decomposition of the input. We explore other structural parameters, proving a linear-size problem kernel with respect to the size of a feedback-edge set on a restricted version of Scaffolding. Finally, we examine some parameters of scaffold graphs, which are based on real-world genomes, revealing that the feedback edge set is significantly smaller than the input size.
Affiliation(s)
- Mathias Weller
  - Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM) - Université de Montpellier - UMR 5506 CNRS, 161 rue Ada, 34090 Montpellier, France
  - Institut de Biologie Computationnelle, Lirmm Bât 5 - 860 rue de St Priest, 34090 Montpellier, France
- Annie Chateau
  - Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM) - Université de Montpellier - UMR 5506 CNRS, 161 rue Ada, 34090 Montpellier, France
  - Institut de Biologie Computationnelle, Lirmm Bât 5 - 860 rue de St Priest, 34090 Montpellier, France
- Rodolphe Giroudeau
  - Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM) - Université de Montpellier - UMR 5506 CNRS, 161 rue Ada, 34090 Montpellier, France
14
Anselmetti Y, Berry V, Chauve C, Chateau A, Tannier E, Bérard S. Ancestral gene synteny reconstruction improves extant species scaffolding. BMC Genomics 2015; 16 Suppl 10:S11. [PMID: 26450761 PMCID: PMC4603332 DOI: 10.1186/1471-2164-16-s10-s11] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 11/21/2022]
Abstract
We exploit the methodological similarity between ancestral genome reconstruction and extant genome scaffolding. We present a method, called ARt-DeCo that constructs neighborhood relationships between genes or contigs, in both ancestral and extant genomes, in a phylogenetic context. It is able to handle dozens of complete genomes, including genes with complex histories, by using gene phylogenies reconciled with a species tree, that is, annotated with speciation, duplication and loss events. Reconstructed ancestral or extant synteny comes with a support computed from an exhaustive exploration of the solution space. We compare our method with a previously published one that follows the same goal on a small number of genomes with universal unicopy genes. Then we test it on the whole Ensembl database, by proposing partial ancestral genome structures, as well as a more complete scaffolding for many partially assembled genomes on 69 eukaryote species. We carefully analyze a couple of extant adjacencies proposed by our method, and show that they are indeed real links in the extant genomes, that were missing in the current assembly. On a reduced data set of 39 eutherian mammals, we estimate the precision and sensitivity of ARt-DeCo by simulating a fragmentation in some well assembled genomes, and measure how many adjacencies are recovered. We find a very high precision, while the sensitivity depends on the quality of the data and on the proximity of closely related genomes.
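The evaluation described at the end of this abstract (fragment a well-assembled genome, then count how many adjacencies are recovered) reduces to a precision and sensitivity calculation over sets of adjacencies. The sketch below shows that calculation under simplifying assumptions: adjacencies are treated as unordered pairs of gene or contig names and orientation is ignored. The function and variable names are hypothetical and are not taken from ARt-DeCo.

```python
def adjacency_scores(predicted, truth):
    """Return (precision, sensitivity) for predicted vs. true adjacencies."""
    # Unordered pairs: ("a", "b") and ("b", "a") denote the same adjacency.
    pred = {frozenset(pair) for pair in predicted}
    true = {frozenset(pair) for pair in truth}
    recovered = len(pred & true)
    precision = recovered / len(pred) if pred else 0.0
    sensitivity = recovered / len(true) if true else 0.0
    return precision, sensitivity


# Adjacencies removed by a simulated fragmentation (the ground truth) ...
truth = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
# ... and adjacencies proposed by a scaffolding method.
predicted = [("b", "a"), ("c", "d"), ("a", "e")]

precision, sensitivity = adjacency_scores(predicted, truth)
print(round(precision, 3), sensitivity)  # 0.667 0.5
```

Here two of the three proposed adjacencies are correct (precision 2/3) and two of the four true adjacencies are recovered (sensitivity 1/2).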
Affiliation(s)
- Yoann Anselmetti
  - Institut des Sciences de l'Évolution de Montpellier (ISE-M), Place Eugène Bataillon, Montpellier, 34095, France
  - Laboratoire de Biométrie et Biologie Évolutive, LBBE, UMR CNRS 5558, University of Lyon 1, 43 boulevard du 11 novembre 1918, 69622 Villeurbanne, France
- Vincent Berry
  - Institut de Biologie Computationnelle (IBC), Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM), Université Montpellier - CNRS, 161 rue Ada, Montpellier, 34090, France
- Cedric Chauve
  - Department of Mathematics, Simon Fraser University, 8888 University Drive, Burnaby, V5A 1S6, Canada
- Annie Chateau
  - Institut de Biologie Computationnelle (IBC), Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM), Université Montpellier - CNRS, 161 rue Ada, Montpellier, 34090, France
- Eric Tannier
  - Laboratoire de Biométrie et Biologie Évolutive, LBBE, UMR CNRS 5558, University of Lyon 1, 43 boulevard du 11 novembre 1918, 69622 Villeurbanne, France
  - Institut National de Recherche en Informatique et en Automatique (INRIA) Grenoble Rhône-Alpes, 655 avenue de l'Europe, 38330 Montbonnot, France
- Sèverine Bérard
  - Institut des Sciences de l'Évolution de Montpellier (ISE-M), Place Eugène Bataillon, Montpellier, 34095, France
  - Institut de Biologie Computationnelle (IBC), Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM), Université Montpellier - CNRS, 161 rue Ada, Montpellier, 34090, France
15
Bosi E, Donati B, Galardini M, Brunetti S, Sagot MF, Lió P, Crescenzi P, Fani R, Fondi M. MeDuSa: a multi-draft based scaffolder. Bioinformatics 2015; 31:2443-51. [PMID: 25810435 DOI: 10.1093/bioinformatics/btv171] [Citation(s) in RCA: 302] [Impact Index Per Article: 30.2] [Received: 07/23/2014] [Accepted: 03/19/2015] [Indexed: 01/07/2023]
Abstract
MOTIVATION Completing the genome sequence of an organism is an important task in comparative, functional and structural genomics. However, this remains a challenging issue from both a computational and an experimental viewpoint. Genome scaffolding (i.e. the process of ordering and orientating contigs) of de novo assemblies usually represents the first step in most genome finishing pipelines. RESULTS In this article we present MeDuSa (Multi-Draft based Scaffolder), an algorithm for genome scaffolding. MeDuSa exploits information obtained from a set of (draft or closed) genomes from related organisms to determine the correct order and orientation of the contigs. MeDuSa formalizes the scaffolding problem by means of a combinatorial optimization formulation on graphs and implements an efficient constant factor approximation algorithm to solve it. In contrast to currently used scaffolders, it requires neither prior knowledge of the microorganism dataset under analysis (e.g. their phylogenetic relationships) nor the availability of paired-end read libraries. This makes usability and running time two additional important features of our method. Moreover, benchmarks and tests on real bacterial datasets showed that MeDuSa is highly accurate and, in most cases, outperforms traditional scaffolders. The possibility of using MeDuSa on eukaryotic datasets has also been evaluated, leading to interesting results.
Affiliation(s)
- Emanuele Bosi
  - Department of Biology, ComBo, Florence Computational Biology Group, Department of Biology, LEMM, Laboratory of Microbial and Molecular Evolution Florence, University of Florence, I-50019 Sesto F.no, Italy
- Beatrice Donati
  - INRIA Rhône-Alpes, Villeurbanne Cedex, France, Université de Lyon, F-69000 Lyon, France, Dipartimento di Ingegneria dell'Informazione, University of Florence, I-50139 Firenze, Italy
- Marco Galardini
  - EMBL-EBI - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD Cambridge, UK
- Sara Brunetti
  - Dipartimento di Ingegneria dell'Informazione e Scienze Matematiche, University of Siena, Siena I-53100, Italy
- Marie-France Sagot
  - INRIA Rhône-Alpes, Villeurbanne Cedex, France, Université de Lyon, F-69000 Lyon, France, Université Lyon 1, CNRS, UMR5558, 69622 Villeurbanne Cedex, France
- Pietro Lió
  - Computer Laboratory, University of Cambridge, CB3 0FD Cambridge, UK
- Pierluigi Crescenzi
  - Dipartimento di Ingegneria dell'Informazione, University of Florence, I-50139 Firenze, Italy
- Renato Fani
  - Department of Biology, ComBo, Florence Computational Biology Group, Department of Biology, LEMM, Laboratory of Microbial and Molecular Evolution Florence, University of Florence, I-50019 Sesto F.no, Italy
- Marco Fondi
  - Department of Biology, ComBo, Florence Computational Biology Group, Department of Biology, LEMM, Laboratory of Microbial and Molecular Evolution Florence, University of Florence, I-50019 Sesto F.no, Italy
16
|
Eastman AW, Yuan ZC. Development and validation of an rDNA operon based primer walking strategy applicable to de novo bacterial genome finishing. Front Microbiol 2015; 5:769. [PMID: 25653642 PMCID: PMC4301005 DOI: 10.3389/fmicb.2014.00769] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 11/07/2014] [Accepted: 12/16/2014] [Indexed: 01/10/2023] Open
Abstract
Advances in sequencing technology have drastically increased the depth and feasibility of bacterial genome sequencing. However, little information is available that details the specific techniques and procedures employed during genome sequencing despite the large numbers of published genomes. Shotgun approaches employed by second-generation sequencing platforms have necessitated the development of robust bioinformatics tools for in silico assembly, and complete assembly is limited by the presence of repetitive DNA sequences and multi-copy operons. Typically, re-sequencing with multiple platforms and laborious, targeted Sanger sequencing are employed to finish a draft bacterial genome. Here we describe a novel strategy based on the identification and targeted sequencing of repetitive rDNA operons to expedite bacterial genome assembly and finishing. Our strategy was validated by finishing the genome of Paenibacillus polymyxa strain CR1, a bacterium with potential in sustainable agriculture and bio-based processes. An analysis of the 38 contigs contained in the P. polymyxa strain CR1 draft genome revealed 12 repetitive rDNA operons with varied intragenic and flanking regions of variable length, all located at contig boundaries and within contig gaps. These highly similar but not identical rDNA operons were experimentally verified and sequenced simultaneously with multiple, specially designed primer sets. This approach also identified and corrected significant sequence rearrangement generated during the initial in silico assembly of sequencing reads. Our approach reduces the effort associated with blind primer walking for contig assembly, increasing both the speed and feasibility of genome finishing. Our study further reinforces the notion that repetitive DNA elements are major limiting factors for genome finishing. Moreover, we provide a step-by-step workflow for genome finishing, which may guide future bacterial genome finishing projects.
Affiliation(s)
- Alexander W Eastman
- Southern Crop Protection and Food Research Centre, Agriculture and Agri-Food Canada, Government of Canada, London, ON, Canada; Department of Microbiology and Immunology, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
- Ze-Chun Yuan
- Southern Crop Protection and Food Research Centre, Agriculture and Agri-Food Canada, Government of Canada, London, ON, Canada; Department of Microbiology and Immunology, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
|
17
|
Adey A, Kitzman JO, Burton JN, Daza R, Kumar A, Christiansen L, Ronaghi M, Amini S, Gunderson KL, Steemers FJ, Shendure J. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity. Genome Res 2014; 24:2041-9. [PMID: 25327137 PMCID: PMC4248320 DOI: 10.1101/gr.178319.114] [Citation(s) in RCA: 124] [Impact Index Per Article: 11.3] [Indexed: 01/13/2023]
Abstract
We describe a method that exploits contiguity preserving transposase sequencing (CPT-seq) to facilitate the scaffolding of de novo genome assemblies. CPT-seq is an entirely in vitro means of generating libraries comprising 9216 indexed pools, each of which contains thousands of sparsely sequenced long fragments ranging from 5 kilobases to >1 megabase. These pools are “subhaploid,” in that the lengths of fragments contained in each pool sum to ∼5% to 10% of the full genome. The scaffolding approach described here, termed fragScaff, leverages coincidences between the content of different pools as a source of contiguity information. Specifically, CPT-seq data are mapped to a de novo genome assembly, followed by the identification of pairs of contigs or scaffolds whose ends disproportionately co-occur in the same indexed pools, consistent with true adjacency in the genome. Such candidate “joins” are used to construct a graph, which is then resolved by a minimum spanning tree. As a proof-of-concept, we apply CPT-seq and fragScaff to substantially boost the contiguity of de novo assemblies of the human, mouse, and fly genomes, increasing the scaffold N50 of de novo assemblies by eight- to 57-fold with high accuracy. We also demonstrate that fragScaff is complementary to Hi-C-based contact probability maps, providing midrange contiguity to support robust, accurate chromosome-scale de novo genome assemblies without the need for laborious in vivo cloning steps. Finally, we demonstrate CPT-seq as a means of anchoring unplaced novel human contigs to the reference genome as well as for detecting misassembled sequences.
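The idea of scoring contig ends by pool co-occurrence and resolving the resulting graph with a minimum spanning tree can be illustrated with a small sketch. This is not the fragScaff implementation: the Jaccard index as the co-occurrence score, the `min_score` threshold, the edge weight `1 - score`, and all function and variable names are assumptions made for illustration only.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two sets of pool IDs (assumed score, not fragScaff's statistic)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidate_joins(pools_by_end, min_score=0.3):
    """Score pairs of contig ends by how strongly they share indexed pools."""
    joins = []
    for (u, pu), (v, pv) in combinations(sorted(pools_by_end.items()), 2):
        if u[0] == v[0]:
            continue  # the two ends of one contig are not a join
        score = jaccard(pu, pv)
        if score >= min_score:
            joins.append((1.0 - score, u, v))  # low weight = strong evidence
    return joins

def minimum_spanning_forest(nodes, edges):
    """Kruskal's algorithm with a small union-find; returns the kept edges."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            kept.append((u, v, weight))
    return kept

# Hypothetical input: (contig, end) -> set of pool indexes with reads mapped there.
pools_by_end = {
    ("c1", "L"): {20},
    ("c1", "R"): {1, 2, 3, 4},
    ("c2", "L"): {1, 2, 3, 5},
    ("c2", "R"): {7, 8, 9},
    ("c3", "L"): {7, 8, 9, 10},
    ("c3", "R"): {30},
}

joins = candidate_joins(pools_by_end)
tree = minimum_spanning_forest(pools_by_end, joins)
for u, v, weight in tree:
    print(u, "--", v, round(weight, 2))
```

In this toy input the right end of c1 shares most of its pools with the left end of c2, and the right end of c2 with the left end of c3, so the kept edges suggest the order c1, c2, c3.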
Affiliation(s)
- Andrew Adey
- Department of Genome Sciences, University of Washington, Seattle, Washington 98115, USA
- Jacob O Kitzman
- Department of Genome Sciences, University of Washington, Seattle, Washington 98115, USA
- Joshua N Burton
- Department of Genome Sciences, University of Washington, Seattle, Washington 98115, USA
- Riza Daza
- Department of Genome Sciences, University of Washington, Seattle, Washington 98115, USA
- Akash Kumar
- Department of Genome Sciences, University of Washington, Seattle, Washington 98115, USA
- Lena Christiansen
- Illumina, Inc., Advanced Research Group, San Diego, California 92122, USA
- Mostafa Ronaghi
- Illumina, Inc., Advanced Research Group, San Diego, California 92122, USA
- Sasan Amini
- Illumina, Inc., Advanced Research Group, San Diego, California 92122, USA
- Kevin L Gunderson
- Illumina, Inc., Advanced Research Group, San Diego, California 92122, USA
- Frank J Steemers
- Illumina, Inc., Advanced Research Group, San Diego, California 92122, USA
- Jay Shendure
- Department of Genome Sciences, University of Washington, Seattle, Washington 98115, USA
|
18
|
Bao E, Jiang T, Girke T. AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references. Bioinformatics 2014; 30:i319-i328. [PMID: 24932000 PMCID: PMC4058956 DOI: 10.1093/bioinformatics/btu291] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.3] [Indexed: 01/08/2023] Open
Abstract
Motivation: De novo assemblies of genomes remain one of the most challenging applications in next-generation sequencing. Usually, their results are incomplete and fragmented into hundreds of contigs. Repeats in genomes and sequencing errors are the main reasons for these complications. With the rapidly growing number of sequenced genomes, it is now feasible to improve assemblies by guiding them with genomes from related species. Results: Here we introduce AlignGraph, an algorithm for extending and joining de novo-assembled contigs or scaffolds guided by closely related reference genomes. It aligns paired-end (PE) reads and preassembled contigs or scaffolds to a close reference. From the obtained alignments, it builds a novel data structure, called the PE multipositional de Bruijn graph. The incorporated positional information from the alignments and PE reads allows us to extend the initial assemblies, while avoiding incorrect extensions and early terminations. In our performance tests, AlignGraph was able to substantially improve the contigs and scaffolds from several assemblers. For instance, 28.7–62.3% of the contigs of Arabidopsis thaliana and human could be extended, resulting in improvements of common assembly metrics, such as an increase of the N50 of the extendable contigs by 89.9–94.5% and 80.3–165.8%, respectively. In another test, AlignGraph was able to improve the assembly of a published genome (Arabidopsis strain Landsberg) by increasing the N50 of its extendable scaffolds by 86.6%. These results demonstrate AlignGraph’s efficiency in improving genome assemblies by taking advantage of closely related references. Availability and implementation: The AlignGraph software can be downloaded for free from this site: https://github.com/baoe/AlignGraph. Contact: thomas.girke@ucr.edu
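Because several of these abstracts report improvements as changes in N50, a minimal sketch of the metric may help. N50 is the length L such that sequences of length at least L account for at least half of the total assembly length. The function name and the toy lengths below are illustrative assumptions and are not taken from any of the cited tools or datasets.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:  # longest sequences now cover half the assembly
            return length

# Toy example: joining the 60 and 40 length contigs into one scaffold of 100.
before = [100, 80, 60, 40, 20]
after = [100, 100, 80, 20]

gain = (n50(after) - n50(before)) / n50(before) * 100
print(n50(before), n50(after), f"{gain:.1f}% increase")
```

A single join can shift N50 noticeably even though the total assembly length is unchanged, which is why N50 gains are usually read alongside measures of join correctness.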
Affiliation(s)
- Ergude Bao
- Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
- Tao Jiang
- Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
- Thomas Girke
- Department of Computer Science and Engineering and Department of Botany and Plant Sciences, University of California, Riverside, CA 92521, USA
|
19
|
Sahlin K, Vezzi F, Nystedt B, Lundeberg J, Arvestad L. BESST--efficient scaffolding of large fragmented assemblies. BMC Bioinformatics 2014; 15:281. [PMID: 25128196 PMCID: PMC4262078 DOI: 10.1186/1471-2105-15-281] [Citation(s) in RCA: 117] [Impact Index Per Article: 10.6] [Received: 01/11/2014] [Accepted: 07/18/2014] [Indexed: 12/14/2022] Open
Abstract
Background: The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, thus making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed but many of them use the number of read pairs supporting a linking of two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features. We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software’s general performance. Results: We propose a new algorithm, implemented in a tool called BESST, which can scaffold genomes of all sizes and complexities and was used to scaffold the genome of P. abies (20 Gbp). We performed a comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets. Our results confirm that some of the popular scaffolders are not practical to run on complex datasets. Furthermore, no single stand-alone scaffolder outperforms the others on all datasets. However, BESST compares favorably with the other tested scaffolders on GAGE datasets and, moreover, outperforms the other methods when library insert size distribution is wide. Conclusion: We conclude from our results that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.
Affiliation(s)
- Kristoffer Sahlin
- Science for Life Laboratory, School of Computer Science and Communication, KTH Royal Institute of Technology, Solna, Sweden.
|
20
|
Jiang Y, Xu P, Liu Z. Generation of physical map contig-specific sequences. Front Genet 2014; 5:243. [PMID: 25101119 PMCID: PMC4105628 DOI: 10.3389/fgene.2014.00243] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/17/2014] [Accepted: 07/07/2014] [Indexed: 12/13/2022] Open
Abstract
Rapid advances in next-generation sequencing technologies have allowed whole genome sequencing of many species. However, with current sequencing technologies, whole genome sequence assemblies often fall short in one of four quality measures: accuracy, contiguity, connectivity, and completeness. In particular, small contigs and scaffolds limit the applicability of whole genome sequences for genetic analysis. To enhance the quality of whole genome sequence assemblies, particularly the scaffolding capabilities, additional genomic resources are required. Among these, sequences derived from known physical locations offer great power for scaffolding. In this mini-review, we describe the principles, procedures and applications of physical-map-derived sequences, with a focus on physical map contig-specific sequences.
Affiliation(s)
- Yanliang Jiang
- Centre for Applied Aquatic Genomics, Chinese Academy of Fishery Sciences, Beijing, China
- Peng Xu
- Centre for Applied Aquatic Genomics, Chinese Academy of Fishery Sciences, Beijing, China
- Zhanjiang Liu
- Aquatic Genomics Unit, The Fish Molecular Genetics and Biotechnology Laboratory, School of Fisheries, Aquaculture and Aquatic Sciences, and Program of Cell and Molecular Biosciences, Auburn University, AL, USA
|
21
|
Huang KW, Chen JL, Yang CS, Tsai CW. A memetic particle swarm optimization algorithm for solving the DNA fragment assembly problem. Neural Comput Appl 2014. [DOI: 10.1007/s00521-014-1659-0] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Indexed: 01/22/2023]
|
22
|
Hunt M, Newbold C, Berriman M, Otto TD. A comprehensive evaluation of assembly scaffolding tools. Genome Biol 2014; 15:R42. [PMID: 24581555 PMCID: PMC4053845 DOI: 10.1186/gb-2014-15-3-r42] [Citation(s) in RCA: 103] [Impact Index Per Article: 9.4] [Received: 10/15/2013] [Accepted: 03/03/2014] [Indexed: 11/12/2022] Open
Abstract
Background: Genome assembly is typically a two-stage process: contig assembly followed by the use of paired sequencing reads to join contigs into scaffolds. Scaffolds are usually the focus of reported assembly statistics; longer scaffolds greatly facilitate the use of genome sequences in downstream analyses, and it is appealing to present larger numbers as metrics of assembly performance. However, scaffolds are highly prone to errors, especially when generated using short reads, which can directly result in inflated assembly statistics. Results: Here we provide the first independent evaluation of scaffolding tools for second-generation sequencing data. We find large variations in the quality of results depending on the tool and dataset used. Even extremely simple test cases of perfect input, constructed to elucidate the behaviour of each algorithm, produced some surprising results. We further dissect the performance of the scaffolders using real and simulated sequencing data derived from the genomes of Staphylococcus aureus, Rhodobacter sphaeroides, Plasmodium falciparum and Homo sapiens. The results from simulated data are of high quality, with several of the tools producing perfect output. However, at least 10% of joins remain unidentified when using real data. Conclusions: The scaffolders vary in their usability, speed and number of correct and missed joins made between contigs. Results from real data highlight opportunities for further improvements of the tools. Overall, SGA, SOPRA and SSPACE generally outperform the other tools on our datasets. However, the quality of the results is highly dependent on the read mapper and genome complexity.
|
23
|
El-Metwally S, Ouda OM, Helmy M. Approaches and Challenges of Next-Generation Sequence Assembly Stages. Next Generation Sequencing Technologies and Challenges in Sequence Assembly 2014. [DOI: 10.1007/978-1-4939-0715-1_9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/12/2023]
|
24
|
Tang B, Wang Q, Yang M, Xie F, Zhu Y, Zhuo Y, Wang S, Gao H, Ding X, Zhang L, Zhao G, Zheng H. ContigScape: a Cytoscape plugin facilitating microbial genome gap closing. BMC Genomics 2013; 14:289. [PMID: 23627759 PMCID: PMC3651407 DOI: 10.1186/1471-2164-14-289] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Received: 12/29/2012] [Accepted: 04/20/2013] [Indexed: 01/22/2023] Open
Abstract
Background: With the emergence of next-generation sequencing, the availability of prokaryotic genome sequences is expanding rapidly. A total of 5,276 genomes have been released since 2008, yet only 1,692 of them were complete. The final phase of microbial genome sequencing, particularly gap closing, is frequently the rate-limiting step, either because of complex genomic structures that cause sequence bias even with high genomic coverage, or because of repeat sequences that may cause gaps in assembly. Results: We have developed a Cytoscape plugin to facilitate gap closing for high-throughput sequencing data from microbial genomes. This plugin is capable of interactively displaying the relationships among genomic contigs derived from various sequencing formats. The sequence contigs of plasmids and special repeats (IS elements, ribosomal RNAs, terminal repeats, etc.) can be displayed as well. Conclusions: Displaying relationships between contigs using graphs in Cytoscape rather than tables provides a more straightforward visual representation. This will facilitate a faster and more precise determination of the linkages among contigs and greatly improve the efficiency of gap closing.
Affiliation(s)
- Biao Tang
- State Key Laboratory of Genetic Engineering, Department of Microbiology, School of Life Sciences, Fudan University, Shanghai 200433, China
|