1
|
Ping P, Li J. Construction of edit-distance graphs for large sets of short reads through minimizer-bucketing. BIOINFORMATICS ADVANCES 2025; 5:vbaf081. [PMID: 40303904 PMCID: PMC12040381 DOI: 10.1093/bioadv/vbaf081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/19/2024] [Revised: 02/16/2025] [Accepted: 04/08/2025] [Indexed: 05/02/2025]
Abstract
Motivation Pairs of short reads with small edit distances, along with their unique molecular identifier tags, have been exploited to correct sequencing errors in both reads and tags. However, brute-force identification of these pairs is impractical for large datasets containing ten million or more reads due to its quadratic complexity. Minimizer-bucketing and locality-sensitive hashing have been used to partition read sets into buckets of similar reads, allowing edit-distance calculations only within each bucket. However, challenges like minimizing missing pairs, optimizing bucketing parameters, and exploring combination bucketing to improve pair detection remain. Results We define an edit-distance graph for a set of short reads, where nodes represent reads, and edges connect reads with small edit distances, and present a heuristic method, reads2graph, for high completeness of edge detection. Reads2graph uses three techniques: minimizer-bucketing, an improved Order-Min-Hash technique to divide large bins, and a novel graph neighbourhood multi-hop traversal within large bins to detect more edges. We then establish optimal bucketing settings to maximize ground truth edge coverage per bin. Extensive testing demonstrates that read2graph can achieve 97%-100% completeness in most cases, outperforming brute-force identification in speed while providing a superior speed-completeness balance compared to using a single bucketing method like Miniception or Order-Min-Hash. Availability and implementation reads2graph is publicly available at https://github.com/JappyPing/reads2graph.
Collapse
Affiliation(s)
- Pengyao Ping
- School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW 2007, Australia
| | - Jinyan Li
- School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW 2007, Australia
- School of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, Guangdong 518000, China
| |
Collapse
|
2
|
Ndiaye M, Prieto-Baños S, Fitzgerald LM, Yazdizadeh Kharrazi A, Oreshkov S, Dessimoz C, Sedlazeck FJ, Glover N, Majidian S. When less is more: sketching with minimizers in genomics. Genome Biol 2024; 25:270. [PMID: 39402664 PMCID: PMC11472564 DOI: 10.1186/s13059-024-03414-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2023] [Accepted: 10/01/2024] [Indexed: 10/19/2024] Open
Abstract
The exponential increase in sequencing data calls for conceptual and computational advances to extract useful biological insights. One such advance, minimizers, allows for reducing the quantity of data handled while maintaining some of its key properties. We provide a basic introduction to minimizers, cover recent methodological developments, and review the diverse applications of minimizers to analyze genomic data, including de novo genome assembly, metagenomics, read alignment, read correction, and pangenomes. We also touch on alternative data sketching techniques including universal hitting sets, syncmers, or strobemers. Minimizers and their alternatives have rapidly become indispensable tools for handling vast amounts of data.
Collapse
Affiliation(s)
- Malick Ndiaye
- Department of Fundamental Microbiology, UNIL, Lausanne, Switzerland
| | - Silvia Prieto-Baños
- Department of Computational Biology, UNIL, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | | | | | - Sergey Oreshkov
- Department of Endocrinology, Diabetology, Metabolism, CHUV, Lausanne, Switzerland
| | - Christophe Dessimoz
- Department of Computational Biology, UNIL, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | | | - Natasha Glover
- Department of Computational Biology, UNIL, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Sina Majidian
- Department of Computational Biology, UNIL, Lausanne, Switzerland.
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland.
| |
Collapse
|
3
|
Zheng H, Marçais G, Kingsford C. Creating and Using Minimizer Sketches in Computational Genomics. J Comput Biol 2023; 30:1251-1276. [PMID: 37646787 PMCID: PMC11082048 DOI: 10.1089/cmb.2023.0094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/01/2023] Open
Abstract
Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembling, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide the readers a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.
Collapse
Affiliation(s)
- Hongyu Zheng
- Computer Science Department, Princeton University, Princeton, New Jersey, USA
| | - Guillaume Marçais
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| | - Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| |
Collapse
|
4
|
Luo X, Wang Y, Zou Q, Xu L. Recall DNA methylation levels at low coverage sites using a CNN model in WGBS. PLoS Comput Biol 2023; 19:e1011205. [PMID: 37315069 DOI: 10.1371/journal.pcbi.1011205] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2022] [Accepted: 05/22/2023] [Indexed: 06/16/2023] Open
Abstract
DNA methylation is an important regulator of gene transcription. WGBS is the gold-standard approach for base-pair resolution quantitative of DNA methylation. It requires high sequencing depth. Many CpG sites with insufficient coverage in the WGBS data, resulting in inaccurate DNA methylation levels of individual sites. Many state-of-arts computation methods were proposed to predict the missing value. However, many methods required either other omics datasets or other cross-sample data. And most of them only predicted the state of DNA methylation. In this study, we proposed the RcWGBS, which can impute the missing (or low coverage) values from the DNA methylation levels on the adjacent sides. Deep learning techniques were employed for the accurate prediction. The WGBS datasets of H1-hESC and GM12878 were down-sampled. The average difference between the DNA methylation level at 12× depth predicted by RcWGBS and that at >50× depth in the H1-hESC and GM2878 cells are less than 0.03 and 0.01, respectively. RcWGBS performed better than METHimpute even though the sequencing depth was as low as 12×. Our work would help to process methylation data of low sequencing depth. It is beneficial for researchers to save sequencing costs and improve data utilization through computational methods.
Collapse
Affiliation(s)
- Ximei Luo
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| | - Yansu Wang
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
| |
Collapse
|
5
|
Identification of DNA-binding proteins via Multi-view LSSVM with independence criterion. Methods 2022; 207:29-37. [PMID: 36087888 DOI: 10.1016/j.ymeth.2022.08.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2022] [Revised: 08/06/2022] [Accepted: 08/25/2022] [Indexed: 11/24/2022] Open
Abstract
DNA-binding proteins actively participate in life activities such as DNA replication, recombination, gene expression and regulation and play a prominent role in these processes. As DNA-binding proteins continue to be discovered and increase, it is imperative to design an efficient and accurate identification tool. Considering the time-consuming and expensive traditional experimental technology and the insufficient number of samples in the biological computing method based on structural information, we proposed a machine learning algorithm based on sequence information to identify DNA binding proteins, named multi-view Least Squares Support Vector Machine via Hilbert-Schmidt Independence Criterion (multi-view LSSVM via HSIC). This method took 6 feature sets as multi-view input and trains a single view through the LSSVM algorithm. Then, we integrated HSIC into LSSVM as a regular term to reduce the dependence between views and explored the complementary information of multiple views. Subsequently, we trained and coordinated the submodels and finally combined the submodels in the form of weights to obtain the final prediction model. On training set PDB1075, the prediction results of our model were better than those of most existing methods. Independent tests are conducted on the datasets PDB186 and PDB2272. The accuracy of the prediction results was 85.5% and 79.36%, respectively. This result exceeded the current state-of-the-art methods, which showed that the multi-view LSSVM via HSIC can be used as an efficient predictor.
Collapse
|
6
|
Yao H, Hu G, Liu S, Fang H, Ji Y. SparkGC: Spark based genome compression for large collections of genomes. BMC Bioinformatics 2022; 23:297. [PMID: 35879669 PMCID: PMC9310413 DOI: 10.1186/s12859-022-04825-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2022] [Accepted: 07/06/2022] [Indexed: 11/23/2022] Open
Abstract
Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. One of the consequences is that it becomes extremely difficult to store, backup, and migrate enormous amount of genomic datasets, not to mention they continue to expand as the cost of sequencing decreases. Herein, a much more efficient and scalable program to perform genome compression is required urgently. In this manuscript, we propose a new Apache Spark based Genome Compression method called SparkGC that can run efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark’s in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is better than the best state-of-the-art methods, at least better by 30%. The compression speed is also at least 3.8 times that of the best state-of-the-art methods on only one worker node and scales quite well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC.
Collapse
Affiliation(s)
- Haichang Yao
- School of Computer and Software, Nanjing Vocational University of Industry Technology, Nanjing, 210023, China
| | - Guangyong Hu
- School of Computer and Software, Nanjing Vocational University of Industry Technology, Nanjing, 210023, China
| | - Shangdong Liu
- School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
| | - Houzhi Fang
- School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
| | - Yimu Ji
- School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China. .,Jiangsu HPC and Intelligent Processing Engineer Research Center, Nanjing, 210003, China. .,Institute of High Performance Computing and Bigdata, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China.
| |
Collapse
|
7
|
Tang T, Hutvagner G, Wang W, Li J. Simultaneous compression of multiple error-corrected short-read sets for faster data transmission and better de novo assemblies. Brief Funct Genomics 2022; 21:387-398. [PMID: 35848773 DOI: 10.1093/bfgp/elac016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2022] [Revised: 06/10/2022] [Accepted: 06/14/2022] [Indexed: 11/14/2022] Open
Abstract
Next-Generation Sequencing has produced incredible amounts of short-reads sequence data for de novo genome assembly over the last decades. For efficient transmission of these huge datasets, high-performance compression algorithms have been intensively studied. As both the de novo assembly and error correction methods utilize the overlaps between reads data, a concern is that the will the sequencing errors bring up negative effects on genome assemblies also affect the compression of the NGS data. This work addresses two problems: how current error correction algorithms can enable the compression algorithms to make the sequence data much more compact, and whether the sequence-modified reads by the error-correction algorithms will lead to quality improvement for de novo contig assembly. As multiple sets of short reads are often produced by a single biomedical project in practice, we propose a graph-based method to reorder the files in the collection of multiple sets and then compress them simultaneously for a further compression improvement after error correction. We use examples to illustrate that accurate error correction algorithms can significantly reduce the number of mismatched nucleotides in the reference-free compression, hence can greatly improve the compression performance. Extensive test on practical collections of multiple short-read sets does confirm that the compression performance on the error-corrected data (with unchanged size) significantly outperforms that on the original data, and that the file reordering idea contributes furthermore. The error correction on the original reads has also resulted in quality improvements of the genome assemblies, sometimes remarkably. However, it is still an open question that how to combine appropriate error correction methods with an assembly algorithm so that the assembly performance can be always significantly improved.
Collapse
Affiliation(s)
- Tao Tang
- Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia.,School of Mordern Posts, Nanjing University of Posts and Telecommunications, 9 Wenyuan Rd, Qixia District, 210003, Jiangsu, China
| | - Gyorgy Hutvagner
- School of Biomedical Engineering, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
| | - Wenjian Wang
- School of Computer and Information Technology, Shanxi University, Shanxi Road, 030006, Shanxi, China
| | - Jinyan Li
- Data Science Institute, University of Technology Sydney, 81 Broadway, Ultimo, 2007, NSW, Australia
| |
Collapse
|
8
|
Zhang H, Zou Q, Ju Y, Song C, Chen D. Distance-based support vector machine to predict DNA N6-methyladenine modification. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220404145517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
DNA N6-methyladenine plays an important role in the restriction-modification system to isolate invasion from adventive DNA. The shortcomings of the high time-consumption and high costs of experimental methods have been exposed, and some computational methods have emerged. The support vector machine theory has received extensive attention in the bioinformatics field due to its solid theoretical foundation and many good characteristics.
Objective:
General machine learning methods include an important step of extracting features. The research has omitted this step and replaced with easy-to-obtain sequence distances matrix to obtain better results
Method:
First sequence alignment technology was used to achieve the similarity matrix. Then a novel transformation turned the similarity matrix into a distance matrix. Next, the similarity-distance matrix is made positive semi-definite so that it can be used in the kernel matrix. Finally, the LIBSVM software was applied to solve the support vector machine.
Results:
The five-fold cross-validation of this model on rice and mouse data has achieved excellent accuracy rates of 92.04% and 96.51%, respectively. This shows that the DB-SVM method has obvious advantages compared with traditional machine learning methods. Meanwhile this model achieved 0.943,0.982 and 0.818 accuracy,0.944, 0.982, and 0.838 Matthews correlation coefficient and 0.942, 0.982 and 0.840 F1 scores for the rice, M. musculus and cross-species genome datasets, respectively.
Conclusion:
These outcomes show that this model outperforms the iIM-CNN and csDMA in the prediction of DNA 6mA modification, which are the lastest research on DNA 6mA.
Collapse
Affiliation(s)
- Haoyu Zhang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610051, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610051, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Chenggang Song
- Beidahuang Industry Group General Hospital, Harbin 150001, China
| | - Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou 324000, China
| |
Collapse
|
9
|
Zhao Z, Yang W, Zhai Y, Liang Y, Zhao Y. Identify DNA-Binding Proteins Through the Extreme Gradient Boosting Algorithm. Front Genet 2022; 12:821996. [PMID: 35154264 PMCID: PMC8837382 DOI: 10.3389/fgene.2021.821996] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2021] [Accepted: 12/07/2021] [Indexed: 12/13/2022] Open
Abstract
The exploration of DNA-binding proteins (DBPs) is an important aspect of studying biological life activities. Research on life activities requires the support of scientific research results on DBPs. The decline in many life activities is closely related to DBPs. Generally, the detection method for identifying DBPs is achieved through biochemical experiments. This method is inefficient and requires considerable manpower, material resources and time. At present, several computational approaches have been developed to detect DBPs, among which machine learning (ML) algorithm-based computational techniques have shown excellent performance. In our experiments, our method uses fewer features and simpler recognition methods than other methods and simultaneously obtains satisfactory results. First, we use six feature extraction methods to extract sequence features from the same group of DBPs. Then, this feature information is spliced together, and the data are standardized. Finally, the extreme gradient boosting (XGBoost) model is used to construct an effective predictive model. Compared with other excellent methods, our proposed method has achieved better results. The accuracy achieved by our method is 78.26% for PDB2272 and 85.48% for PDB186. The accuracy of the experimental results achieved by our strategy is similar to that of previous detection methods.
Collapse
Affiliation(s)
- Ziye Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Wen Yang
- International Medical Center, Shenzhen University General Hospital, Shenzhen, China
| | - Yixiao Zhai
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yingjian Liang
- Department of Obstetrics and Gynecology, The First Affiliated Hospital of Harbin Medical University, Harbin, China
- *Correspondence: Yingjian Liang, ; Yuming Zhao,
| | - Yuming Zhao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- *Correspondence: Yingjian Liang, ; Yuming Zhao,
| |
Collapse
|
10
|
Guo Y, Ju Y, Chen D, Wang L. Research on the Computational Prediction of Essential Genes. Front Cell Dev Biol 2021; 9:803608. [PMID: 34938741 PMCID: PMC8685449 DOI: 10.3389/fcell.2021.803608] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2021] [Accepted: 11/22/2021] [Indexed: 11/19/2022] Open
Abstract
Genes, the nucleotide sequences that encode a polypeptide chain or functional RNA, are the basic genetic unit controlling biological traits. They are the guarantee of the basic structures and functions in organisms, and they store information related to biological factors and processes such as blood type, gestation, growth, and apoptosis. The environment and genetics jointly affect important physiological processes such as reproduction, cell division, and protein synthesis. Genes are related to a wide range of phenomena including growth, decline, illness, aging, and death. During the evolution of organisms, there is a class of genes that exist in a conserved form in multiple species. These genes are often located on the dominant strand of DNA and tend to have higher expression levels. The protein encoded by it usually either performs very important functions or is responsible for maintaining and repairing these essential functions. Such genes are called persistent genes. Among them, the irreplaceable part of the body’s life activities is the essential gene. For example, when starch is the only source of energy, the genes related to starch digestion are essential genes. Without them, the organism will die because it cannot obtain enough energy to maintain basic functions. The function of the proteins encoded by these genes is thought to be fundamental to life. Nowadays, DNA can be extracted from blood, saliva, or tissue cells for genetic testing, and detailed genetic information can be obtained using the most advanced scientific instruments and technologies. The information gained from genetic testing is useful to assess the potential risks of disease, and to help determine the prognosis and development of diseases. Such information is also useful for developing personalized medication and providing targeted health guidance to improve the quality of life. Therefore, it is of great theoretical and practical significance to identify important and essential genes. In this paper, the research status of essential genes and the essential genome database of bacteria are reviewed, the computational prediction method of essential genes based on communication coding theory is expounded, and the significance and practical application value of essential genes are discussed.
Collapse
Affiliation(s)
- Yuxin Guo
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China.,Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China.,Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China.,School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China
| | - Dong Chen
- College of Electrical and Information Engineering, Quzhou University, Quzhou, China
| | - Lihong Wang
- Beidahuang Industry Group General Hospital, Harbin, China
| |
Collapse
|
11
|
Liu Y, Li J. Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression. PLoS Comput Biol 2021; 17:e1009229. [PMID: 34280186 PMCID: PMC8321399 DOI: 10.1371/journal.pcbi.1009229] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2021] [Revised: 07/29/2021] [Accepted: 06/30/2021] [Indexed: 11/21/2022] Open
Abstract
Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We present a novel graph definition named Hamming-Shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the weight-lightest edges, and we prove a very high probability of successfully detecting these edges. The resulted graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child-read for an optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph, and achieved a 10 - 30% more file size reduction compared to the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.
Collapse
Affiliation(s)
- Yuansheng Liu
- Data Science Institute, University of Technology Sydney, Sydney, Australia
| | - Jinyan Li
- Data Science Institute, University of Technology Sydney, Sydney, Australia
| |
Collapse
|
12
|
Zhang X, Liu Y, Yu Z, Blumenstein M, Hutvagner G, Li J. Instance-based error correction for short reads of disease-associated genes. BMC Bioinformatics 2021; 22:142. [PMID: 34078284 PMCID: PMC8170817 DOI: 10.1186/s12859-021-04058-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2021] [Accepted: 03/02/2021] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Genomic reads from sequencing platforms contain random errors. Global correction algorithms have been developed, aiming to rectify all possible errors in the reads using generic genome-wide patterns. However, the non-uniform sequencing depths hinder the global approach to conduct effective error removal. As some genes may get under-corrected or over-corrected by the global approach, we conduct instance-based error correction for short reads of disease-associated genes or pathways. The paramount requirement is to ensure the relevant reads, instead of the whole genome, are error-free to provide significant benefits for single-nucleotide polymorphism (SNP) or variant calling studies on the specific genes. RESULTS To rectify possible errors in the short reads of disease-associated genes, our novel idea is to exploit local sequence features and statistics directly related to these genes. Extensive experiments are conducted in comparison with state-of-the-art methods on both simulated and real datasets of lung cancer associated genes (including single-end and paired-end reads). The results demonstrated the superiority of our method with the best performance on precision, recall and gain rate, as well as on sequence assembly results (e.g., N50, the length of contig and contig quality). CONCLUSION Instance-based strategy makes it possible to explore fine-grained patterns focusing on specific genes, providing high precision error correction and convincing gene sequence assembly. SNP case studies show that errors occurring at some traditional SNP areas can be accurately corrected, providing high precision and sensitivity for investigations on disease-causing point mutations.
Collapse
Affiliation(s)
- Xuan Zhang
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Yuansheng Liu
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Zuguo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, 411105, China
| | - Michael Blumenstein
- Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Gyorgy Hutvagner
- Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
| | - Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia.
| |
Collapse
|
13
|
Tang T, Li J. Transformation of FASTA files into feature vectors for unsupervised compression of short reads databases. J Bioinform Comput Biol 2021; 19:2050048. [PMID: 33472569 DOI: 10.1142/s0219720020500481] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/02/2023]
Abstract
FASTA data sets of short reads are usually generated in tens or hundreds for a biomedical study. However, current compression of these data sets is carried out one-by-one without consideration of the inter-similarity between the data sets which can be otherwise exploited to enhance compression performance of de novo compression. We show that clustering these data sets into similar sub-groups for a group-by-group compression can greatly improve the compression performance. Our novel idea is to detect the lexicographically smallest k-mer (k-minimizer) for every read in each data set, and uses these k-mers as features and their frequencies in every data set as feature values to transform these huge data sets each into a characteristic feature vector. Unsupervised clustering algorithms are then applied to these vectors to find similar data sets and merge them. As the amount of common k-mers of similar feature values between two data sets implies an excessive proportion of overlapping reads shared between the two data sets, merging similar data sets creates immense sequence redundancy to boost the compression performance. Experiments confirm that our clustering approach can gain up to 12% improvement over several state-of-the-art algorithms in compressing reads databases consisting of 17-100 data sets (48.57-197.97[Formula: see text]GB).
Collapse
Affiliation(s)
- Tao Tang
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Broadway, NSW 2007, Australia
| | - Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Broadway, NSW 2007, Australia
| |
Collapse
|