1
Orlov YL, Orlova NG. Bioinformatics tools for the sequence complexity estimates. Biophys Rev 2023; 15:1367-1378. [PMID: 37974990 PMCID: PMC10643780 DOI: 10.1007/s12551-023-01140-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/15/2023] [Accepted: 09/01/2023] [Indexed: 11/19/2023]
Abstract
We review current methods and bioinformatics tools for text complexity estimates (information and entropy measures). The search for DNA regions with extreme statistical characteristics, such as low complexity regions, is important for biophysical models of chromosome function and gene transcription regulation at genome scale. We discuss complexity profiling for segmentation and delineation of genome sequences, the search for genome repeats and transposable elements, and applications to next-generation sequencing reads. We review the complexity methods and new application fields: analysis of mutation hotspot loci, analysis of short sequencing reads with quality control, and alignment-free genome comparisons. Algorithms implementing various numerical measures of text complexity, including combinatorial and linguistic measures, were developed before the genome sequencing era. A series of tools for estimating sequence complexity use compression approaches, mainly modifications of Lempel-Ziv compression. Most of the tools are available online, providing large-scale service for whole genome analysis. Novel machine learning applications for classification of complete genome sequences also include sequence compression and complexity algorithms. We present a comparison of the complexity methods on different sequence sets and their applications to the analysis of gene transcription regulatory regions. Furthermore, we discuss approaches to and applications of sequence complexity for proteins. Complexity measures for amino acid sequences can be calculated by the same entropy and compression-based algorithms, but the functional and evolutionary roles of low complexity regions in proteins have specific features that differ from those in DNA. The tools for protein sequence complexity are aimed at protein structural constraints. It has been shown that low complexity regions in protein sequences are conserved in evolution and have important biological and structural functions. Finally, we summarize recent findings in large-scale genome complexity comparison and applications to coronavirus genome analysis.
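As a rough illustration of the entropy measures mentioned in this abstract, the following sketch computes Shannon entropy over sliding windows of a DNA string. It is not taken from the reviewed tools; the function names `shannon_entropy` and `entropy_profile`, the window size, and the step are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per symbol) of the symbol frequencies in seq."""
    if not seq:
        return 0.0
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_profile(seq: str, window: int = 20, step: int = 1) -> list[float]:
    """Entropy of each sliding window along seq (hypothetical helper).

    Windows with values near 0 are candidates for low complexity regions;
    for DNA the maximum is 2 bits (all four bases equally frequent).
    """
    return [
        shannon_entropy(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # A homopolymer run followed by a mixed segment.
    print(entropy_profile("AAAAAAAAACGTACGT", window=8, step=4))
```

A single-symbol window scores 0 bits and a window with equal base frequencies scores 2 bits, so a threshold between those values could be used to flag low complexity segments. Compression-based measures such as Lempel-Ziv variants capture repeats that symbol-frequency entropy misses.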
Affiliation(s)
- Yuriy L. Orlov
  - The Digital Health Institute, I.M. Sechenov First Moscow State Medical University of the Russian Ministry of Health (Sechenov University), Moscow, 119991 Russia
  - Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
  - Agrarian and Technological Institute, Peoples’ Friendship University of Russia, 117198 Moscow, Russia
- Nina G. Orlova
  - Department of Mathematics, Financial University under the Government of the Russian Federation, Moscow, 125167 Russia
2
Winkler J, Urgese G, Ficarra E, Reinert K. LaRA 2: parallel and vectorized program for sequence-structure alignment of RNA sequences. BMC Bioinformatics 2022; 23:18. [PMID: 34991448 PMCID: PMC8734264 DOI: 10.1186/s12859-021-04532-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/08/2020] [Accepted: 12/13/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND The function of non-coding RNA sequences is largely determined by their spatial conformation, namely the secondary structure of the molecule, formed by Watson-Crick interactions between nucleotides. Hence, modern RNA alignment algorithms routinely take structural information into account. In order to discover yet unknown RNA families and infer their possible functions, the structural alignment of RNAs is an essential task. This task demands a lot of computational resources, especially for aligning many long sequences, and it therefore requires efficient algorithms that utilize modern hardware when available. A subset of the secondary structures contains overlapping interactions (called pseudoknots), which add additional complexity to the problem and are often ignored in available software. RESULTS We present the SeqAn-based software LaRA 2 that is significantly faster than comparable software for accurate pairwise and multiple alignments of structured RNA sequences. In contrast to other programs our approach can handle arbitrary pseudoknots. As an improved re-implementation of the LaRA tool for structural alignments, LaRA 2 uses multi-threading and vectorization for parallel execution and a new heuristic for computing a lower boundary of the solution. Our algorithmic improvements yield a program that is up to 130 times faster than the previous version. CONCLUSIONS With LaRA 2 we provide a tool to analyse large sets of RNA secondary structures in relatively short time, based on structural alignment. The produced alignments can be used to derive structural motifs for the search in genomic databases.
Affiliation(s)
- Jörg Winkler
  - Department of Mathematics and Computer Science, Free University Berlin, Takustraße 9, 14195 Berlin, Germany
  - Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, 14195 Berlin, Germany
- Gianvito Urgese
  - Interuniversity Department of Regional and Urban Studies and Planning, Politecnico di Torino, C.so Duca degli Abruzzi 24, 10129 Turin, Italy
- Elisa Ficarra
  - Department of Control and Computer Science, Politecnico di Torino, C.so Duca degli Abruzzi 24, 10129 Turin, Italy
- Knut Reinert
  - Department of Mathematics and Computer Science, Free University Berlin, Takustraße 9, 14195 Berlin, Germany
  - Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, 14195 Berlin, Germany
3
Liu Y, Zhang X, Zou Q, Zeng X. Minirmd: accurate and fast duplicate removal tool for short reads via multiple minimizers. Bioinformatics 2021; 37:1604-1606. [PMID: 33112385 DOI: 10.1093/bioinformatics/btaa915] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/24/2020] [Revised: 09/30/2020] [Accepted: 10/14/2020] [Indexed: 12/21/2022]
Abstract
SUMMARY Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources required by downstream applications. Here we develop minirmd, a de novo tool to remove duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand. AVAILABILITY AND IMPLEMENTATION https://github.com/yuansliu/minirmd. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
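The general idea of grouping reads by a minimizer can be sketched as follows. This is a simplified illustration and not minirmd's implementation: it assumes one minimizer per read, defined here as the lexicographically smallest k-mer over both strands, so that a read and its reverse complement fall into the same bucket. The names `minimizer` and `bucket_by_minimizer` are hypothetical.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (assumes only A, C, G, T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def minimizer(seq: str, k: int) -> str:
    """Lexicographically smallest k-mer over both strands of seq.

    Raises ValueError if seq is shorter than k (no k-mers exist).
    """
    rc = reverse_complement(seq)
    kmers = [s[i:i + k] for s in (seq, rc) for i in range(len(s) - k + 1)]
    return min(kmers)


def bucket_by_minimizer(reads: list[str], k: int) -> dict[str, list[str]]:
    """Group reads that share the same strand-independent minimizer."""
    buckets: dict[str, list[str]] = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


if __name__ == "__main__":
    # "AACGT" is the reverse complement of "ACGTT".
    print(bucket_by_minimizer(["ACGTT", "AACGT", "GGGGG"], k=3))
```

Only reads within the same bucket then need pairwise comparison. Repeating the bucketing with a different `k` gives near-duplicates that were separated in one round another chance to meet, which is the motivation the abstract gives for using several minimizer lengths.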
Affiliation(s)
- Yuansheng Liu
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410012, China
- Xiaocai Zhang
  - Advanced Analytics Institute, University of Technology Sydney, Broadway, NSW 2007, Australia
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Xiangxiang Zeng
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan 410012, China
4
Jeong J, Park SJ, Kim JW, No JS, Jeon HH, Lee JW, No A, Kim S, Park H. Cooperative Sequence Clustering and Decoding for DNA Storage System with Fountain Codes. Bioinformatics 2021; 37:3136-3143. [PMID: 33904574 DOI: 10.1093/bioinformatics/btab246] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 09/18/2020] [Revised: 03/03/2021] [Accepted: 04/13/2021] [Indexed: 11/12/2022]
Abstract
MOTIVATION In DNA storage systems, there are tradeoffs between writing and reading costs. Increasing the code rate of error-correcting codes may save writing cost, but it will need more sequence reads for data retrieval. There is potentially a way to improve sequencing and decoding processes such that the reading cost induced by this tradeoff is reduced without increasing the writing cost. In past research, clustering, alignment, and decoding processes were considered as separate stages, but we believe that using the information from all these processes together may improve decoding performance. Actual experiments of DNA synthesis and sequencing should be performed because simulations cannot be relied on to cover all error possibilities in practical circumstances. RESULTS For DNA storage systems using fountain code and Reed-Solomon (RS) code, we introduce several techniques to improve the decoding performance. We designed the decoding process focusing on the cooperation of key components: Hamming-distance based clustering, discarding of abnormal sequence reads, RS error correction as well as detection, and quality score-based ordering of sequences. We synthesized 513.6 KB of data into DNA oligo pools and sequenced this data successfully with an Illumina MiSeq instrument. Compared to Erlich's research, the proposed decoding method additionally incorporates sequence reads with minor errors which had been discarded before, and thus was able to make use of 10.6-11.9% more sequence reads from the same sequencing environment; this resulted in a 6.5-8.9% reduction in the reading cost. Channel characteristics including sequence coverage and read-length distributions are provided as well. AVAILABILITY The raw data files and the source codes of our experiments are available at: https://github.com/jhjeong0702/dna-storage.
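Hamming-distance based clustering, one of the components named in this abstract, can be illustrated with a simple greedy scheme. The sketch below is an assumption-laden example and not the authors' procedure: it compares each read only with the first member of each existing cluster, and the names `greedy_cluster` and `max_dist` are hypothetical.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(reads: list[str], max_dist: int) -> list[list[str]]:
    """Assign each read to the first cluster whose representative is close.

    The representative is the first read placed in a cluster; a read that
    is farther than max_dist from every representative starts a new cluster.
    """
    clusters: list[list[str]] = []
    for read in reads:
        for cluster in clusters:
            if hamming_distance(read, cluster[0]) <= max_dist:
                cluster.append(read)
                break
        else:
            # No existing representative was close enough.
            clusters.append([read])
    return clusters


if __name__ == "__main__":
    print(greedy_cluster(["ACGT", "ACGA", "TTTT", "TTTA"], max_dist=1))
```

Because Hamming distance is defined only for equal-length strings, reads of abnormal length would have to be filtered out or handled separately before a step like this.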
Affiliation(s)
- Jaeho Jeong
  - Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea
- Seong-Joon Park
  - Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea
- Jae-Won Kim
  - Department of Electronic Engineering, Gyeongsang National University, Jinju, Korea
- Jong-Seon No
  - Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea
- Ha Hyeon Jeon
  - Department of Chemical Engineering, POSTECH, Pohang, Korea
- Jeong Wook Lee
  - Department of Chemical Engineering, POSTECH, Pohang, Korea
- Albert No
  - Department of Electronic and Electrical Engineering, Hongik University, Seoul, Korea
- Sunghwan Kim
  - School of Electrical Engineering, University of Ulsan, Ulsan, Korea
- Hosung Park
  - Department of Computer Engineering and Department of ICT Convergence System Engineering, Chonnam National University, Gwangju, Korea
5
Zhou Z, Gu G, Luo Y, Li W, Li B, Zhao Y, Liu J, Shuai X, Wu L, Chen J, Fan C, Huang Q, Han B, Wen J, Jiao H. Immunological pathways of macrophage response to Brucella ovis infection. Innate Immun 2020; 26:635-648. [PMID: 32970502 PMCID: PMC7556187 DOI: 10.1177/1753425920958179] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 01/22/2023]
Abstract
As the molecular mechanisms of Brucella ovis pathogenicity are not completely clear, we applied a transcriptome approach to identify the differentially expressed genes (DEGs) in RAW264.7 macrophages infected with B. ovis. The DEGs related to immune pathways were identified by Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) functional enrichment analysis. Quantitative real-time PCR (qRT-PCR) was performed to validate the transcriptome sequencing data. In total, we identified 337 up-regulated and 264 down-regulated DEGs in the B. ovis-infected group versus the mock group. The top 20 pathways were enriched by KEGG analysis and 20 GO terms by functional enrichment analysis of DEGs involved in molecular function, cellular component, and biological process, which revealed multiple immunological pathways in RAW264.7 macrophage cells in response to B. ovis infection, including inflammatory response, immune system process, immune response, cytokine activity, chemotaxis, chemokine-mediated signaling pathway, chemokine activity, and CCR chemokine receptor binding. qRT-PCR results showed Ccl2 (ENSMUST00000000193), Ccl2 (ENSMUST00000124479), Ccl3 (ENSMUST00000001008), Hmox1 (ENSMUST00000005548), Hmox1 (ENSMUST00000159631), Cxcl2 (ENSMUST00000075433), Cxcl2 (ENSMUST00000200681), Cxcl2 (ENSMUST00000200919), and Cxcl2 (ENSMUST00000202317). Our findings are the first to elucidate the pathways involved in the B. ovis-induced host immune response, which may lay the foundation for revealing the bacteria–host interaction and demonstrating the pathogenic mechanism of B. ovis.
Affiliation(s)
- Zhixiong Zhou
  - College of Veterinary Medicine, Southwest University, Chongqing, China
- Guojing Gu
  - College of Veterinary Medicine, Southwest University, Chongqing, China
- Yichen Luo
  - Immunology Research Center, Medical Research Institute, Southwest University, Chongqing, China
  - College of Veterinary Medicine, Southwest University, Chongqing, China
  - Veterinary Scientific Engineering Research Center, Chongqing, China
- Wenjie Li
  - College of Veterinary Medicine, Southwest University, Chongqing, China
- Bowen Li
  - College of Veterinary Medicine, Southwest University, Chongqing, China
- Yu Zhao
  - College of Veterinary Medicine, Southwest University, Chongqing, China
- Juan Liu
  - Immunology Research Center, Medical Research Institute, Southwest University, Chongqing, China
  - College of Veterinary Medicine, Southwest University, Chongqing, China
  - Veterinary Scientific Engineering Research Center, Chongqing, China
- Xuehong Shuai
  - Immunology Research Center, Medical Research Institute, Southwest University, Chongqing, China
  - College of Veterinary Medicine, Southwest University, Chongqing, China
  - Veterinary Scientific Engineering Research Center, Chongqing, China
- Li Wu
  - College of Veterinary Medicine, Southwest University, Chongqing, China
  - Veterinary Scientific Engineering Research Center, Chongqing, China
- Jixuan Chen
  - College of Veterinary Medicine, Southwest University, Chongqing, China
  - Veterinary Scientific Engineering Research Center, Chongqing, China
- Cailiang Fan
  - College of Veterinary Medicine, Southwest University, Chongqing, China
  - Animal Disease Prevention and Control Center of Rongchang, Chongqing, China
- Qingzhou Huang
  - College of Veterinary Medicine, Southwest University, Chongqing, China
  - Veterinary Scientific Engineering Research Center, Chongqing, China
- Baoru Han
  - College of Medical Informatics, Chongqing Medical University, Chongqing, China
- Jianjun Wen
  - Department of Microbiology and Immunology, University of Texas Medical Branch at Galveston, Galveston, USA
- Hanwei Jiao
  - Immunology Research Center, Medical Research Institute, Southwest University, Chongqing, China
  - College of Veterinary Medicine, Southwest University, Chongqing, China
  - Veterinary Scientific Engineering Research Center, Chongqing, China