1
Li M, Guo X, Zhao J. VirDiG: a de novo transcriptome assembler for coronavirus. Bioinformatics Advances 2025; 5:vbaf075. [PMID: 40291015 PMCID: PMC12034387 DOI: 10.1093/bioadv/vbaf075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2024] [Revised: 02/18/2025] [Accepted: 04/07/2025] [Indexed: 04/30/2025]
Abstract
Motivation: The discontinuous transcription mechanism of coronaviruses contributes to their adaptation to different host environments and plays a critical role in their lifecycle. Accurate assembly of coronavirus transcripts is vital for understanding the virus's biological traits and developing precise prevention and treatment strategies. However, existing de novo assembly algorithms are primarily designed for alternative splicing events in eukaryotes and are not suitable for assembling the coronavirus transcriptome, which consists of both genomic RNA and subgenomic mRNAs. Coronavirus transcriptome reconstruction from short reads remains a challenging problem.
Results: In this work, we present VirDiG, a de novo transcriptome assembler specifically designed for coronaviruses. VirDiG utilizes a discontinuous graph to facilitate accurate transcript assembly by incorporating information from paired-end reads, sequencing depth, and start and stop codons. Experimental results from both simulated and real datasets show that VirDiG exhibits significant advantages in reconstructing the transcriptome of coronaviruses when compared to traditional de novo assemblers tailored for classical eukaryotic transcriptome assembly.
Availability and implementation: VirDiG is freely available at https://github.com/Limh616/VirDiG.git.
Affiliation(s)
- Minghao Li
  - School of Computer Science and Technology, Qingdao University, Shandong 266071, China
- Xiaoyu Guo
  - School of Computer Science and Technology, Qingdao University, Shandong 266071, China
- Jin Zhao
  - School of Computer Science and Technology, Qingdao University, Shandong 266071, China

2
Gonzalez SA, Rivarola M, Ribone A, Lew S, Paniego N. Comprehensive Analysis of the Influence of Technical and Biological Variations on De Novo Assembly of RNA-Seq Datasets. Bioinform Biol Insights 2024; 18:11779322241274957. [PMID: 39649541 PMCID: PMC11622296 DOI: 10.1177/11779322241274957] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Accepted: 07/25/2024] [Indexed: 12/11/2024] Open
Abstract
De novo assembly of transcriptomes from species without a reference genome remains a common problem in functional genomics. While methods and algorithms for transcriptome assembly are continually being developed and published, the quality of de novo assemblies using short reads depends on the complexity of the transcriptome and is limited by several types of errors. One problem to overcome is the research gap regarding the best method to use in each study to obtain high-quality de novo assembly. Currently, there are no established protocols for solving the assembly problem considering the transcriptome complexity. In addition, the accuracy of quality metrics used to evaluate assemblies remains unclear. In this study, we investigate and discuss how different variables accounting for the complexity of RNA-Seq data influence assembly results independently of the software used. For this purpose, we simulated transcriptomic short-read sequence datasets from high-quality full-length predicted transcript models with varying degrees of complexity. Subsequently, we conducted de novo assemblies using different assembly programs, and compared and classified the results using both reference-dependent and independent metrics. These metrics were assessed both individually and combined through multivariate analysis. The degree of alternative splicing and the fragment size of the paired-end reads were identified as the variables with the greatest influence on the assembly results. Moreover, read length and fragment size had different influences on the reconstruction of longer and shorter transcripts. These results underscore the importance of understanding the composition of the transcriptome under study, and making experimental design decisions related to the need to work with reads and fragments of different sizes. In addition, the choice of assembly software will positively impact the final assembly outcome. This selection will affect the completeness of represented genes and assembled isoforms, as well as contribute to error reduction.
Affiliation(s)
- Sergio Alberto Gonzalez
  - Instituto de Agrobiotecnología y Biología Molecular (IABIMO), CICVyA, Instituto Nacional de Tecnología Agropecuaria (INTA), Buenos Aires, Argentina
- Maximo Rivarola
  - Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina
- Andres Ribone
  - Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina
- Sergio Lew
  - Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina
  - Instituto de Ingeniería Biomédica, Facultad de Ingeniería, Universidad de Buenos Aires, Buenos Aires, Argentina
- Norma Paniego
  - Instituto de Agrobiotecnología y Biología Molecular (IABIMO), CICVyA, Instituto Nacional de Tecnología Agropecuaria (INTA), Buenos Aires, Argentina
  - Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina

3
Ellsworth SA, Rautsaw RM, Ward MJ, Holding ML, Rokyta DR. Selection Across the Three-Dimensional Structure of Venom Proteins from North American Scolopendromorph Centipedes. J Mol Evol 2024:10.1007/s00239-024-10191-y. [PMID: 39026042 DOI: 10.1007/s00239-024-10191-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2024] [Accepted: 07/09/2024] [Indexed: 07/20/2024]
Abstract
Gene duplication followed by nucleotide differentiation is one of the simplest mechanisms to develop new functions for genes. However, the evolutionary processes underlying the divergence of multigene families remain controversial. We used multigene families found within the diversity of toxic proteins in centipede venom to test two hypotheses related to venom evolution: the two-speed mode of venom evolution and the rapid accumulation of variation in exposed residues (RAVER) model. The two-speed mode of venom evolution proposes that different types of selection impact ancient and younger venomous lineages with negative selection being the predominant form in ancient lineages and positive selection being the dominant form in younger lineages. The RAVER hypothesis proposes that, instead of different types of selection acting on different ages of venomous lineages, the different types of selection will selectively contribute to amino acid variation based on whether the residue is exposed to the solvent where it can potentially interact directly with toxin targets. This hypothesis parallels the longstanding understanding of protein evolution that suggests that residues found within the structural or active regions of the protein will be under negative or purifying selection, and residues that do not form part of these areas will be more prone to positive selection. To test these two hypotheses, we compared the venom of 26 centipedes from the order Scolopendromorpha from six currently recognized species from across North America using both transcriptomics and proteomics. We first estimated their phylogenetic relationships and uncovered paraphyly among the genus Scolopendra and evidence for cryptic diversity among currently recognized species. Using our phylogeny, we then characterized the diverse venom components from across the identified clades using a combination of transcriptomics and proteomics. 
We conducted selection-based analyses in the context of predicted three-dimensional properties of the venom proteins and found support for both hypotheses. Consistent with the two-speed hypothesis, we found a prevalence of negative selection across all proteins. Consistent with the RAVER hypothesis, we found evidence of positive selection on solvent-exposed residues, with structural and less-exposed residues showing stronger signal for negative selection. Through the use of phylogenetics, transcriptomics, proteomics, and selection-based analyses, we were able to describe the evolution of venom from an ancient venomous lineage and support principles of protein evolution that directly relate to multigene family evolution.
Affiliation(s)
- Schyler A Ellsworth
  - Department of Biological Science, Florida State University, Tallahassee, FL, 32306, USA
- Rhett M Rautsaw
  - Department of Integrative Biology, University of South Florida, Tampa, FL, 33620, USA
  - School of Biological Sciences, Washington State University, Pullman, WA, 99164, USA
- Micaiah J Ward
  - Department of Biological Science, Florida State University, Tallahassee, FL, 32306, USA
- Matthew L Holding
  - Department of Biological Science, Florida State University, Tallahassee, FL, 32306, USA
  - Life Sciences Institute, University of Michigan, Ann Arbor, MI, 48109, USA
  - Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, 48109, USA
- Darin R Rokyta
  - Department of Biological Science, Florida State University, Tallahassee, FL, 32306, USA

4
Lim PK, Wang R, Mutwil M. LSTrAP-denovo: Automated Generation of Transcriptome Atlases for Eukaryotic Species Without Genomes. Physiologia Plantarum 2024; 176:e14407. [PMID: 38973613 DOI: 10.1111/ppl.14407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2024] [Accepted: 05/28/2024] [Indexed: 07/09/2024]
Abstract
Despite the abundance of species with transcriptomic data, a significant number of species still lack sequenced genomes, making it difficult to study gene function and expression in these organisms. While de novo transcriptome assembly can be used to assemble protein-coding transcripts from RNA-sequencing (RNA-seq) data, the datasets used often only feature samples of arbitrarily selected or similar experimental conditions, which might fail to capture condition-specific transcripts. We developed the Large-Scale Transcriptome Assembly Pipeline for de novo assembled transcripts (LSTrAP-denovo) to automatically generate transcriptome atlases of eukaryotic species. Specifically, given an NCBI TaxID, LSTrAP-denovo can (1) filter undesirable RNA-seq accessions based on read data, (2) select RNA-seq accessions via unsupervised machine learning to construct a sample-balanced dataset for download, (3) assemble transcripts via over-assembly, (4) functionally annotate coding sequences (CDS) from assembled transcripts and (5) generate transcriptome atlases in the form of expression matrices for downstream transcriptomic analyses. LSTrAP-denovo is easy to implement, written in Python, and is freely available at https://github.com/pengkenlim/LSTrAP-denovo/.
Affiliation(s)
- Peng Ken Lim
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Ruoxi Wang
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Marek Mutwil
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore

5
Westrin KJ, Kretzschmar WW, Emanuelsson O. ClusTrast: a short read de novo transcript isoform assembler guided by clustered contigs. BMC Bioinformatics 2024; 25:54. [PMID: 38302873 PMCID: PMC10836024 DOI: 10.1186/s12859-024-05663-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Accepted: 01/18/2024] [Indexed: 02/03/2024] Open
Abstract
BACKGROUND: Transcriptome assembly from RNA-sequencing data in species without a reliable reference genome has to be performed de novo, but studies have shown that de novo methods often have inadequate ability to reconstruct transcript isoforms. We address this issue by constructing an assembly pipeline whose main purpose is to produce a comprehensive set of transcript isoforms.
RESULTS: We present the de novo transcript isoform assembler ClusTrast, which takes short read RNA-seq data as input, assembles a primary assembly, clusters a set of guiding contigs, aligns the short reads to the guiding contigs, assembles each clustered set of short reads individually, and merges the primary and clusterwise assemblies into the final assembly. We tested ClusTrast on real datasets from six eukaryotic species, and showed that ClusTrast reconstructed more expressed known isoforms than any of the other tested de novo assemblers, at a moderate reduction in precision. For recall, ClusTrast was on top in the lower end of expression levels (below the 15th percentile) for all tested datasets, and over the entire range for almost all datasets. Reference transcripts were often (35-69% for the six datasets) reconstructed to at least 95% of their length by ClusTrast, and more than half of reference transcripts (58-81%) were reconstructed with contigs that exhibited polymorphism, as measured on a subset of reliably predicted contigs. ClusTrast recall increased when using a union of assembled transcripts from more than one assembly tool as primary assembly.
CONCLUSION: We suggest that ClusTrast can be a useful tool for studying isoforms in species without a reliable reference genome, in particular when the goal is to produce a comprehensive transcriptome set with polymorphic variants.
Affiliation(s)
- Karl Johan Westrin
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, 171 65, Solna, Sweden
- Warren W Kretzschmar
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, 171 65, Solna, Sweden
  - Department of Medicine Huddinge, Center for Hematology and Regenerative Medicine (HERM), Karolinska Institute, 141 52, Flemingsberg, Sweden
- Olof Emanuelsson
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, 171 65, Solna, Sweden

6
Lee J, Kim M, Han K, Yoon S. StringFix: an annotation-guided transcriptome assembler improves the recovery of amino acid sequences from RNA-Seq reads. Genes Genomics 2023; 45:1599-1609. [PMID: 37837515 DOI: 10.1007/s13258-023-01458-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2023] [Accepted: 10/01/2023] [Indexed: 10/16/2023]
Abstract
BACKGROUND: Reconstruction of amino acid sequences from an assembled transcriptome is of interest in personalized medicine, for example, to predict drug-target (or protein-protein) interactions while considering an individual's genomic variations. Most existing transcriptome assemblers, however, seem not well suited for this purpose.
METHODS: In this work, we present StringFix, an annotation-guided transcriptome assembly and protein sequence reconstruction software tool that takes genome-aligned reads and the annotations associated with the reference genome as input. The tool 'fixes' the pre-annotated transcript sequence by taking small variations into account, and finally produces possible amino acid sequences that are likely to exist in the test tissue.
RESULTS: The results show that, using outputs from existing reference-based assemblers as the input GTF guide, StringFix could reconstruct amino acid sequences more precisely and with higher sensitivity than direct generation using the recovered transcripts from all the assemblers we tested.
CONCLUSION: By using StringFix with the existing reference-based assemblers, one can recover not only novel transcripts and isoforms but also the possible amino acid sequences stemming from them.
Affiliation(s)
- Joongho Lee
  - Dept. of Computer Science, College of SW Convergence, Dankook Univ, Yongin-si, 16890, Korea
- Minsoo Kim
  - Dept. of Computer Science, College of SW Convergence, Dankook Univ, Yongin-si, 16890, Korea
- Kyudong Han
  - Center for Bio-Medical Engineering Core Facility, Dankook Univ, Cheonan, 31116, Korea
  - Dept. of Microbiology, College of Science & Technology, Dankook Univ, Cheonan, 31116, Korea
  - HuNbiome Co., Ltd, R&D Center, Seoul, 08503, Korea
- Seokhyun Yoon
  - Dept. of Electronics and Electrical Engineering, College of Engineering, Dankook Univ, Yongin-si, 16890, Korea

7
Tao S, Hou Y, Diao L, Hu Y, Xu W, Xie S, Xiao Z. Long noncoding RNA study: Genome-wide approaches. Genes Dis 2023; 10:2491-2510. [PMID: 37554208 PMCID: PMC10404890 DOI: 10.1016/j.gendis.2022.10.024] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/11/2022] [Revised: 10/09/2022] [Accepted: 10/23/2022] [Indexed: 11/30/2022] Open
Abstract
Long noncoding RNAs (lncRNAs) have been confirmed to play a crucial role in various biological processes across several species. Though many efforts have been devoted to the expansion of the lncRNAs landscape, much about lncRNAs is still unknown due to their great complexity. The development of high-throughput technologies and the constantly improved bioinformatic methods have resulted in a rapid expansion of lncRNA research and relevant databases. In this review, we introduced genome-wide research of lncRNAs in three parts: (i) novel lncRNA identification by high-throughput sequencing and computational pipelines; (ii) functional characterization of lncRNAs by expression atlas profiling, genome-scale screening, and the research of cancer-related lncRNAs; (iii) mechanism research by large-scale experimental technologies and computational analysis. Besides, primary experimental methods and bioinformatic pipelines related to these three parts are summarized. This review aimed to provide a comprehensive and systemic overview of lncRNA genome-wide research strategies and indicate a genome-wide lncRNA research system.
Affiliation(s)
- Shuang Tao
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
- Yarui Hou
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
- Liting Diao
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
- Yanxia Hu
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
- Wanyi Xu
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
- Shujuan Xie
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
  - Institute of Vaccine, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China
- Zhendong Xiao
  - The Biotherapy Center, The Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong 510630, China

8
Ahmadi H, Sheikh-Assadi M, Fatahi R, Zamani Z, Shokrpour M. Optimizing an efficient ensemble approach for high-quality de novo transcriptome assembly of Thymus daenensis. Sci Rep 2023; 13:12415. [PMID: 37524806 PMCID: PMC10390528 DOI: 10.1038/s41598-023-39620-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/11/2023] [Accepted: 07/27/2023] [Indexed: 08/02/2023] Open
Abstract
Non-erroneous and well-optimized transcriptome assembly is a crucial prerequisite for authentic downstream analyses. Each de novo assembler has its own algorithm-dependent pros and cons in handling assembly issues and should be specifically tested for each dataset. Here, we examined the efficiency of seven state-of-the-art assemblers on ~30 Gb of data obtained from mRNA sequencing of Thymus daenensis. In an ensemble workflow, combining the outputs of different assemblers with an additional redundancy-reducing step could generate an optimized outcome in terms of completeness, annotatability, and ORF richness. Based on the normalized scores of 16 benchmarking metrics, the assemblers ranked, from best to worst, as EvidentialGene, BinPacker, Trinity, rnaSPAdes, CAP3, IDBA-trans, and Velvet-Oases. EvidentialGene, as the best assembler, produced a total of 316,786 transcripts, of which 235,730 (74%) were predicted to have a unique protein hit (on uniref100), and half of its transcripts contained an ORF. The total number of unique BLAST hits for EvidentialGene was approximately three times greater than that of the worst assembler (Velvet-Oases). EvidentialGene could even capture 17% and 7% more average BLAST hits than BinPacker and Trinity, respectively. Although BinPacker and CAP3 produced longer transcripts, EvidentialGene showed a higher collinearity between transcript size and ORF length. Compared with the other programs, EvidentialGene yielded a higher number of optimal transcript sets, more full-length transcripts, and fewer possible misassemblies. Our findings corroborate that in non-model species, relying on a single assembler may not give an entirely satisfactory result. Therefore, this study proposes an ensemble approach incorporating the EvidentialGene pipeline to acquire a superior assembly for T. daenensis.
Affiliation(s)
- Hosein Ahmadi
  - Department of Horticulture Science, Faculty of Agriculture and Natural Sciences, University of Tehran, Karaj, Iran
- Morteza Sheikh-Assadi
  - Department of Horticulture Science, Faculty of Agriculture and Natural Sciences, University of Tehran, Karaj, Iran
- Reza Fatahi
  - Department of Horticulture Science, Faculty of Agriculture and Natural Sciences, University of Tehran, Karaj, Iran
- Zabihollah Zamani
  - Department of Horticulture Science, Faculty of Agriculture and Natural Sciences, University of Tehran, Karaj, Iran
- Majid Shokrpour
  - Department of Horticulture Science, Faculty of Agriculture and Natural Sciences, University of Tehran, Karaj, Iran

9
Tu M, Zeng J, Zhang J, Fan G, Song G. Unleashing the power within short-read RNA-seq for plant research: Beyond differential expression analysis and toward regulomics. Frontiers in Plant Science 2022; 13:1038109. [PMID: 36570898 PMCID: PMC9773216 DOI: 10.3389/fpls.2022.1038109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2022] [Accepted: 11/21/2022] [Indexed: 06/17/2023]
Abstract
RNA-seq has become a state-of-the-art technique for transcriptomic studies. Advances in both RNA-seq techniques and the corresponding analysis tools and pipelines have shaped our understanding of almost every aspect of plant science to an unprecedented degree. Notably, the integration of huge amounts of RNA-seq data with other omic data sets in the model plants and major crop species has facilitated plant regulomics, while RNA-seq analysis is still primarily used for differential expression analysis in many less-studied plant species. To unleash the analytical power of RNA-seq in plant species, especially less-studied species and biomass crops, we summarize recent achievements of RNA-seq analysis in the major plant species and representative tools in four types of application: (1) transcriptome assembly, (2) construction of expression atlas, (3) network analysis, and (4) structural alteration. We emphasize the importance of expression atlases, coexpression networks and predictions of gene regulatory relationships in moving plant transcriptomes toward regulomics, an omic view of genome-wide transcription regulation. We highlight what can be achieved in plant research with RNA-seq by introducing a list of representative RNA-seq analysis tools and resources that are developed for certain minor species or suitable for analysis without species limitation. In summary, we provide an updated digest of RNA-seq tools, resources and the diverse applications for plant research, and our perspective on the power and challenges of short-read RNA-seq analysis from a regulomic point of view. Full utilization of these fruitful RNA-seq resources will promote plant omic research to a higher level, especially in less-studied species.
Affiliation(s)
- Min Tu
  - School of Chemical and Environmental Engineering, Wuhan Polytechnic University, Wuhan, China
- Jian Zeng
  - Guangdong Provincial Key Laboratory of Utilization and Conservation of Food and Medicinal Resources in Northern Region, Shaoguan University, Shaoguan, Guangdong, China
- Juntao Zhang
  - School of Chemical and Environmental Engineering, Wuhan Polytechnic University, Wuhan, China
- Guozhi Fan
  - School of Chemical and Environmental Engineering, Wuhan Polytechnic University, Wuhan, China
- Guangsen Song
  - School of Chemical and Environmental Engineering, Wuhan Polytechnic University, Wuhan, China

10
Sheikh-Assadi M, Naderi R, Salami SA, Kafi M, Fatahi R, Shariati V, Martinelli F, Cicatelli A, Triassi M, Guarino F, Improta G, Claros MG. Normalized Workflow to Optimize Hybrid De Novo Transcriptome Assembly for Non-Model Species: A Case Study in Lilium ledebourii (Baker) Boiss. Plants 2022; 11:plants11182365. [PMID: 36145766 PMCID: PMC9503428 DOI: 10.3390/plants11182365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2022] [Revised: 08/21/2022] [Accepted: 09/07/2022] [Indexed: 11/16/2022]
Abstract
A high-quality transcriptome is required to advance numerous bioinformatics workflows. Nevertheless, the effectiveness of de novo assembly tools and the actual precision of assembled transcriptomes remain somewhat unexplored, particularly for non-model organisms with complicated (very long, heterozygous, polyploid) genomes. To assess the performance of various transcriptome assembly programs, this study built 11 single assemblies and analyzed their performance on a set of reference-free and reference-based criteria. In addition, to reconfirm the benchmark outputs, 55 BLAST searches were performed and compared using the 11 constructed transcriptomes. In brief, normalized benchmarking demonstrated that Velvet–Oases gave the worst results, while the EvidentialGene strategy can provide the most comprehensive and accurate transcriptome of Lilium ledebourii (Baker) Boiss. The BLAST results also confirmed the superiority of EvidentialGene, which captured up to 59% more unique gene hits than Velvet–Oases. To promote assembly optimization, normalized benchmarking, PCA and AHC were used to emphasize that each metric can only describe part of the transcriptome status, and that one should never settle for just a few evaluation criteria. This study supplies a framework for benchmarking and optimizing the efficiency of assembly approaches for RNA-Seq data analysis and reveals that selecting an inefficient assembly strategy might result in the identification of fewer unique gene hits.
Affiliation(s)
- Morteza Sheikh-Assadi
  - Department of Horticultural Science, Faculty of Agricultural Science and Engineering, University of Tehran, Karaj 31587-77871, Iran
  - Correspondence: (M.S.-A.); (R.N.)
- Roohangiz Naderi
  - Department of Horticultural Science, Faculty of Agricultural Science and Engineering, University of Tehran, Karaj 31587-77871, Iran
  - Correspondence: (M.S.-A.); (R.N.)
- Seyed Alireza Salami
  - Department of Horticultural Science, Faculty of Agricultural Science and Engineering, University of Tehran, Karaj 31587-77871, Iran
- Mohsen Kafi
  - Department of Horticultural Science, Faculty of Agricultural Science and Engineering, University of Tehran, Karaj 31587-77871, Iran
- Reza Fatahi
  - Department of Horticultural Science, Faculty of Agricultural Science and Engineering, University of Tehran, Karaj 31587-77871, Iran
- Vahid Shariati
  - NIGEB Genome Center, National Institute of Genetic Engineering and Biotechnology, Tehran 14965/161, Iran
- Federico Martinelli
  - Department of Biology, University of Florence, 50019 Sesto Fiorentino, Italy
- Angela Cicatelli
  - Department of Chemistry and Biology “A. Zambelli”, University of Salerno, 84084 Fisciano, Italy
- Maria Triassi
  - Department of Public Health, University of Naples “Federico II”, 80131 Naples, Italy
- Francesco Guarino
  - Department of Chemistry and Biology “A. Zambelli”, University of Salerno, 84084 Fisciano, Italy
- Giovanni Improta
  - Department of Public Health, University of Naples “Federico II”, 80131 Naples, Italy
- Manuel Gonzalo Claros
  - Molecular Biology and Biochemistry Department, University of Málaga, 29071 Málaga, Spain
  - CIBER de Enfermedades Raras (CIBERER), 29071 Málaga, Spain
  - Institute of Biomedical Research in Málaga (IBIMA), IBIMA-RARE, 29010 Málaga, Spain
  - Instituto de Hortofruticultura Subtropical y Mediterránea (IHSM-UMA-CSIC), 29010 Málaga, Spain

11
Gill transcriptome of the yellow peacock bass (Cichla ocellaris monoculus) exposed to contrasting physicochemical conditions. Conserv Genet Resour 2022. [DOI: 10.1007/s12686-022-01284-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]

12
Proteotranscriptomics - A facilitator in omics research. Comput Struct Biotechnol J 2022; 20:3667-3675. [PMID: 35891789 PMCID: PMC9293588 DOI: 10.1016/j.csbj.2022.07.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/15/2022] [Revised: 07/04/2022] [Accepted: 07/04/2022] [Indexed: 11/26/2022] Open
Abstract
Applications in omics research, such as comparative transcriptomics and proteomics, require knowledge of the species-specific gene sequence and benefit from a comprehensive high-quality annotation of the coding genes to achieve high coverage. While protein-coding genes can in simple cases be detected by scanning the genome for open reading frames, in more complex genomes exonic sequences are separated by introns. Despite advances in sequencing technologies that allow for ever-growing numbers of genomes, the quality of many of the provided genome assemblies does not reach reference quality. These non-contiguous assemblies with gaps and the necessity to predict splice sites limit accurate gene annotation from solely genomic data. In contrast, the transcriptome only contains transcribed gene regions, is devoid of introns and thus provides the optimal basis for the identification of open reading frames. The additional integration of proteomics data to validate predicted protein-coding genes further enriches for accurate gene models. This review outlines the principles of the proteotranscriptomics approach, discusses common challenges and suggests methods for improvement.
|
13
|
Yu T, Zhao X, Li G. TransMeta simultaneously assembles multisample RNA-seq reads. Genome Res 2022; 32:1398-1407. [PMID: 35858749 PMCID: PMC9341511 DOI: 10.1101/gr.276434.121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2021] [Accepted: 06/03/2022] [Indexed: 11/25/2022]
Abstract
Assembling RNA-seq reads into full-length transcripts is crucial in transcriptomic studies and poses computational challenges. Here we present TransMeta, a simple and robust algorithm that simultaneously assembles RNA-seq reads from multiple samples. TransMeta is designed based on the newly introduced vector-weighted splicing graph model, which enables accurate reconstruction of the consensus transcriptome via incorporating a cosine similarity-based combing strategy and a newly designed label-setting path-searching strategy. Tests on both simulated and real data sets show that TransMeta consistently outperforms PsiCLASS, StringTie2 plus its merge mode, and Scallop plus TACO, the most popular tools, in terms of precision and recall under a wide range of coverage thresholds at the meta-assembly level. Additionally, TransMeta consistently shows superior performance at the individual sample level.
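The abstract does not say how TransMeta applies cosine similarity, so the following is only a sketch of the measure itself. It assumes, for illustration, that each graph element carries a vector with one coverage value per sample; the function name and the example vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length coverage vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

if __name__ == "__main__":
    # Hypothetical per-sample coverages of two edges across three samples.
    edge_a = [10, 20, 30]
    edge_b = [1, 2, 3]   # same expression pattern at lower depth
    edge_c = [30, 0, 1]  # different pattern
    print(cosine_similarity(edge_a, edge_b))
    print(cosine_similarity(edge_a, edge_c))
```

Because the measure ignores vector magnitude, two elements with proportional coverage across samples score close to 1 even when their absolute depths differ.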
Affiliation(s)
- Ting Yu
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
- Xiaoyu Zhao
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Guojun Li
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
  - School of Mathematical Science, Liaocheng University, Liaocheng 252000, China
|
14
|
A thorough annotation of the krill transcriptome offers new insights for the study of physiological processes. Sci Rep 2022; 12:11415. [PMID: 35794144 PMCID: PMC9259678 DOI: 10.1038/s41598-022-15320-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/10/2022] [Accepted: 06/22/2022] [Indexed: 11/09/2022] Open
Abstract
The krill species Euphausia superba plays a critical role in the food chain of the Antarctic ecosystem. Significant changes in climate conditions observed in the Antarctic Peninsula region in the last decades have already altered the distribution of krill and its reproductive dynamics. A deeper understanding of the adaptation capabilities of this species is urgently needed. The availability of a large body of RNA-seq assays allowed us to extend the current knowledge of the krill transcriptome. Our study covered the entire developmental process, providing information of central relevance for ecological studies. Here we identified a series of genes involved in different steps of the krill moulting cycle, in the reproductive process and in sexual maturation, in accordance with what was already described in previous works. Furthermore, the new transcriptome highlighted the presence of previously unknown differentially expressed genes that play important roles in cuticle development as well as in energy storage during the krill life cycle. The discovery of new opsin sequences, specifically rhabdomeric opsins, one onychopsin, and one non-visual arthropsin, expands our knowledge of the krill opsin repertoire. We have collected all these results into the KrillDB2 database, a resource combining the latest annotation of the krill transcriptome with a series of analyses targeting genes relevant to krill physiology. KrillDB2 provides in a single resource a comprehensive catalog of krill genes; an atlas of their expression profiles over all RNA-seq datasets publicly available; and a study of differential expression across multiple conditions. Finally, it provides initial indications about the expression of microRNA precursors, whose contribution to krill physiology has never been reported before.
|
15
|
Zhao X, Yu T. Tiglon enables accurate transcriptome assembly via integrating mappings of different aligners. iScience 2022; 25:104067. [PMID: 35355524 PMCID: PMC8958329 DOI: 10.1016/j.isci.2022.104067] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/24/2021] [Revised: 02/09/2022] [Accepted: 03/10/2022] [Indexed: 11/01/2022] Open
Abstract
Full-length transcript reconstruction has a pivotal role in RNA-seq data analysis. In this research, we present a new genome-guided transcriptome assembly algorithm, namely Tiglon, which integrates multiple alignments of different mapping tools and builds the labeled splice graphs, followed by a label-based dynamic path-searching strategy to reconstruct the transcripts. We evaluate Tiglon on a simulated dataset and 12 real datasets under the Hisat2 and Star mappings. The results indicate that the integrating techniques of Tiglon exhibit great superiority over the state-of-the-art assemblers, including StringTie2 and Scallop, depending on Hisat2 alignments, Star alignments, or the merged alignments of both. Especially, Tiglon is significantly powerful in recovering lowly expressed transcripts.
- Tiglon is designed for integrating multiple alignments to assemble transcripts
- Integrating alignments of different aligners is helpful for transcriptome assembly
- Tiglon proposes a new graph model called the labeled splice graph
- Our experiments demonstrate that Tiglon outperforms the leading assemblers
|
16
|
Raghavan V, Kraft L, Mesny F, Rigerte L. A simple guide to de novo transcriptome assembly and annotation. Brief Bioinform 2022; 23:6514404. [PMID: 35076693 PMCID: PMC8921630 DOI: 10.1093/bib/bbab563] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Received: 09/09/2021] [Revised: 12/03/2021] [Accepted: 12/09/2021] [Indexed: 12/13/2022] Open
Abstract
A transcriptome constructed from short-read RNA sequencing (RNA-seq) is an easily attainable proxy catalog of protein-coding genes when genome assembly is unnecessary, expensive or difficult. In the absence of a sequenced genome to guide the reconstruction process, the transcriptome must be assembled de novo using only the information available in the RNA-seq reads. Subsequently, the sequences must be annotated in order to identify sequence-intrinsic and evolutionary features in them (for example, protein-coding regions). Although straightforward at first glance, de novo transcriptome assembly and annotation can quickly prove to be challenging undertakings. In addition to familiarizing themselves with the conceptual and technical intricacies of the tasks at hand and the numerous pre- and post-processing steps involved, those interested must also grapple with an overwhelmingly large choice of tools. The lack of standardized workflows, fast pace of development of new tools and techniques and paucity of authoritative literature have served to exacerbate the difficulty of the task even further. Here, we present a comprehensive overview of de novo transcriptome assembly and annotation. We discuss the procedures involved, including pre- and post-processing steps, and present a compendium of corresponding tools.
Affiliation(s)
- Venket Raghavan (corresponding author)
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany
- Louis Kraft (corresponding author)
  - Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany
|
17
|
Xie B, Dashevsky D, Rokyta D, Ghezellou P, Fathinia B, Shi Q, Richardson MK, Fry BG. Dynamic genetic differentiation drives the widespread structural and functional convergent evolution of snake venom proteinaceous toxins. BMC Biol 2022; 20:4. [PMID: 34996434 PMCID: PMC8742412 DOI: 10.1186/s12915-021-01208-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/25/2021] [Accepted: 12/06/2021] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND The explosive radiation and diversification of the advanced snakes (superfamily Colubroidea) was associated with changes in all aspects of the shared venom system. Morphological changes included the partitioning of the mixed ancestral glands into two discrete glands devoted to the production of venom or mucus, respectively, as well as changes in the location, size and structural elements of the venom-delivering teeth. Evidence also exists for homology among venom gland toxins expressed across the advanced snakes. However, despite the evolutionary novelty of snake venoms, in-depth toxin molecular evolutionary history reconstructions have been mostly limited to those types present in only two front-fanged snake families, Elapidae and Viperidae. To have a broader understanding of toxins shared among extant snakes, here we first sequenced the transcriptomes of eight taxonomically diverse rear-fanged species and four key viperid species and analysed major toxin types shared across the advanced snakes. RESULTS Transcriptomes were constructed for the following families and species: Colubridae - Helicops leopardinus, Heterodon nasicus, Rhabdophis subminiatus; Homalopsidae - Homalopsis buccata; Lamprophiidae - Malpolon monspessulanus, Psammophis schokari, Psammophis subtaeniatus, Rhamphiophis oxyrhynchus; and Viperidae - Bitis atropos, Pseudocerastes urarachnoides, Tropidolaemus subannulatus, Vipera transcaucasiana. These sequences were combined with those from available databases of other species in order to facilitate a robust reconstruction of the molecular evolutionary history of the key toxin classes present in the venom of the last common ancestor of the advanced snakes, and thus present across the full diversity of colubroid snake venoms. In addition to differential rates of evolution in toxin classes between the snake lineages, these analyses revealed multiple previously unknown instances of structural and functional convergences.
Structural convergences included: the evolution of new cysteines to form heteromeric complexes, such as within kunitz peptides (the beta-bungarotoxin trait evolving on at least two occasions) and within SVMP enzymes (the P-IIId trait evolving on at least three occasions); and the C-terminal tail evolving on two separate occasions within the C-type natriuretic peptides, to create structural and functional analogues of the ANP/BNP tailed condition. Also shown was that the de novo evolution of new post-translationally liberated toxin families within the natriuretic peptide gene propeptide region occurred on at least five occasions, with novel functions ranging from induction of hypotension to post-synaptic neurotoxicity. Functional convergences included the following: multiple occasions of SVMP neofunctionalised in procoagulant venoms into activators of the clotting factors prothrombin and Factor X; multiple instances in procoagulant venoms where kunitz peptides were neofunctionalised into inhibitors of the clot destroying enzyme plasmin, thereby prolonging the half-life of the clots formed by the clotting activating enzymatic toxins; and multiple occasions of kunitz peptides neofunctionalised into neurotoxins acting on presynaptic targets, including twice just within Bungarus venoms. CONCLUSIONS We found novel convergences in both structural and functional evolution of snake toxins. These results provide a detailed roadmap for future work to elucidate predator-prey evolutionary arms races, ascertain differential clinical pathologies, as well as documenting rich biodiscovery resources for lead compounds in the drug design and discovery pipeline.
Affiliation(s)
- Bing Xie
  - Institute of Biology Leiden, Leiden University, 2333BE, Leiden, The Netherlands
- Daniel Dashevsky
  - Venom Evolution Lab, School of Biological Sciences, University of Queensland, St Lucia, 4072 Australia
  - Australian National Insect Collection, Commonwealth Science and Industry Research Organization, ACT, Canberra, 2601 Australia
- Darin Rokyta
  - Department of Biological Science, Florida State University, Tallahassee, FL 24105 USA
- Parviz Ghezellou
  - Medicinal Plants and Drugs Research Institute, Shahid Beheshti University, Tehran, 1983969411 Iran
  - Institute of Inorganic and Analytical Chemistry, Justus Liebig University Giessen, 35392, Giessen, Germany
- Behzad Fathinia
  - Department of Biology, Faculty of Science, Yasouj University, Yasouj, 75914 Iran
- Qiong Shi
  - Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, BGI Academy of Marine Sciences, BGI Marine, BGI, Shenzhen, 518083 China
  - BGI Education Center, University of Chinese Academy of Sciences, Shenzhen, 518083 China
- Bryan G. Fry
  - Venom Evolution Lab, School of Biological Sciences, University of Queensland, St Lucia, 4072 Australia
|
18
|
Zhao J, Feng H, Zhu D, Lin Y. MultiTrans: An Algorithm for Path Extraction Through Mixed Integer Linear Programming for Transcriptome Assembly. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:48-56. [PMID: 34033544 DOI: 10.1109/tcbb.2021.3083277] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/12/2023]
Abstract
Recent advances in RNA-seq technology have made identification of expressed genes affordable, thus boosting the rapid development of transcriptomic studies. Transcriptome assembly, reconstructing all expressed transcripts from RNA-seq reads, is an essential step to understand genes, proteins, and cell functions. Transcriptome assembly remains a challenging problem due to complications in splicing variants, expression levels, uneven coverage and sequencing errors. Here, we formulate the transcriptome assembly problem as path extraction on splicing graphs (or assembly graphs), and propose a novel algorithm MultiTrans for path extraction using mixed integer linear programming. MultiTrans is able to take into consideration coverage constraints on vertices and edges, the number of paths and the paired-end information simultaneously. We benchmarked MultiTrans against two state-of-the-art transcriptome assemblers, TransLiG and rnaSPAdes. Experimental results show that MultiTrans generates more accurate transcripts compared to TransLiG (using the same splicing graphs) and rnaSPAdes (using the same assembly graphs). MultiTrans is freely available at https://github.com/jzbio/MultiTrans.
|
19
|
Karmeinski D, Meusemann K, Goodheart JA, Schroedl M, Martynov A, Korshunova T, Wägele H, Donath A. Transcriptomics provides a robust framework for the relationships of the major clades of cladobranch sea slugs (Mollusca, Gastropoda, Heterobranchia), but fails to resolve the position of the enigmatic genus Embletonia. BMC Ecol Evol 2021; 21:226. [PMID: 34963462 PMCID: PMC8895541 DOI: 10.1186/s12862-021-01944-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/29/2020] [Accepted: 11/23/2021] [Indexed: 11/24/2022] Open
Abstract
Background The soft-bodied cladobranch sea slugs represent roughly half of the biodiversity of marine nudibranch molluscs on the planet. Despite their global distribution from shallow waters to the deep sea, from tropical into polar seas, and their important role in marine ecosystems and for humans (as targets for drug discovery), the evolutionary history of cladobranch sea slugs is not yet fully understood. Results To enlarge the current knowledge on the phylogenetic relationships, we generated new transcriptome data for 19 species of cladobranch sea slugs and two additional outgroup taxa (Berthella plumula and Polycera quadrilineata). We complemented our taxon sampling with previously published transcriptome data, resulting in a final data set covering 56 species from all but one accepted cladobranch superfamilies. We assembled all transcriptomes using six different assemblers, selecting those assemblies that provided the largest amount of potentially phylogenetically informative sites. Quality-driven compilation of data sets resulted in four different supermatrices: two with full coverage of genes per species (446 and 335 single-copy protein-coding genes, respectively) and two with a less stringent coverage (667 genes with 98.9% partition coverage and 1767 genes with 86% partition coverage, respectively). We used these supermatrices to infer statistically robust maximum-likelihood trees. All analyses, irrespective of the data set, indicate maximal statistical support for all major splits and phylogenetic relationships at the family level. Besides the questionable position of Noumeaella rubrofasciata, rendering the Facelinidae as polyphyletic, the only notable discordance between the inferred trees is the position of Embletonia pulchra. Extensive testing using Four-cluster Likelihood Mapping, Approximately Unbiased tests, and Quartet Scores revealed that its position is not due to any informative phylogenetic signal, but caused by confounding signal. 
Conclusions Our data matrices and the inferred trees can serve as a solid foundation for future work on the taxonomy and evolutionary history of Cladobranchia. The placement of E. pulchra, however, proves challenging, even with large data sets and various optimization strategies. Moreover, quartet mapping results show that confounding signal present in the data is sufficient to explain the inferred position of E. pulchra, again leaving its phylogenetic position as an enigma. Supplementary Information The online version contains supplementary material available at 10.1186/s12862-021-01944-0.
Affiliation(s)
- Dario Karmeinski
  - Centre for Molecular Biodiversity Research, Leibniz Institute for the Analysis of Biodiversity Change/ZFMK, Museum Koenig, Adenauerallee 160, 53113, Bonn, Germany
- Karen Meusemann
  - Leibniz Institute for the Analysis of Biodiversity Change/ZFMK, Museum Koenig, Adenauerallee 160, 53113, Bonn, Germany
  - Australian National Insect Collection, National Research Collections Australia, Commonwealth Scientific and Industrial Research Organisation (CSIRO), National Facilities and Collections, Clunies Ross Street, Acton, Canberra, ACT, 2601, Australia
- Jessica A Goodheart
  - Scripps Institution of Oceanography, University of California, La Jolla, San Diego, CA, 92037, USA
- Michael Schroedl
  - SNSB-Bavarian State Collection of Zoology, Münchhausenstr. 21, 81247, Munich, Germany
  - GeoBioCenter LMU und Biozentrum, Ludwig-Maximilians-Universität München, Großhaderner Str. 2, 82152, Planegg-Martinsried, Germany
- Alexander Martynov
  - Zoological Museum of the Moscow State University, Bolshaya Nikitskaya Str. 6, 125009, Moscow, Russia
- Tatiana Korshunova
  - Koltzov Institute of Developmental Biology, Vavilova Str. 26, 119334, Moscow, Russia
- Heike Wägele
  - Centre for Molecular Biodiversity Research, Leibniz Institute for the Analysis of Biodiversity Change/ZFMK, Museum Koenig, Adenauerallee 160, 53113, Bonn, Germany
- Alexander Donath
  - Centre for Molecular Biodiversity Research, Leibniz Institute for the Analysis of Biodiversity Change/ZFMK, Museum Koenig, Adenauerallee 160, 53113, Bonn, Germany
|
20
|
CStone: A de novo transcriptome assembler for short-read data that identifies non-chimeric contigs based on underlying graph structure. PLoS Comput Biol 2021; 17:e1009631. [PMID: 34813594 PMCID: PMC8651127 DOI: 10.1371/journal.pcbi.1009631] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/23/2021] [Revised: 12/07/2021] [Accepted: 11/11/2021] [Indexed: 11/19/2022] Open
Abstract
With the exponential growth of sequence information stored over the last decade, including that of de novo assembled contigs from RNA-Seq experiments, quantification of chimeric sequences has become essential when assembling read data. In transcriptomics, de novo assembled chimeras can closely resemble underlying transcripts, but patterns such as those seen between co-evolving sites, or mapped read counts, become obscured. We have created a de Bruijn based de novo assembler for RNA-Seq data that utilizes a classification system to describe the complexity of underlying graphs from which contigs are created. Each contig is labelled with one of three levels, indicating whether or not ambiguous paths exist. A by-product of this is information on the range of complexity of the underlying gene families present. As a demonstration of CStone's ability to assemble high-quality contigs, and to label them in this manner, both simulated and real data were used. For simulated data, ten million read pairs were generated from cDNA libraries representing four species, Drosophila melanogaster, Panthera pardus, Rattus norvegicus and Serinus canaria. These were assembled using CStone, Trinity and rnaSPAdes; the latter two being high-quality, well established, de novo assemblers. For real data, two RNA-Seq datasets, each consisting of ≈30 million read pairs, representing two adult D. melanogaster whole-body samples were used. The contigs that CStone produced were comparable in quality to those of Trinity and rnaSPAdes in terms of length, sequence identity of aligned regions and the range of cDNA transcripts represented, whilst providing additional information on chimerism. Here we describe the details of CStone's assembly and classification process, and propose that similar classification systems can be incorporated into other de novo assembly tools.
Within a related side study, we explore the effects that chimeras within reference sets have on the identification of differentially expressed genes. CStone is available at: https://sourceforge.net/projects/cstone/.
Within transcriptome reference sets, non-chimeric sequences are representations of transcribed genes, while artificially generated chimeric ones are mosaics of two or more pieces of DNA incorrectly pieced together. One area where such sets are utilized is in the quantification of gene expression patterns, where RNA-Seq reads are mapped to the sequences within, and subsequent count values reflect expression levels. Artificial chimeras can have a negative impact on count values by erroneously increasing variation in relation to the reads being mapped. Reference sets can be created from de novo assembled contigs, but chimeras can be introduced during the assembly process via the required traversal of graphs, representing gene families, constructed from the RNA-Seq data. Graph complexity determines how likely chimeras will arise. We have created CStone, a de novo assembler that utilizes a classification system to describe such complexity. Contigs created by CStone are labelled in a manner that indicates whether or not they are non-chimeric. This encourages contig dependent results to be presented with increased objectivity by maintaining the context of ambiguity associated with the assembly process. CStone has been tested extensively. Additionally, we have quantified the relationship between chimeras within reference sets and the identification of differentially expressed genes.
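The link between graph complexity and ambiguous paths can be illustrated with a small sketch. This is not CStone's classification scheme: the abstract describes three levels but does not define them, so the code below only builds a plain de Bruijn graph and reports a single yes/no flag for branching. The function names and the choice of k are assumptions made for illustration.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def has_ambiguous_paths(graph):
    """True if any node branches (more than one successor or predecessor)."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        if len(successors) > 1:
            return True  # a traversal must choose between outgoing edges
        for nxt in successors:
            in_degree[nxt] += 1
    return any(count > 1 for count in in_degree.values())

if __name__ == "__main__":
    linear = build_de_bruijn(["ACGTAC"], k=4)
    forked = build_de_bruijn(["ACGTT", "ACGTA"], k=4)
    print(has_ambiguous_paths(linear), has_ambiguous_paths(forked))
```

A graph without branching admits only one traversal, so a contig read from it cannot mix sequence from different transcripts. Once a node forks, the assembler has to choose a path, and that choice is where chimeras can arise.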
|
21
|
Sharma A, Bhattacharyya D, Sharma S, Chauhan RS. Transcriptome profiling reveal key hub genes in co-expression networks involved in Iridoid glycosides biosynthetic machinery in Picrorhiza kurroa. Genomics 2021; 113:3381-3394. [PMID: 34332040 DOI: 10.1016/j.ygeno.2021.07.024] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/19/2021] [Revised: 07/15/2021] [Accepted: 07/22/2021] [Indexed: 10/20/2022]
Abstract
Picrorhiza kurroa is a medicinal herb rich in hepatoprotective iridoid glycosides, picroside-I (P-I) and picroside-II (P-II). The biosynthetic machinery of picrosides is poorly understood; therefore, 'no-direction' gene co-expression networks were used to extract linked/closed and separated interactions in terpenoid glycosides-specific sub-networks. Transcriptomes generated from different organs varying in P-I and P-II contents, such as shoots grown at 15 and 25 °C and nursery-grown shoots, stolons, and roots, resulted in 47,726, 44,958, 40,117, 66,979, and 55,578 annotated transcripts, respectively. Occurrence of 2810 ± 136 nodes and 15,626 ± 696 edges in these networks indicated intense, co-expressed, closed loop interactions. Either deregulation/inhibition of abscisic acid (ABA) biosynthesis/signaling or constitutive degradation of ABA resulted in organ-specific accumulation of P-I and P-II. Biosynthesis, condensation and glucosylation of isoprene units may occur in shoots, roots or stolons; but addition of phenylpropanoid moiety and further modification/s of the iridoid backbone occurs mainly inside vacuoles in roots.
Affiliation(s)
- Ashish Sharma
  - Department of Biotechnology, School of Engineering & Applied Sciences, Bennett University, Greater Noida, Uttar Pradesh 201310, India
- Dipto Bhattacharyya
  - Department of Biotechnology, School of Engineering & Applied Sciences, Bennett University, Greater Noida, Uttar Pradesh 201310, India
- Shilpa Sharma
  - Department of Biotechnology, School of Engineering & Applied Sciences, Bennett University, Greater Noida, Uttar Pradesh 201310, India
- Rajinder Singh Chauhan
  - Department of Biotechnology, School of Engineering & Applied Sciences, Bennett University, Greater Noida, Uttar Pradesh 201310, India
|
22
|
Yu T, Han R, Fang Z, Mu Z, Zheng H, Liu J. TransRef enables accurate transcriptome assembly by redefining accurate neo-splicing graphs. Brief Bioinform 2021; 22:6319943. [PMID: 34254977 DOI: 10.1093/bib/bbab261] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/17/2021] [Revised: 06/09/2021] [Accepted: 01/22/2020] [Indexed: 11/14/2022] Open
Abstract
RNA-seq technology is widely employed in various research areas related to transcriptome analyses, and the identification of all the expressed transcripts from short sequencing reads presents a considerable computational challenge. In this study, we introduce TransRef, a new computational algorithm for accurate transcriptome assembly by redefining a novel graph model, the neo-splicing graph, and then iteratively applying a constrained dynamic programming to reconstruct all the expressed transcripts for each graph. When TransRef is utilized to analyze both real and simulated datasets, its performance is notably better than those of several state-of-the-art assemblers, including StringTie2, Cufflinks and Scallop. In particular, the performance of TransRef is notably strong in identifying novel transcripts and transcripts with low-expression levels, while the other assemblers are less effective.
Affiliation(s)
- Ting Yu
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
- Renmin Han
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
- Zhaoyuan Fang
  - Key Laboratory of Systems Biology, CAS Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
- Zengchao Mu
  - School of Mathematics from Shandong University, China
- Hongyu Zheng
  - Department of Radiation Oncology, Qilu Hospital, Cheeloo College of Medicine, Shandong University, Jinan, China
- Juntao Liu
  - School of Mathematics and Statistics at Shandong University, Weihai, China
|
23
|
Pincho: A Modular Approach to High Quality De Novo Transcriptomics. Genes (Basel) 2021; 12:genes12070953. [PMID: 34206353 PMCID: PMC8304035 DOI: 10.3390/genes12070953] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/02/2021] [Revised: 06/17/2021] [Accepted: 06/18/2021] [Indexed: 11/16/2022] Open
Abstract
Transcriptomic reconstructions without reference (i.e., de novo) are common for data samples derived from non-model biological systems. These assemblies involve massive parallel short read sequence reconstructions from experiments, but they usually employ ad-hoc bioinformatic workflows that exhibit limited standardization and customization. The increasing number of transcriptome assembly software continues to provide little room for standardization which is exacerbated by the lack of studies on modularity that compare the effects of assembler synergy. We developed a customizable management workflow for de novo transcriptomics that includes modular units for short read cleaning, assembly, validation, annotation, and expression analysis by connecting twenty-five individual bioinformatic tools. With our software tool, we were able to compare the assessment scores based on 129 distinct single-, bi- and tri-assembler combinations with diverse k-mer size selections. Our results demonstrate a drastic increase in the quality of transcriptome assemblies with bi- and tri- assembler combinations. We aim for our software to improve de novo transcriptome reconstructions for the ever-growing landscape of RNA-seq data derived from non-model systems. We offer guidance to ensure the most complete transcriptomic reconstructions via the inclusion of modular multi-assembly software controlled from a single master console.
|
24
|
Liu P, Ewald J, Galvez JH, Head J, Crump D, Bourque G, Basu N, Xia J. Ultrafast functional profiling of RNA-seq data for nonmodel organisms. Genome Res 2021; 31:713-720. [PMID: 33731361 PMCID: PMC8015844 DOI: 10.1101/gr.269894.120] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 08/05/2020] [Accepted: 02/18/2021] [Indexed: 12/02/2022]
Abstract
Computational time and cost remain a major bottleneck for RNA-seq data analysis of nonmodel organisms without reference genomes. To address this challenge, we have developed Seq2Fun, a novel, all-in-one, ultrafast tool to directly perform functional quantification of RNA-seq reads without transcriptome de novo assembly. The pipeline starts with raw read quality control: sequencing error correction, removing poly(A) tails, and joining overlapped paired-end reads. It then conducts a DNA-to-protein search by translating each read into all possible amino acid fragments and subsequently identifies possible homologous sequences in a well-curated protein database. Finally, the pipeline generates several informative outputs including gene abundance tables, pathway and species hit tables, an HTML report to visualize the results, and an output of clean reads annotated with mapped genes ready for downstream analysis. Seq2Fun does not have any intermediate steps of file writing and loading, making I/O very efficient. Seq2Fun is written in C++ and can run on a personal computer with a limited number of CPUs and memory. It can process >2,000,000 reads/min and is >120 times faster than conventional workflows based on de novo assembly, while maintaining high accuracy in our various test data sets.
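The step of translating each read into all possible amino acid fragments can be sketched as a six-frame translation. Seq2Fun itself is written in C++ and its internals are not given in the abstract, so this Python sketch only illustrates the general idea under stated assumptions: the standard genetic code, `X` for codons containing ambiguous bases, and hypothetical function names.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (an assumption;
# other translation tables exist).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frame_translation(read):
    """Return the six translations keyed '+1'..'+3' and '-1'..'-3'."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

if __name__ == "__main__":
    for name, peptide in six_frame_translation("ATGGCCTAA").items():
        print(name, peptide)
```

Each translated fragment can then be searched against a protein database, which is the DNA-to-protein search the abstract describes; the search itself is outside the scope of this sketch.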
Affiliation(s)
- Peng Liu
- Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Quebec H9X 3V9, Canada
| | - Jessica Ewald
- Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Quebec H9X 3V9, Canada
| | - Jose Hector Galvez
- Department of Human Genetics, McGill University, Montreal, Quebec H3A 0C7, Canada.,Canadian Center for Computational Genomics, McGill University, Montreal, Quebec H3A 0G1, Canada
| | - Jessica Head
- Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Quebec H9X 3V9, Canada
| | - Doug Crump
- Environment and Climate Change Canada, National Wildlife Research Centre, Ottawa, Ontario K1A 0H3, Canada
| | - Guillaume Bourque
- Department of Human Genetics, McGill University, Montreal, Quebec H3A 0C7, Canada.,Canadian Center for Computational Genomics, McGill University, Montreal, Quebec H3A 0G1, Canada
| | - Niladri Basu
- Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Quebec H9X 3V9, Canada
| | - Jianguo Xia
- Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Quebec H9X 3V9, Canada.,Department of Human Genetics, McGill University, Montreal, Quebec H3A 0C7, Canada
|
25
|
Duarte GT, Volkova PY, Geras'kin SA. A Pipeline for Non-model Organisms for de novo Transcriptome Assembly, Annotation, and Gene Ontology Analysis Using Open Tools: Case Study with Scots Pine. Bio Protoc 2021; 11:e3912. [PMID: 33732799 DOI: 10.21769/bioprotoc.3912] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/23/2020] [Revised: 12/13/2020] [Accepted: 12/15/2020] [Indexed: 11/02/2022] Open
Abstract
RNA sequencing (RNA-seq) has opened up the possibility of studying virtually any organism at the whole transcriptome level. Nevertheless, the absence of a sequenced and accurately annotated reference genome may be an obstacle for applying this technique to non-model organisms, especially for those with a complex genome. While de novo transcriptome assembly can circumvent this problem, it is often computationally demanding. Furthermore, transcriptome annotation and Gene Ontology enrichment analysis without an automated system is often a laborious task. Here we describe, step by step, the pipeline that was used to perform the transcriptome assembly, annotation, and Gene Ontology analysis of Scots pine (Pinus sylvestris), a gymnosperm species with a complex genome. Using only free software available to the scientific community and running on a standard personal computer, the pipeline is intended to facilitate transcriptomic studies for non-model species, while remaining flexible enough to be used with any organism.
Affiliation(s)
- Gustavo T Duarte
- Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany.,Russian Institute of Radiology and Agroecology, Obninsk, Russia
|
26
|
Lataretu M, Hölzer M. RNAflow: An Effective and Simple RNA-Seq Differential Gene Expression Pipeline Using Nextflow. Genes (Basel) 2020; 11:E1487. [PMID: 33322033 PMCID: PMC7763471 DOI: 10.3390/genes11121487] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 10/25/2020] [Revised: 12/04/2020] [Accepted: 12/07/2020] [Indexed: 12/28/2022] Open
Abstract
RNA-Seq enables the identification and quantification of RNA molecules, often with the aim of detecting differentially expressed genes (DEGs). Although RNA-Seq has evolved into a standard technique, there is no universal gold standard for the computational analysis of these data. On top of that, previous studies have demonstrated the irreproducibility of RNA-Seq studies. Here, we present a portable, scalable, and parallelizable Nextflow RNA-Seq pipeline to detect DEGs, which ensures a high level of reproducibility. The pipeline automatically takes care of common pitfalls, such as ribosomal RNA removal and low abundance gene filtering. Apart from various visualizations for the DEG results, we incorporated downstream pathway analysis for common species such as Homo sapiens and Mus musculus. We evaluated the DEG detection functionality using qRT-PCR data as a reference and observed a very high correlation of the logarithmized gene expression fold changes.
Affiliation(s)
- Marie Lataretu
- RNA Bioinformatics and High-Throughput Analysis, Friedrich Schiller University Jena, Leutragraben 1, 07743 Jena, Germany;
| | - Martin Hölzer
- Methodology and Research Infrastructure, MF1 Bioinformatics, Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
|
27
|
Niu J, Hu XL, Ip JCH, Ma KY, Tang Y, Wang Y, Qin J, Qiu JW, Chan TF, Chu KH. Multi-omic approach provides insights into osmoregulation and osmoconformation of the crab Scylla paramamosain. Sci Rep 2020; 10:21771. [PMID: 33303836 PMCID: PMC7728780 DOI: 10.1038/s41598-020-78351-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 09/30/2019] [Accepted: 11/23/2020] [Indexed: 12/18/2022] Open
Abstract
Osmoregulation and osmoconformation are two mechanisms through which aquatic animals adapt to salinity fluctuations. The euryhaline crab Scylla paramamosain, being both an osmoconformer and osmoregulator, is an excellent model organism to investigate salinity adaptation mechanisms in brachyurans. In the present study, we used transcriptomic and proteomic approaches to investigate the response of S. paramamosain to salinity stress. Crabs were transferred from a salinity of 25 ppt to salinities of 5 ppt or 33 ppt for 6 h and 10 days. Data from both approaches revealed that exposure to 5 ppt resulted in upregulation of ion transport and energy metabolism associated genes. Notably, acclimation to low salinity was associated with early changes in gene expression for signal transduction and stress response. In contrast, exposure to 33 ppt resulted in upregulation of genes related to amino acid metabolism, and amino acid transport genes were upregulated only at the early stage of acclimation to this salinity. Our study reveals contrasting mechanisms underlying osmoregulation and osmoconformation within the salinity range of 5–33 ppt in the mud crab, and provides novel candidate genes for osmotic signal transduction, thereby providing insights on understanding the salinity adaptation mechanisms of brachyuran crabs.
Affiliation(s)
- Jiaojiao Niu
- Simon F. S. Li Marine Science Laboratory, School of Life Sciences, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
| | - Xue Lei Hu
- Simon F. S. Li Marine Science Laboratory, School of Life Sciences, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
| | - Jack C H Ip
- Department of Biology, Hong Kong Baptist University, Kowloon Tong, Kowloon, Hong Kong, China
| | - Ka Yan Ma
- Simon F. S. Li Marine Science Laboratory, School of Life Sciences, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
| | - Yuanyuan Tang
- Simon F. S. Li Marine Science Laboratory, School of Life Sciences, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
| | - Yaqin Wang
- Simon F. S. Li Marine Science Laboratory, School of Life Sciences, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
| | - Jing Qin
- School of Pharmaceutical Sciences (Shenzhen), Sun Yat-Sen University, Guangzhou, 510275, China
| | - Jian-Wen Qiu
- Department of Biology, Hong Kong Baptist University, Kowloon Tong, Kowloon, Hong Kong, China
| | - Ting Fung Chan
- State Key Laboratory of Agrobiotechnology, School of Life Sciences, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
| | - Ka Hou Chu
- Simon F. S. Li Marine Science Laboratory, School of Life Sciences, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China.
|
28
|
Das RR, Pradhan S, Parida A. De-novo transcriptome analysis unveils differentially expressed genes regulating drought and salt stress response in Panicum sumatrense. Sci Rep 2020; 10:21251. [PMID: 33277539 PMCID: PMC7718891 DOI: 10.1038/s41598-020-78118-3] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 08/05/2020] [Accepted: 11/03/2020] [Indexed: 12/15/2022] Open
Abstract
Screening the transcriptome of a drought-tolerant variety of little millet (Panicum sumatrense), a marginally cultivated, nutritionally rich subsistence crop, can identify genes responsible for its hardiness and enable identification of new sources of genetic variation which can be used for crop improvement. RNA-Seq generated ~230 million reads from control and treated tissues, which were assembled into 86,614 unigenes. In silico differential gene expression analysis created an overview of patterns of gene expression during exposure to drought and salt stress. Separate gene expression profiles for leaf and root tissue revealed the differences in regulatory mechanisms operating in these tissues during exposure to abiotic stress. Several transcription factors were identified and studied for differential expression. Sixty-one differentially expressed genes were found to be common to both tissues under drought and salinity stress and were further validated using qRT-PCR. The transcriptome of P. sumatrense was also used to mine for genic SSR markers relevant to abiotic stress tolerance. This study is the first report of a detailed analysis of molecular mechanisms of drought and salinity stress tolerance in a little millet variety. Resources generated in this study can be used as potential candidates for further characterization and to improve abiotic stress tolerance in food crops.
Affiliation(s)
- Rasmita Rani Das
- Institute of Life Sciences, NALCO Square, Chandrasekharpur, Bhubaneswar, 751023, India
| | - Seema Pradhan
- Institute of Life Sciences, NALCO Square, Chandrasekharpur, Bhubaneswar, 751023, India
| | - Ajay Parida
- Institute of Life Sciences, NALCO Square, Chandrasekharpur, Bhubaneswar, 751023, India.
|
29
|
Yu T, Liu J, Gao X, Li G. iPAC: a genome-guided assembler of isoforms via phasing and combing paths. Bioinformatics 2020; 36:2712-2717. [PMID: 31985799 DOI: 10.1093/bioinformatics/btaa052] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 09/10/2019] [Revised: 12/14/2019] [Accepted: 01/20/2020] [Indexed: 01/09/2023] Open
Abstract
MOTIVATION Full-length transcript reconstruction is very important and quite challenging for the widely used RNA-seq data analysis. Currently available RNA-seq assemblers generally suffer from serious limitations in practical applications, such as low assembly accuracy and incompatibility with the latest alignment tools. RESULTS We introduce iPAC, a new genome-guided assembler for reconstruction of isoforms, which revolutionizes the usage of paired-end and sequencing depth information via phasing and combing paths over a newly designed phasing graph. Tested on both simulated and real datasets, it is to some extent superior to all the salient assemblers of the same kind. In particular, iPAC is notably effective at recovering lowly expressed transcripts, whereas the others are not. AVAILABILITY AND IMPLEMENTATION iPAC is freely available at http://sourceforge.net/projects/transassembly/files. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ting Yu
- School of Mathematics, Shandong University, Jinan 250100, China
| | - Juntao Liu
- School of Mathematics, Shandong University, Jinan 250100, China
| | - Xin Gao
- Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Guojun Li
- School of Mathematics, Shandong University, Jinan 250100, China.,Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
|
30
|
Alam MNU, Chowdhury UF. Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses. PLoS One 2020; 15:e0239381. [PMID: 32946529 PMCID: PMC7500682 DOI: 10.1371/journal.pone.0239381] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 07/21/2020] [Accepted: 09/06/2020] [Indexed: 01/20/2023] Open
Abstract
High-throughput sequencing technologies have greatly enabled the study of genomics, transcriptomics and metagenomics. Automated annotation and classification of the vast amounts of generated sequence data has become paramount for facilitating biological sciences. Genomes of viruses can be radically different from all life, both in terms of molecular structure and primary sequence. Alignment-based and profile-based searches are commonly employed for characterization of assembled viral contigs from high-throughput sequencing data. Recent attempts have highlighted the use of machine learning models for the task, but these models rely entirely on DNA genomes and, owing to the intrinsic genomic complexity of viruses, RNA viruses have been completely overlooked. Here, we present a novel short k-mer based sequence scoring method that generates robust sequence information for training machine learning classifiers. We trained 18 classifiers for the task of distinguishing viral RNA from human transcripts. We challenged our models with very stringent testing protocols across different species and evaluated performance against BLASTn, BLASTx and HMMER3 searches. For clean sequence data retrieved from curated databases, our models display near perfect accuracy, outperforming all similar attempts previously reported. On de novo assemblies of raw RNA-Seq data from cells subjected to Ebola virus, the area under the ROC curve varied from 0.6 to 0.86 depending on the software used for assembly. Our classifier was able to properly classify the majority of the false hits generated by BLAST and HMMER3 searches on the same data. The outstanding performance metrics of our model lay the groundwork for robust machine learning methods for the automated annotation of sequence data.
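To make the idea of a short k-mer abundance profile concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' scoring method: it assumes a profile is simply the relative frequency of every possible k-mer over the alphabet A, C, G, T (with U mapped to T), and the function name `kmer_profile` is hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq: str, k: int = 3) -> list[float]:
    """Return relative frequencies of all 4**k k-mers, in lexicographic order.

    Windows containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper().replace("U", "T")
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k in the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in all_kmers)
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts[kmer] / total for kmer in all_kmers]


if __name__ == "__main__":
    profile = kmer_profile("ACGUACGU", k=2)
    print(len(profile), round(sum(profile), 6))
```

A fixed-length vector of this kind can be passed to any standard classifier as a feature row; the choice of k trades feature count (4^k) against sequence specificity.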
Affiliation(s)
- Md. Nafis Ul Alam
- Department of Biochemistry and Molecular Biology, University of Dhaka, Dhaka, Bangladesh
| | - Umar Faruq Chowdhury
- Department of Biochemistry and Molecular Biology, University of Dhaka, Dhaka, Bangladesh
|
31
|
Response of Downy Oak (Quercus pubescens Willd.) to Climate Change: Transcriptome Assembly, Differential Gene Analysis and Targeted Metabolomics. PLANTS 2020; 9:plants9091149. [PMID: 32899727 PMCID: PMC7570186 DOI: 10.3390/plants9091149] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/20/2020] [Revised: 08/20/2020] [Accepted: 09/01/2020] [Indexed: 01/15/2023]
Abstract
Global change scenarios in the Mediterranean basin predict a precipitation reduction within the coming hundred years. Therefore, increased drought will affect forests both in terms of adaptive ecology and ecosystemic services. However, how vegetation might adapt to drought is poorly understood. In this report, four years of climate change was simulated by excluding 35% of precipitation above a downy oak forest. RNASeq data allowed us to assemble a genome-guided transcriptome. This led to the identification of differentially expressed features, which was supported by the characterization of target metabolites using a metabolomics approach. We provided 2.5 Tb of RNASeq data and the assembly of the first genome guided transcriptome of Quercus pubescens. Up to 5724 differentially expressed transcripts were obtained; 42 involved in plant response to drought. Transcript set enrichment analysis showed that drought induces an increase in oxidative pressure that is mitigated by the upregulation of ubiquitin-like protein protease, ferrochelatase, oxaloacetate decarboxylase and oxo-acid-lyase activities. Furthermore, the downregulation of auxin biosynthesis and transport, carbohydrate storage metabolism were observed as well as the concomitant accumulation of metabolites, such as oxalic acid, malate and isocitrate. Our data suggest that early metabolic changes in the resistance of Q. pubescens to drought involve a tricarboxylic acid (TCA) cycle shunt through the glyoxylate pathway, galactose metabolism by reducing carbohydrate storage and increased proteolytic activity.
|
32
|
Yu T, Mu Z, Fang Z, Liu X, Gao X, Liu J. TransBorrow: genome-guided transcriptome assembly by borrowing assemblies from different assemblers. Genome Res 2020; 30:1181-1190. [PMID: 32817072 PMCID: PMC7462071 DOI: 10.1101/gr.257766.119] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 09/30/2019] [Accepted: 06/18/2020] [Indexed: 12/12/2022]
Abstract
RNA-seq technology is widely used in various transcriptomic studies and provides great opportunities to reveal the complex structures of transcriptomes. To effectively analyze RNA-seq data, we introduce a novel transcriptome assembler, TransBorrow, which borrows the assemblies from different assemblers to search for reliable subsequences by building a colored graph from those borrowed assemblies. Then, by seeding reliable subsequences, a newly designed path extension strategy accurately searches for a transcript-representing path cover over each splicing graph. TransBorrow was tested on both simulated and real data sets and showed great superiority over all the compared leading assemblers.
Affiliation(s)
- Ting Yu
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
| | - Zengchao Mu
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
| | - Zhaoyuan Fang
- Key Laboratory of Systems Biology, CAS Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200031, China
| | - Xiaoping Liu
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
| | - Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Juntao Liu
- School of Mathematics and Statistics, Shandong University (Weihai), Weihai 264209, China
|
33
|
Osborne OG, Kafle T, Brewer T, Dobreva MP, Hutton I, Savolainen V. Sympatric speciation in mountain roses (Metrosideros) on an oceanic island. Philos Trans R Soc Lond B Biol Sci 2020; 375:20190542. [PMID: 32654651 DOI: 10.1098/rstb.2019.0542] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/13/2022] Open
Abstract
Shifts in flowering time have the potential to act as strong prezygotic reproductive barriers in plants. We investigate the role of flowering time divergence in two species of mountain rose (Metrosideros) endemic to Lord Howe Island, Australia, a minute and isolated island in the Tasman Sea. Metrosideros nervulosa and M. sclerocarpa are sister species and have divergent ecological niches on the island but grow sympatrically for much of their range, and likely speciated in situ on the island. We used flowering time and population genomic analyses of population structure and selection, to investigate their evolution, with a particular focus on the role of flowering time in their speciation. Population structure analyses showed the species are highly differentiated and appear to be in the very late stages of speciation. We found flowering times of the species to be significantly displaced, with M. sclerocarpa flowering 53 days later than M. nervulosa. Furthermore, the analyses of selection showed that flowering time genes are under selection between the species. Thus, prezygotic reproductive isolation is mediated by flowering time shifts in the species, and likely evolved under selection, to drive the completion of speciation within a small geographical area. This article is part of the theme issue 'Towards the completion of speciation: the evolution of reproductive isolation beyond the first barriers'.
Affiliation(s)
- Owen G Osborne
- Department of Life Sciences, Imperial College London, Silwood Park Campus, Ascot SL5 7PY, UK
| | - Tane Kafle
- Department of Life Sciences, Imperial College London, Silwood Park Campus, Ascot SL5 7PY, UK
| | - Tom Brewer
- Department of Life Sciences, Imperial College London, Silwood Park Campus, Ascot SL5 7PY, UK
| | - Mariya P Dobreva
- Department of Life Sciences, Imperial College London, Silwood Park Campus, Ascot SL5 7PY, UK
| | - Ian Hutton
- Lord Howe Island Museum, Lord Howe Island, NSW 2898, Australia
| | - Vincent Savolainen
- Department of Life Sciences, Imperial College London, Silwood Park Campus, Ascot SL5 7PY, UK
|
34
|
Transcriptome analysis of heat stressed seedlings with or without pre-heat treatment in Cryptomeria japonica. Mol Genet Genomics 2020; 295:1163-1172. [PMID: 32472284 DOI: 10.1007/s00438-020-01689-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 09/30/2019] [Accepted: 05/19/2020] [Indexed: 10/24/2022]
Abstract
With global warming as a major environmental concern over the coming years, heat tolerance is an important trait for forest tree survival during the predicted future warmer weather conditions. Cryptomeria japonica is a coniferous species widely distributed throughout Japan, and thus, can adapt to a wide range of air temperatures. To elucidate genes involved in heat response in Cryptomeria japonica, transcriptome analysis was conducted for seedlings under heat shock conditions. To test whether heat acclimation affects levels of gene expression, half of the seedlings were pretreated with moderately high temperatures prior to heat shock. De novo assembly of the transcriptome generated 107,924 unigenes and the analysis of differentially expressed genes was conducted using these unigenes. A total of 5217 differentially expressed genes were identified. Most genes upregulated by heat shock, regardless of pre-heat treatment, were conserved to heat response genes of angiosperm species, such as heat shock factors (Hsf) and heat shock proteins (Hsp). Pre-heating of seedlings affected expression levels of several Hsfs and their induction was lower in pre-heated seedlings than in seedlings without pre-heat treatment. This suggests a conserved role of Hsfs in heat response and heat acclimation in seed plants. On the other hand, many unknown genes were upregulated after heat exposure only in seedlings without pre-heat treatment. Notably, expression of gypsy/Ty3 type retrotransposons was dramatically induced. These findings provide valuable information to develop a better understanding of the molecular mechanisms of heat response and acclimation in C. japonica.
|
35
|
Huerlimann R, Maes GE, Maxwell MJ, Mobli M, Launikonis BS, Jerry DR, Wade NM. Multi-species transcriptomics reveals evolutionary diversity in the mechanisms regulating shrimp tail muscle excitation-contraction coupling. Gene 2020; 752:144765. [PMID: 32413480 DOI: 10.1016/j.gene.2020.144765] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 03/04/2020] [Revised: 04/17/2020] [Accepted: 05/11/2020] [Indexed: 11/30/2022]
Abstract
The natural flight response in shrimp is powered by rapid contractions of the abdominal muscle fibres to propel themselves backwards away from perceived danger. This muscle contraction is dependent on repetitive depolarization of muscle plasma membrane, triggering tightly spaced cytoplasmic [Ca2+] transients and rapidly rising tetanic force responses. To achieve such high amplitude and high frequency of Ca2+ transients requires a high abundance of sarcoplasmic/endoplasmic reticulum Ca2+ ATPase (SERCA) to rapidly clear cytoplasmic Ca2+ between each transient and an efficient Ca2+ release system consisting of the Ryanodine Receptor (RyR), and voltage gated Ca2+ channels (CaVs). With the aim to expand our knowledge of muscle gene function and identify orthologous genes regulating muscle excitation-contraction (EC) coupling, this study assembled nine Penaeid shrimp muscle transcriptomes. On average, the nine transcriptomes contained 27,000 contigs, with an annotation rate of 36% and a BUSCO completeness of 70%. Despite maintaining their function, the crustacean RyR and CaV proteins showed evidence of significant diversification from mammalian orthologs, while SERCA remained more conserved. Several key components of protein interaction were conserved, while others showed distinct crustacean specific evolutionary adaptations. Lastly, this study revealed approximately 1,000 orthologous genes involved in muscle specific processes present across all nine species.
Affiliation(s)
- Roger Huerlimann
- ARC Research Hub for Advanced Prawn Breeding, Australia; Centre for Sustainable Tropical Fisheries and Aquaculture, College of Science and Engineering, James Cook University, Townsville, QLD 4811, Australia; Centre for Tropical Bioinformatics and Molecular Biology, James Cook University, Townsville, QLD 4811, Australia.
| | - Gregory E Maes
- Centre for Sustainable Tropical Fisheries and Aquaculture, College of Science and Engineering, James Cook University, Townsville, QLD 4811, Australia; Laboratory of Biodiversity and Evolutionary Genomics, KU Leuven, Leuven 3000, Belgium; Centre for Human Genetics, KU Leuven, Leuven 3000, Belgium
| | - Michael J Maxwell
- Centre for Advanced Imaging, University of Queensland, Brisbane, QLD 4072, Australia
| | - Mehdi Mobli
- Centre for Advanced Imaging, University of Queensland, Brisbane, QLD 4072, Australia
| | - Bradley S Launikonis
- School of Biomedical Sciences, The University of Queensland, Brisbane, QLD 4072, Australia
| | - Dean R Jerry
- ARC Research Hub for Advanced Prawn Breeding, Australia; Centre for Sustainable Tropical Fisheries and Aquaculture, College of Science and Engineering, James Cook University, Townsville, QLD 4811, Australia; Tropical Futures Institute, James Cook University, 149 Sims Drive, Singapore 387380, Singapore
| | - Nicholas M Wade
- ARC Research Hub for Advanced Prawn Breeding, Australia; CSIRO Agriculture and Food, Aquaculture Program, 306 Carmody Road, St Lucia, QLD 4067
|
36
|
Zhao J, Feng H, Zhu D, Zhang C, Xu Y. IsoTree: A New Framework for de novo Transcriptome Assembly from RNA-seq Reads. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:938-948. [PMID: 29994455 DOI: 10.1109/tcbb.2018.2808350] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 06/08/2023]
Abstract
High-throughput sequencing of mRNA has made the deep and efficient probing of the transcriptome more affordable. However, the vast amounts of short RNA-seq reads make de novo transcriptome assembly an algorithmic challenge. In this work, we present IsoTree, a novel framework for transcript reconstruction in the absence of reference genomes. Unlike most de novo assembly methods, which build a de Bruijn graph or splicing graph by connecting k-mers (sets of overlapping substrings generated from reads), IsoTree constructs the splicing graph by connecting reads directly. For each splicing graph, IsoTree applies an iterative scheme of mixed integer linear programming to build a prefix tree, called an isoform tree. Each path from the root node of the isoform tree to a leaf node represents a plausible transcript candidate, which will be pruned based on the information of paired-end reads. Experiments showed that in most cases IsoTree performs better than other leading transcriptome assembly programs. IsoTree is available at https://github.com/Jane110111107/IsoTree.
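The statement that each root-to-leaf path of a prefix tree represents a transcript candidate can be illustrated with a small Python sketch. This is only a toy model under stated assumptions: the tree is a nested dictionary keyed by hypothetical exon labels, and neither the mixed integer linear programming step nor the paired-end pruning of IsoTree is reproduced. The helper names are invented for illustration.

```python
def insert_path(tree: dict, path: list[str]) -> None:
    """Insert a sequence of node labels into a nested-dict prefix tree."""
    node = tree
    for label in path:
        node = node.setdefault(label, {})


def root_to_leaf_paths(node: dict, prefix: tuple = ()):
    """Yield every root-to-leaf path; each one is a transcript candidate."""
    if not node:
        # A node without children is a leaf; skip the empty root of an empty tree.
        if prefix:
            yield list(prefix)
        return
    for label, child in node.items():
        yield from root_to_leaf_paths(child, prefix + (label,))


if __name__ == "__main__":
    isoform_tree: dict = {}
    # Hypothetical exon labels: two candidates share the prefix e1.
    insert_path(isoform_tree, ["e1", "e2", "e4"])
    insert_path(isoform_tree, ["e1", "e3", "e4"])
    for candidate in root_to_leaf_paths(isoform_tree):
        print(" -> ".join(candidate))
```

Sharing prefixes keeps the candidate set compact: the common start `e1` is stored once, and candidates diverge only where the paths branch.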
|
37
|
Sadat-Hosseini M, Bakhtiarizadeh MR, Boroomand N, Tohidfar M, Vahdati K. Combining independent de novo assemblies to optimize leaf transcriptome of Persian walnut. PLoS One 2020; 15:e0232005. [PMID: 32343733 PMCID: PMC7188282 DOI: 10.1371/journal.pone.0232005] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 05/27/2019] [Accepted: 04/06/2020] [Indexed: 12/22/2022] Open
Abstract
Transcriptome resources can help increase the yield and quality of walnuts. Finding the best transcriptome assembly has not yet been the subject of walnut research. This research generated 240,179,782 reads from cDNA libraries of 11 walnut leaves. The reads provided a complete de novo transcriptome assembly. Fifteen different transcriptome assemblies were constructed using five well-known assemblers from the scientific literature with different k-mer lengths (Bridger, BinPacker, SOAPdenovo-Trans, Trinity and SPAdes) as well as two merging approaches (EvidentialGene and Transfuse). Based on the four quality metrics of assembly, the results indicated that merging the assemblies generated by the de novo assemblers was effective. Finally, EvidentialGene was recognized as the best assembler for the de novo assembly of the leaf transcriptome in walnut. Among a total of 183,191 transcripts generated by EvidentialGene, 109,413 transcripts had protein-coding potential (59.72%) and 104,926 were recognized as ORFs (57.27%). In addition, 79,185 transcripts were predicted to have at least one hit to the Pfam database. A total of 3,931 transcription factors were identified by BLAST searching against PlnTFDB. Furthermore, 6,591 of the predicted peptide sequences contained signaling peptides, while 92,704 contained transmembrane domains. Comparison of the assembled transcripts with transcripts of the walnut and the published genome assembly for the 'Chandler' cultivar using the BLAST algorithm identified a total of 27,304 and 19,178 homologous transcripts, respectively. De novo transcriptomes of walnut leaves can be developed for future studies in functional genomics and genetic studies of walnuts.
Affiliation(s)
- Mohammad Sadat-Hosseini
- Department of Horticulture, College of Aburaihan, University of Tehran, Tehran, Iran
- Department of Horticulture, Faculty of Agriculture, University of Jiroft, Jiroft, Iran
| | | | - Naser Boroomand
- Department of Soil Science, Faculty of Agriculture, Shahid Bahonar University of Kerman, Kerman, Iran
| | - Masoud Tohidfar
- Department of Plant Biotechnology, Faculty of Life Science and Biotechnology, Shahid Beheshti University, Tehran, Iran
| | - Kourosh Vahdati
- Department of Horticulture, College of Aburaihan, University of Tehran, Tehran, Iran
| |
|
38
|
Freedman AH, Clamp M, Sackton TB. Error, noise and bias in de novo transcriptome assemblies. Mol Ecol Resour 2020; 21:18-29. [PMID: 32180366 DOI: 10.1111/1755-0998.13156] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 10/08/2019] [Revised: 01/25/2020] [Accepted: 03/10/2020] [Indexed: 12/21/2022]
Abstract
De novo transcriptome assembly is a powerful tool, and has been widely used over the last decade for making evolutionary inferences. However, it relies on two implicit assumptions: that the assembled transcriptome is an unbiased representation of the underlying expressed transcriptome, and that expression estimates from the assembly are good, if noisy, approximations of the relative abundance of expressed transcripts. Using publicly available data for model organisms, we demonstrate that, across assembly algorithms and data sets, these assumptions are consistently violated. Bias exists at the nucleotide level, with genotyping error rates ranging from 30% to 83%. As a result, diversity is underestimated in transcriptome assemblies, with consistent underestimation of heterozygosity in all but the most inbred samples. Even at the gene level, expression estimates show wide deviations from map-to-reference estimates, and positive bias at lower expression levels. Standard filtering of transcriptome assemblies improves the robustness of gene expression estimates but leads to the loss of a meaningful number of protein-coding genes, including many that are highly expressed. We demonstrate a computational method, length-rescaled CPM, to partly alleviate noise and bias in expression estimates. Researchers should consider ways to minimize the impact of bias in transcriptome assemblies.
Affiliation(s)
- Adam H Freedman
- Faculty of Arts and Sciences Informatics Group, Harvard University, Cambridge, MA, USA
| | - Michele Clamp
- Faculty of Arts and Sciences Informatics Group, Harvard University, Cambridge, MA, USA
| | - Timothy B Sackton
- Faculty of Arts and Sciences Informatics Group, Harvard University, Cambridge, MA, USA
| |
|
39
|
Almodaresi F, Pandey P, Ferdman M, Johnson R, Patro R. An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search. J Comput Biol 2020; 27:485-499. [PMID: 32176522 PMCID: PMC7185321 DOI: 10.1089/cmb.2019.0322] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 11/13/2022] Open
Abstract
The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure. In this article, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes (patterns of color occurrence) present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into thousands. We apply this encoding in the context of two different applications: the implicit cdbg used for a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved >11× better compression compared to Raman, Raman, Rao (RRR).
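The idea of encoding each color class relative to a similar one can be sketched as follows. This is only an illustrative sketch, not the Mantis implementation: the names `encode` and `decode`, the dictionary layout, and the hand-picked `parent` map are all assumptions made for the example, whereas the abstract describes using the de Bruijn graph itself to find near neighbors. Each class is stored as its symmetric difference with a parent class, so closely related classes cost only a few entries.

```python
from typing import Dict, FrozenSet, Optional, Tuple

# A color class is the set of sample indices in which a k-mer occurs.
ColorClass = FrozenSet[int]
Encoded = Dict[str, Tuple[Optional[str], ColorClass]]


def encode(classes: Dict[str, ColorClass],
           parent: Dict[str, Optional[str]]) -> Encoded:
    """Store each class as (parent id, symmetric difference with the parent).

    `parent` is assumed to describe a forest (no cycles); roots map to None
    and are stored in full, as a difference against the empty set.
    """
    encoded: Encoded = {}
    for cid, colors in classes.items():
        p = parent[cid]
        base = classes[p] if p is not None else frozenset()
        encoded[cid] = (p, colors ^ base)
    return encoded


def decode(cid: Optional[str], encoded: Encoded) -> ColorClass:
    """Rebuild a class by XOR-ing the deltas along the path to its root."""
    colors: ColorClass = frozenset()
    while cid is not None:
        cid, delta = encoded[cid]
        colors = colors ^ delta
    return colors


# Hypothetical toy data: three classes that differ by one sample each.
classes = {
    "a": frozenset({0, 1, 2, 3}),
    "b": frozenset({0, 1, 2, 3, 4}),
    "c": frozenset({0, 1, 2, 4}),
}
parent = {"a": None, "b": "a", "c": "b"}
encoded = encode(classes, parent)
stored = sum(len(delta) for _, delta in encoded.values())
raw = sum(len(colors) for colors in classes.values())
print(f"entries stored: {stored} (delta) vs {raw} (explicit)")
```

The trade-off in this sketch is that decoding a class requires walking its parent chain, so deeper hierarchies save space at the cost of lookup time.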
Affiliation(s)
- Fatemeh Almodaresi
- Department of Computer Science, University of Maryland, College Park, Maryland
| | - Prashant Pandey
- School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
| | - Michael Ferdman
- Department of Computer Science, Stony Brook University, Stony Brook, New York
| | - Rob Johnson
- Department of Computer Science, Stony Brook University, Stony Brook, New York
- VMware Research, Palo Alto, California
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, Maryland
| |
|
40
|
Calcino AD, de Oliveira AL, Simakov O, Schwaha T, Zieger E, Wollesen T, Wanninger A. The quagga mussel genome and the evolution of freshwater tolerance. DNA Res 2020; 26:411-422. [PMID: 31504356 PMCID: PMC6796509 DOI: 10.1093/dnares/dsz019] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 04/27/2019] [Accepted: 08/01/2019] [Indexed: 02/06/2023] Open
Abstract
Freshwater dreissenid mussels evolved from marine ancestors during the Miocene ∼30 million years ago and today include some of the most successful and destructive invasive species of freshwater environments. Here, we sequenced the genome of the quagga mussel Dreissena rostriformis to identify adaptations involved in embryonic osmoregulation. We provide evidence that a lophotrochozoan-specific aquaporin water channel, a vacuolar ATPase subunit and a sodium/hydrogen exchanger are involved in osmoregulation throughout early cleavage, during which time large intercellular fluid-filled 'cleavage cavities' repeatedly form, coalesce and collapse, expelling excess water to the exterior. Independent expansions of aquaporins coinciding with at least five freshwater colonization events confirm their role in freshwater adaptation. Repeated aquaporin expansions and the evolution of membrane-bound fluid-filled osmoregulatory structures in diverse freshwater taxa point to a fundamental principle guiding the evolution of freshwater tolerance and provide a framework for future species control efforts.
Affiliation(s)
- Andrew D Calcino
- Department of Integrative Zoology, University of Vienna, Vienna, Austria
| | | | - Oleg Simakov
- Department of Molecular Evolution and Development, University of Vienna, Vienna, Austria
| | - Thomas Schwaha
- Department of Integrative Zoology, University of Vienna, Vienna, Austria
| | - Elisabeth Zieger
- Department of Integrative Zoology, University of Vienna, Vienna, Austria
| | - Tim Wollesen
- Developmental Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Andreas Wanninger
- Department of Integrative Zoology, University of Vienna, Vienna, Austria
| |
|
41
|
De novo transcriptome assembly and analysis of Phragmites karka, an invasive halophyte, to study the mechanism of salinity stress tolerance. Sci Rep 2020; 10:5192. [PMID: 32251358 PMCID: PMC7089983 DOI: 10.1038/s41598-020-61857-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 11/25/2019] [Accepted: 02/27/2020] [Indexed: 01/04/2023] Open
Abstract
With the rapidly deteriorating environmental conditions, the development of stress-tolerant plants has become a priority for sustaining agricultural productivity. Therefore, studying the process of stress tolerance in naturally tolerant species holds significant promise. Phragmites karka is an invasive plant species found abundantly in tropical and subtropical regions, freshwater regions and brackish marshy areas, such as river banks and lake shores. The plant possesses the ability to adapt and survive under conditions of high salinity. We subjected P. karka seedlings to salt stress and carried out whole transcriptome profiling of leaf and root tissues. Assessing the global transcriptome changes under salt stress resulted in the identification of several genes that are differentially regulated under stress conditions in root and leaf tissue. A total of 161,403 unigenes were assembled and used as a reference for digital gene expression analysis. A number of key metabolic pathways were found to be over-represented. Digital gene expression analysis was validated using qRT-PCR. In addition, a number of different transcription factor families, including WRKY, MYB, CCCH and NAC, were differentially expressed under salinity stress. Our data will facilitate further characterisation of genes involved in salinity stress tolerance in P. karka. The DEGs from our results are potential candidates for understanding and engineering abiotic stress tolerance in plants.
|
42
|
Bushmanova E, Antipov D, Lapidus A, Prjibelski AD. rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data. Gigascience 2020; 8:5559527. [PMID: 31494669 PMCID: PMC6736328 DOI: 10.1093/gigascience/giz100] [Citation(s) in RCA: 406] [Impact Index Per Article: 81.2] [Received: 11/16/2018] [Revised: 04/20/2019] [Accepted: 08/01/2019] [Indexed: 12/18/2022] Open
Abstract
Background The possibility of generating large RNA-sequencing datasets has led to the development of various reference-based and de novo transcriptome assemblers, each with its own strengths and limitations. While reference-based tools are widely used in various transcriptomic studies, their application is limited to organisms with finished and well-annotated genomes. De novo transcriptome reconstruction from short reads remains an open and challenging problem, which is complicated by the varying expression levels across different genes, alternative splicing, and paralogous genes. Results Herein we describe the novel transcriptome assembler rnaSPAdes, which has been developed on top of the SPAdes genome assembler and explores computational parallels between assembly of transcriptomes and single-cell genomes. We also present quality assessment reports for rnaSPAdes assemblies, compare it with modern transcriptome assembly tools using several evaluation approaches on various RNA-sequencing datasets, and briefly highlight strong and weak points of different assemblers. Conclusions Based on the comparison between different assembly methods, we infer that no assembler is an absolute leader across all quality metrics and all datasets used. However, rnaSPAdes typically outperforms other assemblers on such important properties as the number of assembled genes and isoforms, and at the same time has higher accuracy statistics on average compared with its closest competitors.
Affiliation(s)
- Elena Bushmanova
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, 199004, 6 linia V.O. 11d, Russia
| | - Dmitry Antipov
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, 199004, 6 linia V.O. 11d, Russia
| | - Alla Lapidus
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, 199004, 6 linia V.O. 11d, Russia
| | - Andrey D Prjibelski
- Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, 199004, 6 linia V.O. 11d, Russia
| |
|
43
|
Hu XL, Tang YY, Kwok ML, Chan KM, Chu KH. Impact of juvenile hormone analogue insecticides on the water flea Moina macrocopa: Growth, reproduction and transgenerational effect. Aquatic Toxicology (Amsterdam, Netherlands) 2020; 220:105402. [PMID: 31927065 DOI: 10.1016/j.aquatox.2020.105402] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 09/23/2019] [Revised: 12/20/2019] [Accepted: 01/01/2020] [Indexed: 06/10/2023]
Abstract
The increasing quantities of insecticides that leach into water bodies severely affect the health of the aquatic environment. Juvenile hormone analogue (JHA) insecticides are endocrine disrupters that interfere with hormonal activity in insects by mimicking juvenile hormones (JHs). Because the structure and functions of methyl farnesoate in crustaceans are similar to those of the insect JHs, exogenous JHA insecticides may cause adverse effects on growth and reproduction in crustaceans similar to those observed in insects. This study examined the toxic effects of two JHA insecticides, methoprene and fenoxycarb, on the water flea Moina macrocopa. The 24-h and 48-h LC50 values for fenoxycarb and methoprene were 0.53 and 0.32 mg/L and 0.70 and 0.54 mg/L, respectively. Chronic exposure to the two JHAs caused a series of toxic effects in M. macrocopa, including shortened life expectancy, repressed body growth, reduced fecundity, and disturbed expression of genes involved in the JH signaling pathway, in cuticle development, and in the carbohydrate, amino acid, and ATP metabolic processes. Moreover, JHA exposure impaired the growth and reproduction of the offspring of exposed M. macrocopa, even when the neonates themselves were not exposed to the chemicals. In addition, changes in the expression of genes related to histone methylation indicate that epigenetic changes may promote transgenerational impairment in M. macrocopa. These results demonstrate the toxic effects of fenoxycarb and methoprene on non-target aquatic organisms. The damage done by these JHA insecticides to the aquatic environment warrants attention and further study.
Affiliation(s)
- Xue Lei Hu
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
| | - Yuan Yuan Tang
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
| | - Man Long Kwok
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
| | - King Ming Chan
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
| | - Ka Hou Chu
- School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
| |
|
44
|
Chowdhury HA, Bhattacharyya DK, Kalita JK. Differential Expression Analysis of RNA-seq Reads: Overview, Taxonomy, and Tools. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:566-586. [PMID: 30281477 DOI: 10.1109/tcbb.2018.2873010] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 06/08/2023]
Abstract
Analysis of RNA-sequence (RNA-seq) data is widely used in transcriptomic studies and it has many applications. We review RNA-seq data analysis from RNA-seq reads to the results of differential expression analysis. In addition, we perform a descriptive comparison of tools used in each step of RNA-seq data analysis along with a discussion of important characteristics of these tools. A taxonomy of tools is also provided. A discussion of issues in quality control and visualization of RNA-seq data is also included along with useful tools. Finally, we provide some guidelines for the RNA-seq data analyst, along with research issues and challenges which should be addressed.
|
45
|
Kashyap A, Rhodes A, Kronmiller B, Berger J, Champagne A, Davis EW, Finnegan MV, Geniza M, Hendrix DA, Löhr CV, Petro VM, Sharpton TJ, Wells J, Epps CW, Jaiswal P, Tyler BM, Ramsey SA. Pan-tissue transcriptome analysis of long noncoding RNAs in the American beaver Castor canadensis. BMC Genomics 2020; 21:153. [PMID: 32050897 PMCID: PMC7014947 DOI: 10.1186/s12864-019-6432-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/11/2019] [Accepted: 12/26/2019] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND Long noncoding RNAs (lncRNAs) have roles in gene regulation, epigenetics, and molecular scaffolding and it is hypothesized that they underlie some mammalian evolutionary adaptations. However, for many mammalian species, the absence of a genome assembly precludes the comprehensive identification of lncRNAs. The genome of the American beaver (Castor canadensis) has recently been sequenced, setting the stage for the systematic identification of beaver lncRNAs and the characterization of their expression in various tissues. The objective of this study was to discover and profile polyadenylated lncRNAs in the beaver using high-throughput short-read sequencing of RNA from sixteen beaver tissues and to annotate the resulting lncRNAs based on their potential for orthology with known lncRNAs in other species. RESULTS Using de novo transcriptome assembly, we found 9528 potential lncRNA contigs and 187 high-confidence lncRNA contigs. Of the high-confidence lncRNA contigs, 147 have no known orthologs (and thus are putative novel lncRNAs) and 40 have mammalian orthologs. The novel lncRNAs mapped to the Oregon State University (OSU) reference beaver genome with greater than 90% sequence identity. While the novel lncRNAs were on average shorter than their annotated counterparts, they were similar to the annotated lncRNAs in terms of the relationships between contig length and minimum free energy (MFE) and between coverage and contig length. We identified beaver orthologs of known lncRNAs such as XIST, MEG3, TINCR, and NIPBL-DT. We profiled the expression of the 187 high-confidence lncRNAs across 16 beaver tissues (whole blood, brain, lung, liver, heart, stomach, intestine, skeletal muscle, kidney, spleen, ovary, placenta, castor gland, tail, toe-webbing, and tongue) and identified both tissue-specific and ubiquitous lncRNAs. CONCLUSIONS To our knowledge this is the first report of systematic identification of lncRNAs and their expression atlas in beaver. 
LncRNAs, both novel and those with known orthologs, are expressed in each of the beaver tissues that we analyzed. For some beaver lncRNAs with known orthologs, the tissue-specific expression patterns were phylogenetically conserved. The lncRNA sequence data files and raw sequence files are available via the web supplement and the NCBI Sequence Read Archive, respectively.
Affiliation(s)
- Amita Kashyap
- Department of Biomedical Sciences, Oregon State University, Corvallis, OR, USA
| | - Adelaide Rhodes
- Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA
| | - Brent Kronmiller
- Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA
| | - Josie Berger
- College of Forestry, Oregon State University, Corvallis, OR, USA
| | - Ashley Champagne
- College of Forestry, Oregon State University, Corvallis, OR, USA
| | - Edward W Davis
- Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA
| | | | - Matthew Geniza
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
| | - David A Hendrix
- Department of Biochemistry and Biophysics, Oregon State University, Corvallis, OR, USA
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
| | - Christiane V Löhr
- Department of Biomedical Sciences, Oregon State University, Corvallis, OR, USA
| | - Vanessa M Petro
- College of Forestry, Oregon State University, Corvallis, OR, USA
| | - Thomas J Sharpton
- Department of Microbiology, Oregon State University, Corvallis, OR, USA
- Department of Statistics, Oregon State University, Corvallis, OR, USA
| | - Jackson Wells
- Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA
| | - Clinton W Epps
- Department of Fisheries and Wildlife, Oregon State University, Corvallis, OR, USA
| | - Pankaj Jaiswal
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
| | - Brett M Tyler
- Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR, USA
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, USA
| | - Stephen A Ramsey
- Department of Biomedical Sciences, Oregon State University, Corvallis, OR, USA
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
| |
|
46
|
Luo Y, Liao X, Wu FX, Wang J. Computational Approaches for Transcriptome Assembly Based on Sequencing Technologies. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190410155603] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/22/2022]
Abstract
Transcriptome assembly plays a critical role in studying biological properties and examining the expression levels of genomes in specific cells. It is also the basis of many downstream analyses. With the increase of speed and the decrease in cost, massive sequencing data continues to accumulate. A large number of assembly strategies based on different computational methods and experiments have been developed. How to efficiently perform transcriptome assembly with high sensitivity and accuracy becomes a key issue. In this work, the issues with transcriptome assembly are explored based on different sequencing technologies. Specifically, transcriptome assemblies with next-generation sequencing reads are divided into reference-based assemblies and de novo assemblies. The examples of different species are used to illustrate that long reads produced by the third-generation sequencing technologies can cover full-length transcripts without assemblies. In addition, different transcriptome assemblies using the Hybrid-seq methods and other tools are also summarized. Finally, we discuss the future directions of transcriptome assemblies.
Affiliation(s)
- Yuwen Luo
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Xingyu Liao
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Fang-Xiang Wu
- Division of Biomedical Engineering, University of Saskatchewan, Saskatchewan, Canada
| | - Jianxin Wang
- School of Computer Science and Engineering, Central South University, Changsha, China
| |
|
47
|
Zhao J, Feng H, Zhu D, Zhang C, Xu Y. DTA-SiST: de novo transcriptome assembly by using simplified suffix trees. BMC Bioinformatics 2019; 20:698. [PMID: 31874618 PMCID: PMC6929406 DOI: 10.1186/s12859-019-3272-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022] Open
Abstract
Background Alternative splicing allows the pre-mRNAs of a gene to be spliced into various mRNAs, which greatly increases the diversity of proteins. High-throughput sequencing of mRNAs has revolutionized our ability to reconstruct transcripts. However, the massive volume of short reads makes de novo transcript assembly an algorithmic challenge. Results We develop a novel framework, called DTA-SiST, for de novo transcriptome assembly based on suffix trees. DTA-SiST first extends contigs with the reads that have the longest overlaps with the contigs' termini. These reads can be found in time linear in the lengths of the reads through a well-designed suffix tree structure. Then, DTA-SiST constructs splicing graphs based on contigs for each gene locus. Finally, DTA-SiST proposes two strategies to extract transcript-representing paths: a depth-first enumeration strategy and a hybrid strategy based on length and coverage. We implemented both strategies and compared them with state-of-the-art de novo assemblers on both simulated and real datasets. Experimental results showed that the depth-first enumeration strategy always achieves better recall, and also better precision on smaller datasets, while the hybrid strategy leads in precision on big datasets. Conclusions DTA-SiST is more competitive than the other compared de novo assemblers, especially in precision, owing to the read-based contig extension strategy and the transcript extraction rules.
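The contig-extension step described in this abstract (repeatedly appending the read with the longest overlap at a contig end) can be illustrated with a small sketch. This is not DTA-SiST's code: the function names `overlap` and `extend_right`, the `min_len` threshold, and the toy reads are assumptions for illustration, and a naive scan over all reads stands in for the suffix tree that the paper uses to find overlaps in linear time.

```python
from typing import List, Optional


def overlap(left: str, right: str, min_len: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def extend_right(contig: str, reads: List[str], min_len: int = 3) -> str:
    """Greedily extend the 3' end of `contig` with best-overlapping reads."""
    unused = list(reads)
    while True:
        best_len = 0
        best_read: Optional[str] = None
        for read in unused:
            n = overlap(contig, read, min_len)
            # Keep only reads that overlap more than the current best
            # and that contribute at least one new base.
            if best_len < n < len(read):
                best_len, best_read = n, read
        if best_read is None:
            return contig
        contig += best_read[best_len:]
        unused.remove(best_read)


# Hypothetical toy input: three reads that tile past the end of the contig.
reads = ["GCGTACC", "GTACCTT", "CCTTGA"]
print(extend_right("ATGGCGT", reads))  # ATGGCGTACCTTGA
```

Each pass of this sketch scans every unused read, so it is quadratic in the number of reads; that cost is what an indexed structure such as a suffix tree is meant to avoid.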
Affiliation(s)
- Jin Zhao
- School of Computer Science and Technology, Shandong University, Binhai Road, Qingdao, Shandong, People's Republic of China
| | - Haodi Feng
- School of Computer Science and Technology, Shandong University, Binhai Road, Qingdao, Shandong, People's Republic of China
| | - Daming Zhu
- School of Computer Science and Technology, Shandong University, Binhai Road, Qingdao, Shandong, People's Republic of China
| | - Chi Zhang
- Department of Medical and Molecular Genetics and Center for Computational Biology and Bioinformatics, Indiana University, Indianapolis, IN, USA
| | - Ying Xu
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA
| |
|
48
|
Comparative Analysis of Strategies for De Novo Transcriptome Assembly in Prokaryotes: Streptomyces clavuligerus as a Case Study. High Throughput 2019; 8:ht8040020. [PMID: 31801255 PMCID: PMC6970227 DOI: 10.3390/ht8040020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/31/2019] [Revised: 11/20/2019] [Accepted: 11/23/2019] [Indexed: 12/15/2022] Open
Abstract
The performance of software tools for de novo transcriptome assembly greatly depends on the selection of software parameters. Up to now, the development of de novo transcriptome assembly for prokaryotes has not been as remarkable as that for eukaryotes. In this contribution, Rockhopper2 was used to perform a comparative transcriptome analysis of Streptomyces clavuligerus exposed to diverse environmental conditions. The study focused on assessing the influence of software parameters on software performance, with the identification of differentially expressed genes as the final goal. For this, a statistical optimization was performed using the Transrate Assembly Score (TAS). TAS was also used for evaluating the software performance and for comparing it with related tools, e.g., Trinity. Transcriptome redundancy and completeness were also considered in this analysis. Rockhopper2 and Trinity reached TAS values of 0.55092 and 0.58337, respectively. Trinity assembles transcriptomes with high redundancy, with 55.6% of transcripts having some duplicates. Additionally, we observed that the total number of differentially expressed genes (DEG) and their annotation greatly depend on the method used for removing redundancy and the tools used for transcript quantification. To our knowledge, this is the first work aimed at assessing de novo assembly software for prokaryotic organisms.
|
49
|
Nystrom GS, Ward MJ, Ellsworth SA, Rokyta DR. Sex-based venom variation in the eastern bark centipede (Hemiscolopendra marginata). Toxicon 2019; 169:45-58. [DOI: 10.1016/j.toxicon.2019.08.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 05/10/2019] [Revised: 07/25/2019] [Accepted: 08/08/2019] [Indexed: 11/15/2022]
|
50
|
Liu J, Liu X, Ren X, Li G. scRNAss: a single-cell RNA-seq assembler via imputing dropouts and combing junctions. Bioinformatics 2019; 35:4264-4271. [PMID: 30951147 DOI: 10.1093/bioinformatics/btz240] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 03/19/2018] [Revised: 12/17/2018] [Accepted: 04/02/2019] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Full-length transcript reconstruction is essential for single-cell RNA-seq data analysis, but dropout events, which can cause transcripts to be discarded completely or broken into pieces, pose great challenges for transcript assembly. Currently available RNA-seq assemblers are generally designed for bulk RNA sequencing. To fill the gap, we introduce single-cell RNA-seq assembler (scRNAss), a method that applies explicit strategies to impute information lost to dropout events and a combing strategy to infer transcripts from scRNA-seq data. RESULTS Extensive evaluations on both simulated and biological datasets demonstrated its superiority over state-of-the-art RNA-seq assemblers including StringTie, Cufflinks and CLASS2. In particular, it showed a remarkable capability of recovering unknown 'novel' isoforms and high computational efficiency compared to other tools. AVAILABILITY AND IMPLEMENTATION scRNAss is free, open-source software available from https://sourceforge.net/projects/single-cell-rna-seq-assembly/files/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Juntao Liu
- School of Mathematics, Shandong University, Jinan, China
| | - Xiangyu Liu
- School of Mathematics, Shandong University, Jinan, China
| | - Xianwen Ren
- Biomedical Pioneering Innovation Center, Beijing Advanced Innovation Center for Genomics, and School of Life Sciences, Peking University, Beijing, China
| | - Guojun Li
- School of Mathematics, Shandong University, Jinan, China
| |
|