1
|
La Ferlita A, Alaimo S, Di Bella S, Martorana E, Laliotis GI, Bertoni F, Cascione L, Tsichlis PN, Ferro A, Bosotti R, Pulvirenti A. RNAdetector: a free user-friendly stand-alone and cloud-based system for RNA-Seq data analysis. BMC Bioinformatics 2021; 22:298. [PMID: 34082707 PMCID: PMC8173825 DOI: 10.1186/s12859-021-04211-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2020] [Accepted: 05/20/2021] [Indexed: 12/13/2022] Open
Abstract
Background RNA-Seq is a well-established technology extensively used for transcriptome profiling, allowing the analysis of coding and non-coding RNA molecules. However, this technology produces a vast amount of data requiring sophisticated computational approaches for their analysis than other traditional technologies such as Real-Time PCR or microarrays, strongly discouraging non-expert users. For this reason, dozens of pipelines have been deployed for the analysis of RNA-Seq data. Although interesting, these present several limitations and their usage require a technical background, which may be uncommon in small research laboratories. Therefore, the application of these technologies in such contexts is still limited and causes a clear bottleneck in knowledge advancement. Results Motivated by these considerations, we have developed RNAdetector, a new free cross-platform and user-friendly RNA-Seq data analysis software that can be used locally or in cloud environments through an easy-to-use Graphical User Interface allowing the analysis of coding and non-coding RNAs from RNA-Seq datasets of any sequenced biological species. Conclusions RNAdetector is a new software that fills an essential gap between the needs of biomedical and research labs to process RNA-Seq data and their common lack of technical background in performing such analysis, which usually relies on outsourcing such steps to third party bioinformatics facilities or using expensive commercial software. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04211-7.
Collapse
Affiliation(s)
- Alessandro La Ferlita
- Department of Clinical and Experimental Medicine, Bioinformatics Unit, University of Catania, Catania, Italy.,Department of Cancer Biology and Genetics, The Ohio State University, Columbus, OH, USA.,Department of Physics and Astronomy, University of Catania, Catania, Italy
| | - Salvatore Alaimo
- Department of Clinical and Experimental Medicine, Bioinformatics Unit, University of Catania, Catania, Italy
| | | | - Emanuele Martorana
- Regional Referral Centre for Rare Lung Diseases, A. O. U. "Policlinico-Vittorio Emanuele", Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy
| | - Georgios I Laliotis
- Department of Cancer Biology and Genetics, The Ohio State University, Columbus, OH, USA
| | | | | | - Philip N Tsichlis
- Department of Cancer Biology and Genetics, The Ohio State University, Columbus, OH, USA
| | - Alfredo Ferro
- Department of Clinical and Experimental Medicine, Bioinformatics Unit, University of Catania, Catania, Italy
| | | | - Alfredo Pulvirenti
- Department of Clinical and Experimental Medicine, Bioinformatics Unit, University of Catania, Catania, Italy.
| |
Collapse
|
2
|
Mora-Márquez F, Vázquez-Poletti JL, López de Heredia U. NGScloud2: optimized bioinformatic analysis using Amazon Web Services. PeerJ 2021; 9:e11237. [PMID: 33959420 PMCID: PMC8054753 DOI: 10.7717/peerj.11237] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2020] [Accepted: 03/17/2021] [Indexed: 12/13/2022] Open
Abstract
Background NGScloud was a bioinformatic system developed to perform de novo RNAseq analysis of non-model species by exploiting the cloud computing capabilities of Amazon Web Services. The rapid changes undergone in the way this cloud computing service operates, along with the continuous release of novel bioinformatic applications to analyze next generation sequencing data, have made the software obsolete. NGScloud2 is an enhanced and expanded version of NGScloud that permits the access to ad hoc cloud computing infrastructure, scaled according to the complexity of each experiment. Methods NGScloud2 presents major technical improvements, such as the possibility of running spot instances and the most updated AWS instances types, that can lead to significant cost savings. As compared to its initial implementation, this improved version updates and includes common applications for de novo RNAseq analysis, and incorporates tools to operate workflows of bioinformatic analysis of reference-based RNAseq, RADseq and functional annotation. NGScloud2 optimizes the access to Amazon’s large computing infrastructures to easily run popular bioinformatic software applications, otherwise inaccessible to non-specialized users lacking suitable hardware infrastructures. Results The correct performance of the pipelines for de novo RNAseq, reference-based RNAseq, RADseq and functional annotation was tested with real experimental data, providing workflow performance estimates and tips to make optimal use of NGScloud2. Further, we provide a qualitative comparison of NGScloud2 vs. the Galaxy framework. NGScloud2 code, instructions for software installation and use are available at https://github.com/GGFHF/NGScloud2. NGScloud2 includes a companion package, NGShelper that contains Python utilities to post-process the output of the pipelines for downstream analysis at https://github.com/GGFHF/NGShelper.
Collapse
Affiliation(s)
- Fernando Mora-Márquez
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politécnica de Madrid, Madrid, Spain
| | - José Luis Vázquez-Poletti
- GI Arquitectura de Sistemas Distribuidos, Dpto. de Arquitectura de Ordenadores y Automática, Facultad de Informática, Universidad Complutense de Madrid, Madrid, Spain
| | - Unai López de Heredia
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politécnica de Madrid, Madrid, Spain
| |
Collapse
|
3
|
Mora-Márquez F, Chano V, Vázquez-Poletti JL, López de Heredia U. TOA: A software package for automated functional annotation in non-model plant species. Mol Ecol Resour 2020; 21:621-636. [PMID: 33070442 DOI: 10.1111/1755-0998.13285] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2020] [Revised: 10/01/2020] [Accepted: 10/13/2020] [Indexed: 01/05/2023]
Abstract
The increase of sequencing capacity provided by high-throughput platforms has made it possible to routinely obtain large sets of genomic and transcriptomic sequences from model and non-model organisms. Subsequent genomic analysis and gene discovery in next-generation sequencing experiments are, however, bottlenecked by functional annotation. One common way to perform functional annotation of sets of sequences obtained from next-generation sequencing experiments, is by searching for homologous sequences and accessing the related functional information deposited in genomic databases. Functional annotation is especially challenging for non-model organisms, like many plant species. In such cases, existing free and commercial general-purpose applications may not offer complete and accurate results. We present TOA (Taxonomy-oriented annotation), a Python-based user-friendly open source application designed to establish functional annotation pipelines geared towards non-model plant species that can run in Linux/Mac computers, HPCs and cloud servers. TOA performs homology searches against proteins stored in the PLAZA databases, NCBI RefSeq Plant, Nucleotide Database and Non-Redundant Protein Sequence Database, and outputs functional information from several ontology systems: Gene Ontology, InterPro, EC, KEGG, Mapman and MetaCyc. The software performance was validated by comparing the runtimes, total number of annotated sequences and accuracy of the functional information obtained for several plant benchmark data sets with TOA and other functional annotation solutions. TOA outperformed the other software in terms of number of annotated sequences and accuracy of the annotation and constitutes a good alternative to improve functional annotation in plants. TOA is especially recommended for gymnosperms or for low quality sequence data sets of non-model plants.
Collapse
Affiliation(s)
- Fernando Mora-Márquez
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politécnica de Madrid, Madrid, Spain
| | - Víctor Chano
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politécnica de Madrid, Madrid, Spain
| | - José Luis Vázquez-Poletti
- GI Arquitectura de Sistemas Distribuidos, Dpto. Arquitectura de Computadores y Automática, Facultad de Informática, Universidad Complutense de Madrid, Madrid, Spain
| | - Unai López de Heredia
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politécnica de Madrid, Madrid, Spain
| |
Collapse
|
4
|
Mora-Márquez F, Vázquez-Poletti JL, Chano V, Collada C, Soto Á, de Heredia UL. Hardware Performance Evaluation of De novo Transcriptome Assembly Software in Amazon Elastic Compute Cloud. Curr Bioinform 2020. [DOI: 10.2174/1574893615666191219095817] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
Bioinformatics software for RNA-seq analysis has a high computational
requirement in terms of the number of CPUs, RAM size, and processor characteristics.
Specifically, de novo transcriptome assembly demands large computational infrastructure due to
the massive data size, and complexity of the algorithms employed. Comparative studies on the
quality of the transcriptome yielded by de novo assemblers have been previously published,
lacking, however, a hardware efficiency-oriented approach to help select the assembly hardware
platform in a cost-efficient way.
Objective:
We tested the performance of two popular de novo transcriptome assemblers, Trinity
and SOAPdenovo-Trans (SDNT), in terms of cost-efficiency and quality to assess limitations, and
provided troubleshooting and guidelines to run transcriptome assemblies efficiently.
Methods:
We built virtual machines with different hardware characteristics (CPU number, RAM
size) in the Amazon Elastic Compute Cloud of the Amazon Web Services. Using simulated and
real data sets, we measured the elapsed time, cost, CPU percentage and output size of small and
large data set assemblies.
Results:
For small data sets, SDNT outperformed Trinity by an order the magnitude, significantly
reducing the time duration and costs of the assembly. For large data sets, Trinity performed better
than SDNT. Both the assemblers provide good quality transcriptomes.
Conclusion:
The selection of the optimal transcriptome assembler and provision of computational
resources depend on the combined effect of size and complexity of RNA-seq experiments.
Collapse
Affiliation(s)
- Fernando Mora-Márquez
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
| | - José Luis Vázquez-Poletti
- GI Arquitectura de Sistemas Distribuidos, Dpto. Arquitectura de Computadores y Automatica, Facultad de Informatica, Universidad Complutense de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
| | - Víctor Chano
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
| | - Carmen Collada
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
| | - Álvaro Soto
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
| | - Unai López de Heredia
- GI Sistemas Naturales e Historia Forestal, Dpto. Sistemas y Recursos Naturales, ETSI Montes, Forestal y del Medio Natural, Universidad Politecnica de Madrid, Ciudad Universitaria, 28040 Madrid, Spain
| |
Collapse
|