1
|
Li P, Zhao Z, Wang W, Wang T, Hu N, Wei Y, Sun Z, Chen Y, Li Y, Liu Q, Yang S, Gong J, Xiao X, Liu Y, Shi Y, Peng R, Lu Q, Yuan Y. Genome-wide analyses of member identification, expression pattern, and protein-protein interaction of EPF/EPFL gene family in Gossypium. BMC PLANT BIOLOGY 2024; 24:554. [PMID: 38877405 PMCID: PMC11177404 DOI: 10.1186/s12870-024-05262-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Accepted: 06/06/2024] [Indexed: 06/16/2024]
Abstract
BACKGROUND The epidermal patterning factor / -like (EPF/EPFL) gene family encodes a class of cysteine-rich secretory peptides that are widely found in terrestrial plants. Multiple studies have indicated that EPF/EPFLs might play significant roles in coordinating plant development and growth, especially in the morphogenesis of stomata, awns, stamens, and fruit skin. However, little research on the EPF/EPFL gene family has been reported in Gossypium. RESULTS We identified 20 G. raimondii, 24 G. arboreum, 44 G. hirsutum, and 44 G. barbadense EPF/EPFL genes in the 4 representative cotton species, which were divided into four clades together with 11 Arabidopsis thaliana, 13 Oryza sativa, and 17 Selaginella moellendorffii genes based on their evolutionary relationships. The similar gene structures and common motifs indicated high conservation among the EPF/EPFL members, while their uneven distribution across chromosomes implied variability during the long-term evolutionary process. Hundreds of collinearity relationships were identified from pairwise comparisons of intraspecific and interspecific genomes, which illustrated that gene duplication might have contributed to the expansion of the cotton EPF/EPFL gene family. A total of 15 kinds of cis-regulatory elements were predicted in the promoter regions and divided into three major categories relevant to the biological processes of development and growth, plant hormone response, and abiotic stress response. Having performed expression pattern analyses on the basis of published RNA-seq data, we found that most GhEPF/EPFL and GbEPF/EPFL genes showed relatively low expression levels across the 9 tissues or organs, while showing more dramatically different responses to high/low temperature and salt or drought stresses.
Combined with transcriptome data of developing ovules and fibers and quantitative real-time PCR (qRT-PCR) results for 15 highly expressed GhEPF/EPFL genes, it could be deduced that the cotton EPF/EPFL genes are closely related to fiber development. Additionally, the protein-protein interaction networks among EPF/EPFLs concentrated on the cores of GhEPF1 and GhEPF7, and the functional enrichment analyses indicated that most EPF/EPFLs participate in the GO (Gene Ontology) terms of stomatal development and plant epidermis development, and the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways of DNA or base excision repair. CONCLUSION In total, 132 EPF/EPFL genes were identified for the first time in cotton; bioinformatic analyses of their cis-regulatory elements and expression patterns, combined with qRT-PCR experiments, support potential functions in plant growth and responses to abiotic stresses, specifically in fiber development. These results not only provide comprehensive and valuable information on the cotton EPF/EPFL gene family, but also lay a solid foundation for screening candidate EPF/EPFL genes in further cotton breeding.
Affiliation(s)
- Pengtao Li
- School of Biotechnology and Food Engineering, Anyang Institute of Technology, Anyang , Henan, 455000, China
| | - Zilin Zhao
- School of Biotechnology and Food Engineering, Anyang Institute of Technology, Anyang , Henan, 455000, China
| | - Wenkui Wang
- National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, Henan, 455000, China
| | - Tao Wang
- School of Biotechnology and Food Engineering, Anyang Institute of Technology, Anyang , Henan, 455000, China
| | - Nan Hu
- School of Biotechnology and Food Engineering, Anyang Institute of Technology, Anyang , Henan, 455000, China
| | - Yangyang Wei
- School of Biotechnology and Food Engineering, Anyang Institute of Technology, Anyang , Henan, 455000, China
| | - Zhihao Sun
- National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, Henan, 455000, China
| | - Yu Chen
- National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, Henan, 455000, China
| | - Yanfang Li
- College of Agriculture, Tarim University, Alaer , Xinjiang, 843300, China
| | - Qiankun Liu
- National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, Henan, 455000, China
| | - Shuhan Yang
- College of Agriculture, Tarim University, Alaer , Xinjiang, 843300, China
| | - Juwu Gong
- National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, Henan, 455000, China
| | - Xianghui Xiao
- National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, Henan, 455000, China
| | - Yuling Liu
- School of Biotechnology and Food Engineering, Anyang Institute of Technology, Anyang , Henan, 455000, China
| | - Yuzhen Shi
- National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, Henan, 455000, China
| | - Renhai Peng
- School of Biotechnology and Food Engineering, Anyang Institute of Technology, Anyang , Henan, 455000, China
| | - Quanwei Lu
- School of Biotechnology and Food Engineering, Anyang Institute of Technology, Anyang , Henan, 455000, China.
| | - Youlu Yuan
- National Key Laboratory of Cotton Bio-breeding and Integrated Utilization, Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, Henan, 455000, China.
| |
|
2
|
Van R, Alvarez D, Mize T, Gannavarapu S, Chintham Reddy L, Nasoz F, Han MV. A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies. BMC Bioinformatics 2024; 25:181. [PMID: 38720247 PMCID: PMC11080237 DOI: 10.1186/s12859-024-05801-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Accepted: 05/02/2024] [Indexed: 05/12/2024] Open
Abstract
BACKGROUND RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the quality of the data is variable, procured differently, and contains unwanted noise hampering the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in attempts to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin. RESULTS We aimed to investigate the impact of data preprocessing steps, focusing on normalization, batch effect correction, and data scaling, through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer. CONCLUSION By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. 
Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
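The weighted F1-score named in this abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `weighted_f1` and the tissue labels are hypothetical, and the code only shows the usual definition, where each class's F1 is averaged with a weight equal to that class's share of the true labels (the same idea as scikit-learn's `f1_score(..., average="weighted")`).

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp                    # true members of the class that were missed
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1      # weight by class support
    return score


# Hypothetical tissue-of-origin labels, for illustration only.
y_true = ["lung", "lung", "breast", "breast", "breast", "colon"]
y_pred = ["lung", "breast", "breast", "breast", "colon", "colon"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support keeps a rare tumor type from dominating the score, which matters when the test set has very uneven class sizes.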
Affiliation(s)
- Richard Van
- School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA
- Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
| | - Daniel Alvarez
- Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA
- Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
| | - Travis Mize
- Icahn School of Medicine at Mount Sinai, Institute for Genomic Health, New York City, NY, USA
| | - Sravani Gannavarapu
- Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA
- Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
| | - Lohitha Chintham Reddy
- Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA
- Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
| | - Fatma Nasoz
- Department of Computer Science, University of Nevada Las Vegas, Las Vegas, NV, USA
- Nevada Institute of Personalized Medicine, Las Vegas, NV, USA
| | - Mira V Han
- School of Life Sciences, University of Nevada Las Vegas, Las Vegas, NV, USA.
- Nevada Institute of Personalized Medicine, Las Vegas, NV, USA.
| |
|
3
|
Ni Y, Liu X, Simeneh ZM, Yang M, Li R. Benchmarking of Nanopore R10.4 and R9.4.1 flow cells in single-cell whole-genome amplification and whole-genome shotgun sequencing. Comput Struct Biotechnol J 2023; 21:2352-2364. [PMID: 37025654 PMCID: PMC10070092 DOI: 10.1016/j.csbj.2023.03.038] [Citation(s) in RCA: 71] [Impact Index Per Article: 35.5] [Received: 12/18/2022] [Revised: 03/21/2023] [Accepted: 03/22/2023] [Indexed: 03/30/2023] Open
Abstract
Third-generation sequencing can be used in human cancer genomics and epigenomic research. Oxford Nanopore Technologies (ONT) recently released the R10.4 flow cell, which is claimed to have improved read accuracy compared to the R9.4.1 flow cell. To evaluate the benefits and defects of the R10.4 flow cell for cancer cell profiling on MinION devices, we used the human non-small-cell lung-carcinoma cell line HCC78 to construct libraries for both single-cell whole-genome amplification (scWGA) and whole-genome shotgun sequencing. The R10.4 and R9.4.1 reads were benchmarked in terms of read accuracy, variant detection, modification calling, and genome recovery rate, and compared with next generation sequencing (NGS) reads. The results highlighted that R10.4 reads outperform R9.4.1 reads, achieving a higher modal read accuracy of over 99.1%, superior variation detection, a lower false-discovery rate (FDR) in methylation calling, and a comparable genome recovery rate. To achieve scWGA sequencing yields on the ONT platform as high as those of NGS, we recommended multiple displacement amplification with a modified T7 endonuclease Ⅰ cutting procedure as a promising method. In addition, we provided a possible solution to filter the likely false positive sites across the whole genome region with R10.4 by using the scWGA sequencing result as a negative control. Our study is the first benchmark of whole genome single-cell sequencing using ONT R10.4 and R9.4.1 MinION flow cells, clarifying the capacity of genomic and epigenomic profiling within a single flow cell. A promising method for scWGA sequencing together with the methylation calling results can benefit researchers who work on cancer cell genomic and epigenomic profiling using third-generation sequencing.
Affiliation(s)
- Ying Ni
- Department of Precision Diagnostic and Therapeutic Technology, City University of Hong Kong Shenzhen Futian Research Institute, Shenzhen, Guangdong, China
- Department of Biomedical Sciences and Tung Biomedical Sciences Centre, City University of Hong Kong, Hong Kong, China
| | - Xudong Liu
- Department of Infectious Diseases and Public Health, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong, China
| | - Zemenu Mengistie Simeneh
- Department of Precision Diagnostic and Therapeutic Technology, City University of Hong Kong Shenzhen Futian Research Institute, Shenzhen, Guangdong, China
- Department of Biomedical Sciences and Tung Biomedical Sciences Centre, City University of Hong Kong, Hong Kong, China
| | - Mengsu Yang
- Department of Precision Diagnostic and Therapeutic Technology, City University of Hong Kong Shenzhen Futian Research Institute, Shenzhen, Guangdong, China
- Department of Biomedical Sciences and Tung Biomedical Sciences Centre, City University of Hong Kong, Hong Kong, China
- Corresponding author at: Department of Precision Diagnostic and Therapeutic Technology, City University of Hong Kong Shenzhen Futian Research Institute, Shenzhen, Guangdong, China.
| | - Runsheng Li
- Department of Infectious Diseases and Public Health, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong, China
- Southern Marine Science and Engineering Guangdong Laboratory, Guangzhou, China
- Corresponding author at: Department of Infectious Diseases and Public Health, Jockey Club College of Veterinary Medicine and Life Sciences, City University of Hong Kong, Hong Kong, China.
| |
|
4
|
Wynn EA, Vestal BE, Fingerlin TE, Moore CM. A comparison of methods for multiple degree of freedom testing in repeated measures RNA-sequencing experiments. BMC Med Res Methodol 2022; 22:153. [PMID: 35643435 PMCID: PMC9148455 DOI: 10.1186/s12874-022-01615-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2022] [Accepted: 04/14/2022] [Indexed: 11/10/2022] Open
Abstract
Background As the cost of RNA-sequencing decreases, complex study designs, including paired, longitudinal, and other correlated designs, become increasingly feasible. These studies often include multiple hypotheses and thus multiple degree of freedom tests, or tests that evaluate multiple hypotheses jointly, are often useful for filtering the gene list to a set of interesting features for further exploration while controlling the false discovery rate. Though there are several methods which have been proposed for analyzing correlated RNA-sequencing data, there has been little research evaluating and comparing the performance of multiple degree of freedom tests across methods. Methods We evaluated 11 different methods for modelling correlated RNA-sequencing data by performing a simulation study to compare the false discovery rate, power, and model convergence rate across several hypothesis tests and sample size scenarios. We also applied each method to a real longitudinal RNA-sequencing dataset. Results Linear mixed modelling using transformed data had the best false discovery rate control while maintaining relatively high power. However, this method had high model non-convergence, particularly at small sample sizes. No method had high power at the lowest sample size. We found a mix of conservative and anti-conservative behavior across the other methods, which was influenced by the sample size and the hypothesis being evaluated. The patterns observed in the simulation study were largely replicated in the analysis of a longitudinal study including data from intensive care unit patients experiencing cardiogenic or septic shock. Conclusions Multiple degree of freedom testing is a valuable tool in longitudinal and other correlated RNA-sequencing experiments. Of the methods that we investigated, linear mixed modelling had the best overall combination of power and false discovery rate control. Other methods may also be appropriate in some scenarios. 
Supplementary Information The online version contains supplementary material available at (10.1186/s12874-022-01615-8).
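As a reminder of what controlling the false discovery rate across a gene list involves, here is a minimal sketch of the Benjamini-Hochberg adjustment. The abstract does not say which adjustment procedure the authors used, so BH is an assumption here, and the function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order (illustrative helper)."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a joint (multiple degree of freedom) test.
p = [0.01, 0.04, 0.03, 0.20]
print([round(q, 4) for q in benjamini_hochberg(p)])
```

Genes whose adjusted value falls below the chosen FDR level would be kept for further exploration.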
|
5
|
Generative Adversarial Networks for Creating Synthetic Nucleic Acid Sequences of Cat Genome. Int J Mol Sci 2022; 23:ijms23073701. [PMID: 35409058 PMCID: PMC8998662 DOI: 10.3390/ijms23073701] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/08/2022] [Revised: 03/23/2022] [Accepted: 03/25/2022] [Indexed: 01/04/2023] Open
Abstract
Nucleic acids are the basic units of deoxyribonucleic acid (DNA) sequencing. Every organism demonstrates different DNA sequences with specific nucleotides. It reveals the genetic information carried by a particular DNA segment. Nucleic acid sequencing expresses the evolutionary changes among organisms and revolutionizes disease diagnosis in animals. This paper proposes a generative adversarial networks (GAN) model to create synthetic nucleic acid sequences of the cat genome tuned to exhibit specific desired properties. We obtained the raw sequence data from Illumina next generation sequencing. Various data preprocessing steps were performed using Cutadapt and DADA2 tools. The processed data were fed to the GAN model that was designed following the architecture of Wasserstein GAN with gradient penalty (WGAN-GP). We introduced a predictor and an evaluator in our proposed GAN model to tune the synthetic sequences to acquire certain realistic properties. The predictor was built for extracting samples with a promoter sequence, and the evaluator was built for filtering samples that scored high for motif-matching. The filtered samples were then passed to the discriminator. We evaluated our model based on multiple metrics and demonstrated outputs for latent interpolation, latent complementation, and motif-matching. Evaluation results showed our proposed GAN model achieved 93.7% correlation with the original data and produced significant outcomes as compared to existing models for sequence generation.
|
6
|
Music of metagenomics-a review of its applications, analysis pipeline, and associated tools. Funct Integr Genomics 2021; 22:3-26. [PMID: 34657989 DOI: 10.1007/s10142-021-00810-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/29/2021] [Revised: 09/25/2021] [Accepted: 10/03/2021] [Indexed: 10/20/2022]
Abstract
This humble effort highlights the intricate details of metagenomics in a simple, poetic, and rhythmic way. The paper enforces the significance of the research area, provides details about major analytical methods, examines the taxonomy and assembly of genomes, emphasizes some tools, and concludes by celebrating the richness of the ecosystem populated by the "metagenome."
|
7
|
A censored-Poisson model based approach to the analysis of RNA-seq data. QUANTITATIVE BIOLOGY 2020. [DOI: 10.1007/s40484-020-0208-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
|
8
|
Alkhateeb A, Rezaeian I, Singireddy S, Cavallo-Medved D, Porter LA, Rueda L. Transcriptomics Signature from Next-Generation Sequencing Data Reveals New Transcriptomic Biomarkers Related to Prostate Cancer. Cancer Inform 2019; 18:1176935119835522. [PMID: 30890858 PMCID: PMC6416685 DOI: 10.1177/1176935119835522] [Citation(s) in RCA: 48] [Impact Index Per Article: 8.0] [Received: 01/22/2019] [Accepted: 01/23/2019] [Indexed: 12/11/2022] Open
Abstract
Prostate cancer is one of the most common types of cancer among Canadian men. Next-generation sequencing using RNA-Seq provides large amounts of data that may reveal novel and informative biomarkers. We introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have tremendous value in guiding treatment decisions. Analysis of normal versus malignant prostate cancer data sets indicates differential expression of the genes HEATR5B, DDC, and GABPB1-AS1 as potential prostate cancer biomarkers. Our study also supports PTGFR, NREP, SCARNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease.
Affiliation(s)
| | - Iman Rezaeian
- School of Computer Science, University of Windsor, Windsor, ON, Canada
| | - Siva Singireddy
- School of Computer Science, University of Windsor, Windsor, ON, Canada
| | - Dora Cavallo-Medved
- Department of Biological Sciences, University of Windsor, Windsor, ON, Canada
| | - Lisa A Porter
- Department of Biological Sciences, University of Windsor, Windsor, ON, Canada
| | - Luis Rueda
- School of Computer Science, University of Windsor, Windsor, ON, Canada
| |
|
9
|
Campbell CR, Poelstra JW, Yoder AD. What is Speciation Genomics? The roles of ecology, gene flow, and genomic architecture in the formation of species. Biol J Linn Soc Lond 2018. [DOI: 10.1093/biolinnean/bly063] [Citation(s) in RCA: 69] [Impact Index Per Article: 9.9] [Indexed: 01/24/2023]
Affiliation(s)
| | - J W Poelstra
- Department of Biology, Duke University, Durham, NC, USA
| | - Anne D Yoder
- Department of Biology, Duke University, Durham, NC, USA
| |
|
10
|
Costa-Silva J, Domingues D, Lopes FM. RNA-Seq differential expression analysis: An extended review and a software tool. PLoS One 2017; 12:e0190152. [PMID: 29267363 PMCID: PMC5739479 DOI: 10.1371/journal.pone.0190152] [Citation(s) in RCA: 338] [Impact Index Per Article: 42.3] [Received: 07/28/2017] [Accepted: 12/08/2017] [Indexed: 12/22/2022] Open
Abstract
The correct identification of differentially expressed genes (DEGs) between specific conditions is key to understanding phenotypic variation. High-throughput transcriptome sequencing (RNA-Seq) has become the main option for these studies. Thus, the number of methods and software tools for differential expression analysis from RNA-Seq data has also increased rapidly. However, there is no consensus about the most appropriate pipeline or protocol for identifying differentially expressed genes from RNA-Seq data. This work presents an extended review on the topic that includes the evaluation of six methods of mapping reads, including pseudo-alignment and quasi-mapping, and nine methods of differential expression analysis from RNA-Seq data. The adopted methods were evaluated based on real RNA-Seq data, using qRT-PCR data as reference (gold-standard). As part of the results, we developed software that performs all the analyses presented in this work, which is freely available at https://github.com/costasilvati/consexpression. The results indicated that mapping methods have minimal impact on the final DEGs analysis, given that the adopted data have an annotated reference genome. Regarding the adopted experimental model, the DEGs identification methods that had more consistent results were limma+voom, NOIseq and DESeq2. Additionally, the consensus among five DEGs identification methods guarantees a list of DEGs with great accuracy, indicating that the combination of different methods can produce more suitable results. The consensus option is also included for use in the available software.
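The consensus idea described in this abstract, keeping only genes that several DEG identification methods agree on, can be sketched as a simple vote count. This is an illustrative assumption about how such a consensus might be computed, not the implementation in the consexpression tool; the function name, the gene identifiers, and the per-method calls are hypothetical.

```python
from collections import Counter


def consensus_degs(calls_by_method, min_methods):
    """Genes called differentially expressed by at least `min_methods` methods."""
    # Count each gene once per method, even if a method lists it twice.
    votes = Counter(
        gene for genes in calls_by_method.values() for gene in set(genes)
    )
    return sorted(gene for gene, n in votes.items() if n >= min_methods)


# Hypothetical DEG calls; method names are taken from the abstract.
calls = {
    "limma+voom": ["geneA", "geneB", "geneC"],
    "NOIseq": ["geneA", "geneB"],
    "DESeq2": ["geneA", "geneC", "geneD"],
}
print(consensus_degs(calls, min_methods=2))
```

Raising `min_methods` trades sensitivity for a shorter, higher-confidence list, which is the trade-off the abstract attributes to combining methods.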
Affiliation(s)
- Juliana Costa-Silva
- Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
| | - Douglas Domingues
- Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
- Department of Botany, Institute of Biosciences, São Paulo State University, UNESP, Rio Claro - SP, Brazil
| | - Fabricio Martins Lopes
- Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
| |
|