1
Dhaliwal J, Wagner J. STR-based feature extraction and selection for genetic feature discovery in neurological disease genes. Sci Rep 2023; 13:2480. [PMID: 36774368 PMCID: PMC9922266 DOI: 10.1038/s41598-023-29376-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/04/2022] [Accepted: 02/03/2023] [Indexed: 02/13/2023]
Abstract
Gene expression, often determined by single nucleotide polymorphisms, short repeated sequences known as short tandem repeats (STRs), structural variants, and environmental factors, provides the means for an organism to produce the gene products necessary to live. Variation in expression levels, sometimes known as enrichment patterns, has been associated with disease progression. Thus, STR enrichment patterns have recently gained interest as potential genetic markers for disease progression. However, to the best of our knowledge, no study has evaluated and explored STRs, particularly trinucleotide sequences, as machine learning features for classifying neurological disease genes for the purpose of discovering genetic features. In this paper, we propose a new metric and a novel feature extraction and selection algorithm based on statistically significant STR-based features and their respective enrichment patterns to create a statistically significant feature set. The proposed metric shows that the neurological disease family genes have a non-random AA, AT, TA, TG, and TT enrichment pattern. This is an important result, as it supports prior research that has established that certain trinucleotides, such as AAT, ATA, ATT, TAT, and TTA, are favored during protein misfolding. In contrast, trinucleotides such as TAA, TAG, and TGA are favored during premature termination codon mutations, as they are stop codons. This suggests that the metric has the potential to identify patterns that may be genetic features in a sample of neurological genes. Moreover, the practical performance and high prediction results of the statistically significant STR-based feature set indicate that variations in STR enrichment patterns can distinguish neurological disease genes. In conclusion, the proposed approach may have the potential to discover differential genetic features for other diseases.
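As a rough illustration of what a trinucleotide-based feature could look like, the sketch below computes the relative frequency of overlapping k-mers in a DNA sequence. It is not the metric or the feature selection algorithm proposed in the paper, which the abstract does not specify; the function name `kmer_frequencies` and the toy sequence are assumptions made for illustration only.

```python
from collections import Counter


def kmer_frequencies(seq: str, k: int = 3) -> dict[str, float]:
    """Relative frequency of each overlapping k-mer in a DNA sequence.

    Hypothetical helper: a plain frequency count, not the enrichment
    metric described in the abstract.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Skip windows that contain ambiguous bases such as N.
    kmers = [m for m in kmers if set(m) <= set("ACGT")]
    counts = Counter(kmers)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}


# Toy sequence, chosen only to make the counts easy to follow.
freqs = kmer_frequencies("ATTATTATTGCA")
for kmer, freq in sorted(freqs.items()):
    print(kmer, round(freq, 2))
```

Frequencies of this kind could then be compared between a disease gene set and a background set, but how the paper tests for statistical significance is not stated here.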
Affiliation(s)
- Jasbir Dhaliwal
- Faculty of Information Technology, Monash University, Clayton, VIC, 3800, Australia.
- John Wagner
- PsychoGenics Inc., Paramus, New Jersey, 07652, United States of America
2
Korsching E, Matschke J, Hotfilder M. Splice variants denote differences between a cancer stem cell side population of EWSR1‑ERG‑based Ewing sarcoma cells, its main population and EWSR1‑FLI‑based cells. Int J Mol Med 2022; 49:39. [PMID: 35088879 PMCID: PMC8815407 DOI: 10.3892/ijmm.2022.5094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2021] [Accepted: 12/17/2021] [Indexed: 11/06/2022]
Abstract
Ewing sarcoma is a challenging cancer entity, which, besides the characteristic presence of a fusion gene, is driven by multiple alternative splicing events. So far, splice variants in Ewing sarcoma cells have mainly been analyzed for EWSR1‑FLI1. The present study provided a comprehensive alternative splicing study on CADO‑ES1, a Ewing model cell line for an EWSR1‑ERG fusion gene. Based on a well‑characterized RNA‑sequencing dataset with extensive control mechanisms across all levels of analysis, the differentially spliced genes in Ewing cancer stem cells were ATP13A3 and EPB41, while the main population was defined by ACADVL, NOP58 and TSPAN3. All alternatively spliced genes were further characterized by their Gene Ontology (GO) terms and by their membership in known protein complexes. These results confirm and extend previous studies towards a systematic whole‑transcriptome analysis. A highlight is the striking segregation of GO terms associated with five basic splice events. This mechanistic insight, together with a coherent integration of all observations with prior knowledge, indicates that EWSR1‑ERG is truly a close twin to EWSR1‑FLI1, but still exhibits certain individuality. Thus, the present study provided a measure of variability in Ewing sarcoma, whose understanding is essential both for clinical procedures and basic mechanistic insight.
Affiliation(s)
- Eberhard Korsching
- Institute of Bioinformatics, Faculty of Medicine, University of Münster, D‑48149 Münster, Germany
- Julian Matschke
- Institute of Bioinformatics, Faculty of Medicine, University of Münster, D‑48149 Münster, Germany
- Marc Hotfilder
- Department of Pediatric Hematology and Oncology, University Hospital Münster, D‑48149 Münster, Germany
3
Sukhadia SS, Tyagi A, Venkataraman V, Mukherjee P, Prasad P, Gevaert O, Nagaraj SH. ImaGene: a web-based software platform for tumor radiogenomic evaluation and reporting. Bioinformatics Advances 2022; 2:vbac079. [PMID: 36699376 PMCID: PMC9714320 DOI: 10.1093/bioadv/vbac079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2022] [Revised: 09/26/2022] [Accepted: 11/09/2022] [Indexed: 11/12/2022]
Abstract
Summary: Radiographic imaging techniques provide insight into the imaging features of tumor regions of interest, while immunohistochemistry and sequencing techniques performed on biopsy samples yield omics data. Relationships between tumor genotype and phenotype can be identified from these data through traditional correlation analyses and artificial intelligence (AI) models. However, the radiogenomics community lacks a unified software platform with which to conduct such analyses in a reproducible manner. To address this gap, we developed ImaGene, a web-based platform that takes tumor omics and imaging datasets as inputs, performs correlation analysis between them, and constructs AI models. ImaGene has several modifiable configuration parameters and produces a report displaying model diagnostics. To demonstrate the utility of ImaGene, we utilized data for invasive breast carcinoma (IBC) and head and neck squamous cell carcinoma (HNSCC) and identified potential associations between imaging features and nine genes (WT1, LGI3, SP7, DSG1, ORM1, CLDN10, CST1, SMTNL2, and SLC22A31) for IBC and eight genes (NR0B1, PLA2G2A, MAL, CLDN16, PRDM14, VRTN, LRRN1, and MECOM) for HNSCC. ImaGene has the potential to become a standard platform for radiogenomic tumor analyses due to its ease of use, flexibility, and reproducibility, playing a central role in the establishment of an emerging radiogenomic knowledge base. Availability and implementation: www.ImaGene.pgxguide.org, https://github.com/skr1/Imagene.git. Supplementary information: Supplementary data are available at https://github.com/skr1/Imagene.git.
Affiliation(s)
- Shrey S Sukhadia
- Centre for Genomics and Personalised Health, Queensland University of Technology, Brisbane, QLD 4000, Australia; Translational Research Institute, Brisbane, QLD 4000, Australia
- Aayush Tyagi
- Yardi School of Artificial Intelligence, Indian Institute of Technology, New Delhi 110016, India
- Vivek Venkataraman
- Centre for Genomics and Personalised Health, Queensland University of Technology, Brisbane, QLD 4000, Australia; Translational Research Institute, Brisbane, QLD 4000, Australia
- Pritam Mukherjee
- Stanford Center for Biomedical Informatics Research, Department of Medicine and Biomedical Data Science, Stanford University, Stanford, CA 94305-5101, USA
- Pratosh Prasad
- Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore 560012, India
- Olivier Gevaert
- Stanford Center for Biomedical Informatics Research, Department of Medicine and Biomedical Data Science, Stanford University, Stanford, CA 94305-5101, USA
- Shivashankar H Nagaraj
- Centre for Genomics and Personalised Health, Queensland University of Technology, Brisbane, QLD 4000, Australia; Translational Research Institute, Brisbane, QLD 4000, Australia
4
Baillif B, Wichard J, Méndez-Lucio O, Rouquié D. Exploring the Use of Compound-Induced Transcriptomic Data Generated From Cell Lines to Predict Compound Activity Toward Molecular Targets. Front Chem 2020; 8:296. [PMID: 32391323 PMCID: PMC7191531 DOI: 10.3389/fchem.2020.00296] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 12/02/2019] [Accepted: 03/25/2020] [Indexed: 12/17/2022]
Abstract
Pharmaceutical or phytopharmaceutical molecules rely on the interaction with one or more specific molecular targets to induce their anticipated biological responses. Nonetheless, these compounds are also prone to interact with many other non-intended biological targets, also known as off-targets. Unfortunately, off-target identification is difficult and expensive. Consequently, QSAR models predicting the activity on a target have gained importance in drug discovery and in the de-risking of chemicals. However, only a restricted number of targets are well characterized and hold enough data to build such in silico models. A good alternative to individual target evaluations is to use integrative evaluations such as transcriptomics obtained from compound-induced gene expression measurements derived from cell cultures. The advantage of these experiments is that they capture the consequences of the interaction of compounds with many possible molecular targets and biological pathways, without any constraints concerning the chemical space. In this work, we assessed the value of a large public dataset of compound-induced transcriptomic data for predicting compound activity on a selection of 69 molecular targets. We compared these descriptors with other QSAR descriptors, namely the Morgan fingerprints (similar to extended-connectivity fingerprints). Depending on the target, active compounds could show similar signatures in one or multiple cell lines, whether these active compounds shared similar or different chemical structures. Random forest models using gene expression signatures performed similarly to or better than counterpart models built with Morgan fingerprints for 25% of the target prediction tasks. These performances occurred mostly using signatures produced in cell lines showing similar signatures for active compounds toward the considered target. We show that compound-induced transcriptomic data could represent a great opportunity for target prediction, making it possible to overcome the chemical space limitation of QSAR models.
Affiliation(s)
- Joerg Wichard
- Department of Genetic Toxicology, Bayer AG, Berlin, Germany
- Oscar Méndez-Lucio
- Bayer SAS, Bayer CropScience, Sophia Antipolis, France; Bloomoon, Villeurbanne, France
- David Rouquié
- Bayer SAS, Bayer CropScience, Sophia Antipolis, France
5
Caracausi M, Piovesan A, Antonaros F, Strippoli P, Vitale L, Pelleri MC. Systematic identification of human housekeeping genes possibly useful as references in gene expression studies. Mol Med Rep 2017; 16:2397-2410. [PMID: 28713914 PMCID: PMC5548050 DOI: 10.3892/mmr.2017.6944] [Citation(s) in RCA: 50] [Impact Index Per Article: 6.3] [Received: 11/15/2016] [Accepted: 03/16/2017] [Indexed: 12/21/2022]
Abstract
The ideal reference, or control, gene for the study of gene expression in a given organism should be expressed at a medium-high level for easy detection, should be expressed at a constant/stable level throughout different cell types and within the same cell type undergoing different treatments, and should maintain these features across as many different tissues of the organism as possible. From a biological point of view, these theoretical requirements of an ideal reference gene appear to be best suited to housekeeping (HK) genes. Recent advancements in the quality and completeness of human expression microarray data and in their statistical analysis may provide new clues toward the quantitative standardization of human gene expression studies in biology and medicine, both cross- and within-tissue. The systematic approach used by the present study is based on the Transcriptome Mapper tool and exploits the automated reassignment of probes to corresponding genes, intra- and inter-sample normalization, elaboration and representation of gene expression values in linear form within an indexed and searchable database with a graphical interface recording quantitative levels of expression, expression variability and cross-tissue width of expression for more than 31,000 transcripts. The present study conducted a meta-analysis of a pool of 646 expression profile data sets from 54 different human tissues and identified actin γ 1 as the HK gene that best fits the combination of all the traditional criteria to be used as a reference gene for general use; two ribosomal protein genes, RPS18 and RPS27, and one aquaporin gene, POM121 transmembrane nucleoporin C, were also identified. The present study also provided a list of tissue- and organ-specific genes that may be most suited for the following individual tissues/organs: adipose tissue, bone marrow, brain, heart, kidney, liver, lung, ovary, skeletal muscle and testis; in these cases it also provided a representative, quantitative portrait of the relative, typical gene-expression profile in the form of searchable database tables.
Affiliation(s)
- Maria Caracausi
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, I‑40126 Bologna, Italy
- Allison Piovesan
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, I‑40126 Bologna, Italy
- Francesca Antonaros
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, I‑40126 Bologna, Italy
- Pierluigi Strippoli
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, I‑40126 Bologna, Italy
- Lorenza Vitale
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, I‑40126 Bologna, Italy
- Maria Chiara Pelleri
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, I‑40126 Bologna, Italy
6
Mao H, Chen K, Zhu X, Luo Q, Zhao J, Li W, Wu X, Xu H. Identification of suitable reference genes for quantitative real-time PCR normalization in blotched snakehead Channa maculata. Journal of Fish Biology 2017; 90:2312-2322. [PMID: 28386932 DOI: 10.1111/jfb.13308] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 10/26/2016] [Accepted: 03/06/2017] [Indexed: 06/07/2023]
Abstract
A systematic study was conducted to identify reliable reference genes for normalization of gene expression analysis in the blotched snakehead Channa maculata under normal physiological conditions. First, the partial complementary (c)DNAs of nine candidate reference genes (actb, tmem104, ube2l3, ef1α, churc1, tmem256, rpl13a, sep15 and g6pd) were cloned from C. maculata. The expression levels of these genes were then assessed in embryos of different developmental stages and various tissue types of adult fish using quantitative real-time (qrt-)PCR. The RefFinder algorithm was used to evaluate the expression stability of these genes based on their cycle-threshold (Ct) values in the qrt-PCR analysis. Results showed that there was no single best reference gene for all stages of embryos and adult tissues tested. Furthermore, among the nine candidate genes tested, actb and tmem104 were the most stable reference genes across adult tissue types, while sep15 and tmem256 were the most stable across developmental stages of embryos. These stable reference genes are recommended for normalization of gene expression analysis in C. maculata.
Affiliation(s)
- H Mao
- Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, 510380, China
- College of Fisheries and Life Science, Shanghai Ocean University, 999 Huchenghuan Road, Shanghai, 201306, China
- K Chen
- Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, 510380, China
- College of Fisheries and Life Science, Shanghai Ocean University, 999 Huchenghuan Road, Shanghai, 201306, China
- X Zhu
- Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, 510380, China
- Q Luo
- Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, 510380, China
- J Zhao
- Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, 510380, China
- W Li
- Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, 510380, China
- X Wu
- Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, 510380, China
- College of Fisheries and Life Science, Shanghai Ocean University, 999 Huchenghuan Road, Shanghai, 201306, China
- H Xu
- Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, 510380, China
- College of Fisheries and Life Science, Shanghai Ocean University, 999 Huchenghuan Road, Shanghai, 201306, China
7
BIG Data Center Members. The BIG Data Center: from deposition to integration to translation. Nucleic Acids Res 2016; 45:D18-D24. [PMID: 27899658 PMCID: PMC5210546 DOI: 10.1093/nar/gkw1060] [Citation(s) in RCA: 438] [Impact Index Per Article: 48.7] [Received: 09/14/2016] [Revised: 10/19/2016] [Accepted: 10/21/2016] [Indexed: 02/06/2023]
Abstract
Biological data are generated at unprecedentedly exponential rates, posing considerable challenges in big data deposition, integration and translation. The BIG Data Center, established at Beijing Institute of Genomics (BIG), Chinese Academy of Sciences, provides a suite of database resources, including (i) Genome Sequence Archive, a data repository specialized for archiving raw sequence reads, (ii) Gene Expression Nebulas, a data portal of gene expression profiles based entirely on RNA-Seq data, (iii) Genome Variation Map, a comprehensive collection of genome variations for featured species, (iv) Genome Warehouse, a centralized resource housing genome-scale data with particular focus on economically important animals and plants, (v) Methylation Bank, an integrated database of whole-genome single-base resolution methylomes and (vi) Science Wikis, a central access point for biological wikis developed for community annotations. The BIG Data Center is dedicated to constructing and maintaining biological databases through big data integration and value-added curation, conducting basic research to translate big data into big knowledge and providing freely open access to a variety of data resources in support of worldwide research activities in both academia and industry. All of these resources are publicly available and can be found at http://bigd.big.ac.cn.
Affiliation(s)
- BIG Data Center Members
- To whom correspondence should be addressed: Zhang Zhang. Tel: +86 10 8409 7261; Fax: +86 10 8409 7720.
8
Identifying reproducible cancer-associated highly expressed genes with important functional significances using multiple datasets. Sci Rep 2016; 6:36227. [PMID: 27796338 PMCID: PMC5086981 DOI: 10.1038/srep36227] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/18/2016] [Accepted: 10/12/2016] [Indexed: 01/08/2023]
Abstract
Identifying differentially expressed (DE) genes between cancer and normal tissues is of basic importance for studying cancer mechanisms. However, current methods, such as the commonly used Significance Analysis of Microarrays (SAM), are biased to genes with low expression levels. Recently, we proposed an algorithm, named the pairwise difference (PD) algorithm, to identify highly expressed DE genes based on reproducibility evaluation of top-ranked expression differences between paired technical replicates of cells under two experimental conditions. In this study, we extended the application of the algorithm to the identification of DE genes between two types of tissue samples (biological replicates) based on several independent datasets or sub-datasets of a dataset, by constructing multiple paired average gene expression profiles for the two types of samples. Using multiple datasets for lung and esophageal cancers, we demonstrated that PD could identify many DE genes highly expressed in both cancer and normal tissues that tended to be missed by the commonly used SAM. These highly expressed DE genes, including many housekeeping genes, were significantly enriched in many conservative pathways, such as ribosome, proteasome, phagosome and TNF signaling pathways with important functional significances in oncogenesis.
9
Xu H, Li C, Zeng Q, Agrawal I, Zhu X, Gong Z. Genome-wide identification of suitable zebrafish Danio rerio reference genes for normalization of gene expression data by RT-qPCR. Journal of Fish Biology 2016; 88:2095-2110. [PMID: 27126589 DOI: 10.1111/jfb.12915] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 09/20/2015] [Accepted: 01/18/2016] [Indexed: 06/05/2023]
Abstract
In this study, to systematically identify the most stably expressed genes for internal reference in zebrafish Danio rerio investigations, 37 D. rerio transcriptomic datasets (both RNA sequencing and microarray data) were collected from the Gene Expression Omnibus (GEO) database and unpublished data, and gene expression variations were analysed under three experimental conditions: tissue types, developmental stages and chemical treatments. Forty-four putative candidate genes were identified with a coefficient of variation (c.v.) < 0.2 across all datasets. Following clustering into different functional groups, 21 genes, in addition to four conventional housekeeping genes (eef1a1l1, b2m, hrpt1l and actb1), were selected from different functional groups for further quantitative real-time (qrt-)PCR validation using 25 RNA samples from different adult tissues, developmental stages and chemical treatments. The qrt-PCR data were then analysed for gene expression stability using the statistical algorithm RefFinder. Several new candidate genes showed better expression stability than the conventional housekeeping genes in all three categories. It was found that sep15 and metap1 were the top two stable genes for tissue types, ube2a and tmem50a the top two for different developmental stages, and rpl13a and rp1p0 the top two for chemical treatments. Thus, based on the extensive transcriptomic analyses and qrt-PCR validation, these new reference genes are recommended for normalization of D. rerio qrt-PCR data for the three different experimental conditions, respectively.
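The coefficient-of-variation screen mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not say how expression values were normalized or whether a sample or population standard deviation was used, so the population form is assumed here, and the function name `stable_genes` and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def stable_genes(expression: dict[str, list[float]], cutoff: float = 0.2) -> list[str]:
    """Return genes whose coefficient of variation (std / mean) is below cutoff.

    Assumption: population standard deviation; the source does not specify.
    """
    stable = []
    for gene, values in expression.items():
        mu = mean(values)
        if mu <= 0:
            # The c.v. is undefined or meaningless for non-positive means.
            continue
        cv = pstdev(values) / mu
        if cv < cutoff:
            stable.append(gene)
    return sorted(stable)


# Toy expression values across four samples (made up for illustration).
toy = {
    "gene_a": [10.0, 11.0, 9.0, 10.0],   # low variation
    "gene_b": [1.0, 20.0, 5.0, 40.0],    # high variation
}
print(stable_genes(toy))
```

In the toy data only `gene_a` passes the 0.2 cutoff; a real screen would apply the same test per dataset and keep genes that pass in all of them.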
Affiliation(s)
- H Xu
- Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou 510380, China
- Department of Biological Sciences, National University of Singapore, Singapore 117543, Singapore
- C Li
- Department of Biological Sciences, National University of Singapore, Singapore 117543, Singapore
- Q Zeng
- Department of Biological Sciences, National University of Singapore, Singapore 117543, Singapore
- I Agrawal
- Department of Biological Sciences, National University of Singapore, Singapore 117543, Singapore
- X Zhu
- Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou 510380, China
- Z Gong
- Department of Biological Sciences, National University of Singapore, Singapore 117543, Singapore
10
Sheng X, Wu J, Sun Q, Li X, Xian F, Sun M, Fang W, Chen M, Yu J, Xiao J. MTD: a mammalian transcriptomic database to explore gene expression and regulation. Brief Bioinform 2016; 18:28-36. [PMID: 26822098 PMCID: PMC5221423 DOI: 10.1093/bib/bbv117] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 12/02/2015] [Revised: 12/14/2015] [Indexed: 02/01/2023]
Abstract
A systematic transcriptome survey is essential for the characterization and comprehension of the molecular basis underlying phenotypic variations. Recently developed RNA-seq methodology has facilitated efficient data acquisition and information mining of transcriptomes in multiple tissues/cell lines. Current mammalian transcriptomic databases are either tissue-specific or species-specific, and they lack in-depth comparative features across tissues and species. Here, we present a mammalian transcriptomic database (MTD) that is focused on mammalian transcriptomes, and the current version contains data from humans, mice, rats and pigs. Regarding the core features, the MTD browses genes based on their neighboring genomic coordinates or joint KEGG pathway and provides expression information on exons, transcripts and genes by integrating them into a genome browser. We developed a novel nomenclature for each transcript that considers its genomic position and transcriptional features. The MTD allows a flexible search of genes or isoforms with user-defined transcriptional characteristics and provides both table-based descriptions and associated visualizations. To elucidate the dynamics of gene expression regulation, the MTD also enables comparative transcriptomic analysis in both intraspecies and interspecies manner. The MTD thus constitutes a valuable resource for transcriptomic and evolutionary studies. The MTD is freely accessible at http://mtd.cbi.ac.cn.
Affiliation(s)
- Xin Sheng
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Corresponding authors: Jingfa Xiao, CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China. Tel.: +86-10-84097443; Jun Yu, CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China. Tel.: +86-10-84097898.
- Jiayan Wu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Qianqian Sun
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Xue Li
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Feng Xian
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Manman Sun
- Hunan Agricultural University, Hunan, China
- Wan Fang
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Meili Chen
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Jun Yu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
11
Identification and analysis of house-keeping and tissue-specific genes based on RNA-seq data sets across 15 mouse tissues. Gene 2015; 576:560-70. [PMID: 26551299 DOI: 10.1016/j.gene.2015.11.003] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 06/17/2015] [Revised: 10/27/2015] [Accepted: 11/03/2015] [Indexed: 12/31/2022]
Abstract
Recently, RNA-seq has become a widely used technology for transcriptome profiling owing to its single-base accuracy and high throughput. In this study, we applied a computational approach to an integrated RNA-seq dataset across 15 normal mouse tissues and assigned 8408 house-keeping (HK) genes and 2581 tissue-specific (TS) genes from the UCSC RefGene annotation. Apart from some basic genomic features, we also performed expression, function and pathway analyses with clustering, DAVID and Ingenuity Pathway Analysis, indicating the physiological connections (tissues) and diverse biological roles of HK genes (fundamental processes) and TS genes (tissue-corresponding processes). Moreover, we used RT-PCR to test 18 candidate HK genes and identified a novel list of highly stable internal control genes: Ywhae, Ddb1, Eif4h, etc. In summary, this study provides a new HK gene and TS gene resource for further genetic and evolutionary research and helps us better understand morphogenesis and biological diversity in the mouse.
12
Abstract
The search for human housekeeping (HK) genes has been a long quest since the emergence of transcriptomics and is instrumental in understanding the structure of the genome and the fundamentals of biological processes. The resolved genes are frequently used in evolution studies and as normalization standards in quantitative gene-expression analysis. Within the past 20 years, more than a dozen HK-gene studies have been conducted, yet none of them sampled human tissues completely. We believe an integration of these results will help remove false-positive genes owing to inadequate sampling. Surprisingly, we find only one common gene across 15 examined HK-gene datasets comprising 187 different tissue and cell types. Our subsequent analyses suggest that it might not be appropriate to rigidly define HK genes as expressed in all tissue types that have diverse developmental, physiological, and pathological states. It might be beneficial to use more robustly identified HK functions as filtering criteria, in which the representing genes can be a subset of the genome. These genes are not necessarily the same, and perhaps need not be the same, everywhere in our body.
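The kind of cross-dataset integration described here can be illustrated with a small sketch that counts how many gene lists each gene appears in and keeps those meeting a support threshold. This is only an assumed, simplified reading of "common gene across datasets"; the study names, gene lists and the function names `gene_support` and `common_genes` below are hypothetical and are not taken from the paper.

```python
from collections import Counter


def gene_support(datasets: dict[str, set[str]]) -> Counter:
    """Count, for each gene, the number of datasets that list it."""
    return Counter(gene for genes in datasets.values() for gene in genes)


def common_genes(datasets: dict[str, set[str]], min_count: int) -> set[str]:
    """Genes reported by at least min_count of the datasets."""
    support = gene_support(datasets)
    return {gene for gene, n in support.items() if n >= min_count}


# Toy HK-gene lists, made up for illustration only.
toy = {
    "study_1": {"ACTB", "GAPDH", "RPS18"},
    "study_2": {"ACTB", "RPS18"},
    "study_3": {"ACTB", "GAPDH"},
}
print(common_genes(toy, min_count=len(toy)))  # strict intersection
print(sorted(common_genes(toy, min_count=2)))  # relaxed threshold
```

Requiring presence in every list shrinks the result quickly as lists are added, which is consistent with the abstract's observation that a strict definition leaves very few genes.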
Affiliation(s)
- Yijuan Zhang
- Department of Chemistry and Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada
| | - Ding Li
- Department of Chemistry and Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada
| | - Bingyun Sun
- Department of Chemistry and Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada
| |
Collapse
|
13
|
Tian S, Wang C, An MW. Test on existence of histology subtype-specific prognostic signatures among early stage lung adenocarcinoma and squamous cell carcinoma patients using a Cox-model based filter. Biol Direct 2015; 10:15. [PMID: 25887039 PMCID: PMC4415297 DOI: 10.1186/s13062-015-0051-z] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 08/12/2014] [Accepted: 03/24/2015] [Indexed: 01/18/2023] Open
Abstract
Background: Non-small cell lung cancer (NSCLC) is the predominant histological type of lung cancer, accounting for up to 85% of cases. Disease stage is commonly used to determine adjuvant treatment eligibility of NSCLC patients; however, it is an imprecise predictor of the prognosis of an individual patient. Currently, many researchers resort to microarray technology for identifying relevant genetic prognostic markers, with particular attention to trimming or extending a Cox regression model. Adenocarcinoma (AC) and squamous cell carcinoma (SCC) are two major histology subtypes of NSCLC. It has been demonstrated that fundamental differences exist in their underlying mechanisms, which motivated us to postulate the existence of specific genes related to the prognosis of each histology subtype. Results: In this article, we propose a simple filter feature selection algorithm with a Cox regression model as the base. Applying this method to real-world microarray data identifies a histology-specific prognostic gene signature. Furthermore, the resulting 32-gene (32/12 for AC/SCC) prognostic signature for early-stage AC and SCC samples has superior predictive ability relative to two relevant prognostic signatures, and has comparable performance with signatures obtained by applying two state-of-the-art algorithms separately to AC and SCC samples. Conclusions: Our proposal is conceptually simple and straightforward to implement. Furthermore, it can be easily adapted and applied to a range of other research settings. Reviewers: This article was reviewed by Leonid Hanin (nominated by Dr. Lev Klebanov), Limsoon Wong and Jun Yu. Electronic supplementary material: The online version of this article (doi:10.1186/s13062-015-0051-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Suyan Tian
- Division of Clinical Epidemiology, First Hospital of Jilin University, 71 Xinmin Street, Changchun, Jilin, 130021, China.
- Chi Wang
- Department of Biostatistics and Markey Cancer Center, University of Kentucky, 800 Rose St., Lexington, KY, 40536, USA.
- Ming-Wen An
- Department of Mathematics, Vassar College, Poughkeepsie, NY, 12604, USA.
14
Ersahin T, Carkacioglu L, Can T, Konu O, Atalay V, Cetin-Atalay R. Identification of novel reference genes based on MeSH categories. PLoS One 2014; 9:e93341. [PMID: 24682035 PMCID: PMC3969360 DOI: 10.1371/journal.pone.0093341] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 10/10/2013] [Accepted: 03/03/2014] [Indexed: 11/18/2022] Open
Abstract
Transcriptome experiments are performed to assess protein abundance through mRNA expression analysis. Expression levels of genes vary depending on the experimental conditions and the cell response. Transcriptome data must be diverse and yet comparable in reference to stably expressed genes, even if they are generated from different experiments on the same biological context in various laboratories. In this study, expression patterns of 9090 microarray samples grouped into 381 NCBI-GEO datasets were investigated to identify novel candidate reference genes using randomizations and Receiver Operating Characteristic (ROC) curves. The analysis demonstrated that cell-type-specific reference gene sets display less variability than a united set for all tissues. Therefore, constitutively and stably expressed, origin-specific novel reference gene sets were identified based on their coefficient of variation and percentage of occurrence in all GEO datasets, which were classified using Medical Subject Headings (MeSH). A large number of MeSH-grouped reference gene lists are presented as novel tissue-specific reference gene lists. The 17 most commonly observed genes in these sets were compared for their expression in 8 hepatocellular, 5 breast, and 3 colon carcinoma cells by RT-qPCR to verify tissue specificity. Indeed, the commonly used housekeeping genes GAPDH, Actin, and EEF2 had tissue-specific variations, whereas several ribosomal genes were among the most stably expressed genes in vitro. Our results confirm that two or more reference genes should be used in combination for differential expression analysis of large-scale data obtained from microarray or next-generation sequencing studies. Therefore, context-dependent reference gene sets, as presented in this study, are required for normalization of expression data from diverse technological backgrounds.
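The selection criterion named in this abstract (coefficient of variation combined with percentage of occurrence across datasets) can be illustrated with a small sketch. This is not the authors' pipeline: the function names, the thresholds `max_cv` and `min_occurrence`, the toy gene names, and the reading of "occurrence" as the fraction of datasets in which a gene is stable are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return the population standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean expression is zero; CV is undefined")
    return pstdev(values) / m


def candidate_reference_genes(datasets, max_cv=0.1, min_occurrence=0.8):
    """Pick genes whose CV is low in a large enough share of datasets.

    datasets: list of dicts mapping gene name -> list of expression values.
    max_cv and min_occurrence are illustrative thresholds (assumptions),
    not values taken from the study.
    """
    genes = set().union(*(d.keys() for d in datasets))
    selected = []
    for gene in sorted(genes):
        # Count the datasets in which this gene is present and stable.
        stable = sum(
            1
            for d in datasets
            if gene in d and coefficient_of_variation(d[gene]) <= max_cv
        )
        if stable / len(datasets) >= min_occurrence:
            selected.append(gene)
    return selected


# Hypothetical toy data: two "datasets" with two made-up genes.
toy = [
    {"RPL_A": [100, 102, 98], "GENE_B": [10, 50, 90]},
    {"RPL_A": [200, 198, 202], "GENE_B": [30, 31, 29]},
]
print(candidate_reference_genes(toy, max_cv=0.1, min_occurrence=1.0))
```

With these toy values only the hypothetical gene `RPL_A` is stable in both datasets, so it is the only one selected when `min_occurrence=1.0`; lowering the occurrence threshold admits genes that are stable in only some datasets.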
Affiliation(s)
- Tulin Ersahin
- Department of Molecular Biology and Genetics, Bilkent University, Ankara, Turkey
- Levent Carkacioglu
- Computer Engineering Department, Middle East Technical University, Ankara, Turkey
- Tolga Can
- Computer Engineering Department, Middle East Technical University, Ankara, Turkey
- Ozlen Konu
- Department of Molecular Biology and Genetics, Bilkent University, Ankara, Turkey
- Volkan Atalay
- Computer Engineering Department, Middle East Technical University, Ankara, Turkey
- Rengul Cetin-Atalay
- Department of Molecular Biology and Genetics, Bilkent University, Ankara, Turkey
15
Wei L, Lian B, Zhang Y, Li W, Gu J, He X, Xie L. Application of microRNA and mRNA expression profiling on prognostic biomarker discovery for hepatocellular carcinoma. BMC Genomics 2014; 15 Suppl 1:S13. [PMID: 24564407 PMCID: PMC4046763 DOI: 10.1186/1471-2164-15-s1-s13] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Indexed: 01/05/2023] Open
Abstract
Background: Hepatocellular carcinoma (HCC) is one of the most highly malignant and lethal cancers in the world. Its pathogenesis has been reported to be multi-factorial, and the molecular carcinogenesis of HCC cannot be attributed to just a few individual genes. Based on the microRNA and mRNA expression profiling of normal liver tissues, pericancerous hepatocellular tissues, and hepatocellular carcinoma tissues, we attempted to find prognosis-related gene sets for HCC patients. Results: We identified differentially expressed genes (DEGs) from three comparisons: Cancer/Normal, Cancer/Pericancerous, and Pericancerous/Normal. Gene set enrichment analysis (GSEA) was performed. Based on the enriched gene sets of GO terms, pathways, and transcription factor targets, it was found that genome instability and cell proliferation increased while metabolism and differentiation decreased in HCC tissues. The expression profile of DEGs in each enriched gene set was correlated with the postoperative survival time of HCC patients. Nine gene sets were found to have prognostic correlation. Furthermore, after substituting DEG-targeting microRNAs for the DEG members of each gene set, two gene sets with microRNA expression profiles were obtained that had prognostic potential. Conclusions: The malignancy of HCC could be represented by gene sets, and pericancerous liver exhibits important characteristics of liver cancer. The expression level of gene sets not only in HCC but also in the pericancerous liver showed potential for prognosis, implying an option for HCC prognosis at an early stage. Additionally, the gene-targeting microRNA expression profiles also showed prognostic potential, demonstrating that various genes and microRNAs contribute to the multi-factorial molecular pathogenesis of HCC. Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-15-S1-S13) contains supplementary material, which is available to authorized users.
16
Eisenberg E, Levanon EY. Human housekeeping genes, revisited. Trends Genet 2013; 29:569-74. [PMID: 23810203 DOI: 10.1016/j.tig.2013.05.010] [Citation(s) in RCA: 865] [Impact Index Per Article: 72.1] [Received: 04/15/2013] [Revised: 05/06/2013] [Accepted: 05/30/2013] [Indexed: 10/26/2022]
Abstract
Housekeeping genes are involved in basic cell maintenance and, therefore, are expected to maintain constant expression levels in all cells and conditions. Identification of these genes facilitates exposure of the underlying cellular infrastructure and increases understanding of various structural genomic features. In addition, housekeeping genes are instrumental for calibration in many biotechnological applications and genomic studies. Advances in our ability to measure RNA expression have resulted in a gradual increase in the number of identified housekeeping genes. Here, we describe housekeeping gene detection in the era of massive parallel sequencing and RNA-seq. We emphasize the importance of expression at a constant level and provide a list of 3804 human genes that are expressed uniformly across a panel of tissues. Several exceptionally uniform genes are singled out for future experimental use, such as RT-PCR control genes. Finally, we discuss both the ways in which current technology can meet some of the obstacles encountered in the past and several as yet unmet challenges.
Affiliation(s)
- Eli Eisenberg
- Raymond and Beverly Sackler School of Physics and Astronomy, Tel-Aviv University, Tel Aviv 69978, Israel.