1
|
Advances in AI and machine learning for predictive medicine. J Hum Genet 2024:10.1038/s10038-024-01231-y. [PMID: 38424184 DOI: 10.1038/s10038-024-01231-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 02/04/2024] [Accepted: 02/12/2024] [Indexed: 03/02/2024]
Abstract
The field of omics, driven by advances in high-throughput sequencing, faces a data explosion. This abundance of data offers unprecedented opportunities for predictive modeling in precision medicine, but also presents formidable challenges in data analysis and interpretation. Traditional machine learning (ML) techniques have been partly successful in generating predictive models for omics analysis but exhibit limitations in handling potential relationships within the data for more accurate prediction. This review explores a revolutionary shift in predictive modeling through the application of deep learning (DL), specifically convolutional neural networks (CNNs). Using transformation methods such as DeepInsight, omics data with independent variables in tabular (table-like, including vector) form can be turned into image-like representations, enabling CNNs to capture latent features effectively. This approach not only enhances predictive power but also leverages transfer learning, reducing computational time, and improving performance. However, integrating CNNs in predictive omics data analysis is not without challenges, including issues related to model interpretability, data heterogeneity, and data size. Addressing these challenges requires a multidisciplinary approach, involving collaborations between ML experts, bioinformatics researchers, biologists, and medical doctors. This review illuminates these complexities and charts a course for future research to unlock the full predictive potential of CNNs in omics data analysis and related fields.
|
2
|
scDeepInsight: a supervised cell-type identification method for scRNA-seq data with deep learning. Brief Bioinform 2023; 24:bbad266. [PMID: 37523217 PMCID: PMC10516353 DOI: 10.1093/bib/bbad266] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 03/09/2023] [Revised: 06/12/2023] [Accepted: 07/04/2023] [Indexed: 08/01/2023] Open
Abstract
Annotation of cell-types is a critical step in the analysis of single-cell RNA sequencing (scRNA-seq) data that allows the study of heterogeneity across multiple cell populations. Currently, this is most commonly done using unsupervised clustering algorithms, which project single-cell expression data into a lower dimensional space and then cluster cells based on their distances from each other. However, as these methods do not use reference datasets, they can only achieve a rough classification of cell-types, and it is difficult to improve the recognition accuracy further. To effectively solve this issue, we propose a novel supervised annotation method, scDeepInsight. The scDeepInsight method is capable of performing manifold assignments. It is competent in executing data integration through batch normalization, performing supervised training on the reference dataset, doing outlier detection and annotating cell-types on query datasets. Moreover, it can help identify active genes or marker genes related to cell-types. The training of the scDeepInsight model is performed in a unique way. Tabular scRNA-seq data are first converted to corresponding images through the DeepInsight methodology. DeepInsight can create a trainable image transformer to convert non-image RNA data to images by comprehensively comparing interrelationships among multiple genes. Subsequently, the converted images are fed into convolutional neural networks such as EfficientNet-b3. This enables automatic feature extraction to identify the cell-types of scRNA-seq samples. We benchmarked scDeepInsight with six other mainstream cell annotation methods. The average accuracy rate of scDeepInsight reached 87.5%, which is more than 7% higher compared with the state-of-the-art methods.
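The tabular-to-image step mentioned in this abstract can be illustrated with a minimal sketch. It assumes that each gene already has a 2-D coordinate (for example from a dimensionality-reduction step that is not shown here) and covers only the pixel-mapping idea: coordinates are scaled onto a grid, and genes that land on the same pixel are averaged. The function names `coords_to_pixels` and `sample_to_image` and the toy values are hypothetical and are not taken from the scDeepInsight or DeepInsight software.

```python
def coords_to_pixels(coords, size):
    """Scale 2-D feature coordinates onto a size x size pixel grid."""
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]

    def scale(value, lo, hi):
        # Degenerate axis: every feature shares the same coordinate.
        if hi == lo:
            return 0
        # Clamp so that the maximum coordinate falls on the last pixel.
        return min(int((value - lo) / (hi - lo) * size), size - 1)

    return [(scale(x, min(xs), max(xs)), scale(y, min(ys), max(ys)))
            for x, y in coords]


def sample_to_image(values, pixels, size):
    """Place one sample's feature values on the grid, averaging overlaps."""
    total = [[0.0] * size for _ in range(size)]
    count = [[0] * size for _ in range(size)]
    for value, (row, col) in zip(values, pixels):
        total[row][col] += value
        count[row][col] += 1
    return [[total[r][c] / count[r][c] if count[r][c] else 0.0
             for c in range(size)]
            for r in range(size)]


# Hypothetical toy input: four genes with made-up 2-D coordinates.
gene_coords = [(0.0, 0.0), (1.0, 1.0), (0.95, 0.98), (0.0, 1.0)]
pixels = coords_to_pixels(gene_coords, size=4)
image = sample_to_image([2.0, 4.0, 6.0, 1.0], pixels, size=4)
```

In this toy case the second and third genes share a pixel, so that pixel holds their mean. The resulting grid is the kind of image-like input a CNN could consume; the published pipeline involves further steps that this sketch leaves out.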
|
3
|
Topologically associating domain underlies tissue specific expression of long intergenic non-coding RNAs. iScience 2023; 26:106640. [PMID: 37250307 PMCID: PMC10214471 DOI: 10.1016/j.isci.2023.106640] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/14/2022] [Revised: 12/15/2022] [Accepted: 04/05/2023] [Indexed: 05/31/2023] Open
Abstract
Accumulating evidence indicates that long intergenic non-coding RNAs (lincRNAs) show more tissue-specific expression patterns than protein-coding genes (PCGs). However, although lincRNAs are subject to canonical transcriptional regulation like PCGs, the molecular basis for the specificity of their expression patterns remains unclear. Here, using expression data and coordinates of topologically associating domains (TADs) in human tissues, we show that lincRNA loci are significantly enriched in the more internal region of TADs compared to PCGs and that lincRNAs within TADs have higher tissue specificity than those outside TADs. Based on these, we propose an analytical framework to interpret transcriptional status using lincRNA as an indicator. We applied it to hypertrophic cardiomyopathy data and found disease-specific transcriptional regulation: ectopic expression of keratin at the TAD level and derepression of myocyte differentiation-related genes by E2F1 with down-regulation of LINC00881. Our results provide understanding of the function and regulation of lincRNAs according to genomic structure.
|
4
|
DeepInsight-3D architecture for anti-cancer drug response prediction with deep-learning on multi-omics. Sci Rep 2023; 13:2483. [PMID: 36774402 PMCID: PMC9922304 DOI: 10.1038/s41598-023-29644-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 11/21/2022] [Accepted: 02/08/2023] [Indexed: 02/13/2023] Open
Abstract
Modern oncology offers a wide range of treatments, and therefore choosing the best option for a particular patient is very important for an optimal outcome. Multi-omics profiling in combination with AI-based predictive models has great potential for streamlining these treatment decisions. However, these encouraging developments continue to be hampered by the very high dimensionality of the datasets in combination with insufficiently large numbers of annotated samples. Here we propose a novel deep learning-based method to predict patient-specific anticancer drug response from three types of multi-omics data. The proposed DeepInsight-3D approach relies on structured data-to-image conversion that then allows the use of convolutional neural networks, which are particularly robust to high dimensionality of the inputs while retaining capabilities to model highly complex relationships between variables. Of particular note, we demonstrate that in this formalism additional channels of an image can be effectively used to accommodate data from different omics layers while implicitly encoding the connection between them. DeepInsight-3D was able to outperform other state-of-the-art methods applied to this task. The proposed improvements can facilitate the development of better personalized treatment strategies for different cancers in the future.
|
5
|
|
6
|
Author Correction: Genomic basis for RNA alterations in cancer. Nature 2023; 614:E37. [PMID: 36697831 PMCID: PMC9931574 DOI: 10.1038/s41586-022-05596-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2023]
|
7
|
Association between high immune activity and worse prognosis in uveal melanoma and low-grade glioma in TCGA transcriptomic data. BMC Genomics 2022; 23:351. [PMID: 35525921 PMCID: PMC9078026 DOI: 10.1186/s12864-022-08586-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 07/23/2021] [Accepted: 04/25/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Immune status in the tumor microenvironment is an important determinant of cancer progression and patient prognosis. Although a higher immune activity is often associated with a better prognosis, this trend is not absolute and differs across cancer types. We aimed to give insights into why some cancers do not show better survival despite higher immunity by assessing the relationship between different biological factors, including cytotoxicity, and patient prognosis in various cancer types using RNA-seq data collected by The Cancer Genome Atlas. RESULTS: Results showed that a higher immune activity was associated with worse overall survival in patients with uveal melanoma and low-grade glioma, which are cancers of immune-privileged sites. In these cancers, epithelial or endothelial mesenchymal transition and inflammatory state as well as immune activation had a notable negative correlation with patient survival. Further analysis using additional single-cell data of uveal melanoma and glioma revealed that epithelial or endothelial mesenchymal transition was mainly induced in retinal pigment cells or endothelial cells that comprise the blood-retinal and blood-brain barriers, which are unique structures of the eye and central nervous system, respectively. Inflammation was mainly promoted by macrophages, and their infiltration increased significantly in response to immune activation. Furthermore, we found the expression of inflammatory chemokines, particularly CCL5, was strongly correlated with immune activity and associated with poor survival, particularly in these cancers, suggesting that these inflammatory mediators are potential molecular targets for therapeutics. CONCLUSIONS: In uveal melanoma and low-grade glioma, inflammation from macrophages and epithelial or endothelial mesenchymal transition are particularly associated with a poor prognosis. This implies that they loosen the structures of the blood barrier and impair homeostasis and further recruit immune cells, which could result in a feedback loop of additional inflammatory effects leading to runaway conditions.
|
8
|
Immune subtypes and neoantigen-related immune evasion in advanced colorectal cancer. iScience 2022; 25:103740. [PMID: 35128352 PMCID: PMC8800070 DOI: 10.1016/j.isci.2022.103740] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/24/2021] [Revised: 08/03/2021] [Accepted: 01/04/2022] [Indexed: 01/09/2023] Open
Abstract
Elimination of cancerous cells by the immune system is an important mechanism of protection from cancer; however, its effectiveness can be reduced owing to the development of resistance and evasion. To understand the systemic immune response in advanced untreated primary colorectal cancer (CRC), we analyze immune subtypes and immune evasion via neoantigen-related mechanisms. We identify a distinctive cancer subtype characterized by immune evasion and very poor overall survival. This subtype has fewer clonal highly expressed neoantigens and high chromosomal instability, resulting in adaptive immune resistance mediated by the immune checkpoint molecules and neoantigen presentation disorders. We also observe that neoantigen depletion caused by immunoediting and high clonal neoantigen load are correlated with a good overall survival. Our results indicate that the status of the tumor microenvironment and neoantigen composition are promising new prognostic biomarkers with potential relevance for treatment plan decisions in advanced CRC.
|
9
|
DeepFeature: feature selection in nonimage data using convolutional neural network. Brief Bioinform 2021; 22:6343526. [PMID: 34368836 PMCID: PMC8575039 DOI: 10.1093/bib/bbab297] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 05/11/2021] [Revised: 06/30/2021] [Accepted: 07/14/2021] [Indexed: 12/14/2022] Open
Abstract
Artificial intelligence methods offer exciting new capabilities for the discovery of biological mechanisms from raw data because they are able to detect vastly more complex patterns of association that cannot be captured by classical statistical tests. Among these methods, deep neural networks are currently among the most advanced approaches and, in particular, convolutional neural networks (CNNs) have been shown to perform excellently for a variety of difficult tasks. Despite that, the application of this type of network to high-dimensional omics data and, most importantly, meaningful interpretation of the results returned from such models in a biomedical context remain an open problem. Here we present an approach applying a CNN to nonimage data for feature selection. Our pipeline, DeepFeature, can both successfully transform omics data into a form that is optimal for fitting a CNN model and return sets of the most important genes used internally for computing predictions. Within the framework, the Snowfall compression algorithm is introduced to enable more elements in the fixed pixel framework, and region accumulation and element decoder are developed to find elements or genes from the class activation maps. In comparative tests for a cancer type prediction task, DeepFeature simultaneously achieved superior predictive performance and better ability to discover key pathways and biological processes meaningful for this context. Capabilities offered by the proposed framework can enable the effective use of powerful deep learning methods to facilitate the discovery of causal mechanisms in high-dimensional biomedical data.
|
10
|
Prognosis prediction model for conversion from mild cognitive impairment to Alzheimer's disease created by integrative analysis of multi-omics data. Alzheimers Res Ther 2020; 12:145. [PMID: 33172501 PMCID: PMC7656734 DOI: 10.1186/s13195-020-00716-0] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 08/19/2020] [Accepted: 10/26/2020] [Indexed: 12/14/2022]
Abstract
BACKGROUND: Mild cognitive impairment (MCI) is a precursor to Alzheimer's disease (AD), but not all MCI patients develop AD. Biomarkers for early detection of individuals at high risk for MCI-to-AD conversion are urgently required. METHODS: We used blood-based microRNA expression profiles and genomic data of 197 Japanese MCI patients to construct a prognosis prediction model based on a Cox proportional hazards model. We examined the biological significance of our findings with single nucleotide polymorphism-microRNA pairs (miR-eQTLs) by focusing on the target genes of the miRNAs. We investigated functional modules from the target genes with the occurrence of hub genes through a large-scale protein-protein interaction network analysis. We further examined the expression of the genes in 610 blood samples (271 ADs, 248 MCIs, and 91 cognitively normal elderly subjects [CNs]). RESULTS: The final prediction model, composed of 24 miR-eQTLs and three clinical factors (age, sex, and APOE4 alleles), successfully classified MCI patients into low and high risk of MCI-to-AD conversion (log-rank test P = 3.44 × 10⁻⁴) and achieved a concordance index of 0.702 on an independent test set. Four important hub genes associated with AD pathogenesis (SHC1, FOXO1, GSK3B, and PTEN) were identified in a network-based meta-analysis of miR-eQTL target genes. RNA-seq data from 610 blood samples showed statistically significant differences in PTEN expression between MCI and AD and in SHC1 expression between CN and AD (PTEN, P = 0.023; SHC1, P = 0.049). CONCLUSIONS: Our proposed model was demonstrated to be effective in MCI-to-AD conversion prediction. A network-based meta-analysis of miR-eQTL target genes identified important hub genes associated with AD pathogenesis. Accurate prediction of MCI-to-AD conversion would enable earlier intervention for MCI patients at high risk, potentially reducing conversion to AD.
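The concordance index reported in this abstract can be illustrated with a minimal sketch of Harrell's C for right-censored data. The abstract does not state which estimator the authors used, so this is one common definition rather than their implementation; the function name and the toy follow-up data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i had the event and did so
    before subject j's last follow-up time. Tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: follow-up time, conversion flag, predicted risk score.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(toy_times, toy_events, toy_risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.702 should be read.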
|
11
|
Clinical usefulness of multigene screening with phenotype-driven bioinformatics analysis for the diagnosis of patients with monogenic diabetes or severe insulin resistance. Diabetes Res Clin Pract 2020; 169:108461. [PMID: 32971154 DOI: 10.1016/j.diabres.2020.108461] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/25/2020] [Revised: 08/29/2020] [Accepted: 09/16/2020] [Indexed: 11/29/2022]
Abstract
AIMS: Monogenic diabetes is clinically heterogeneous and differs from common forms of diabetes (type 1 and 2). We aimed to investigate the clinical usefulness of a comprehensive genetic testing system, comprising targeted next-generation sequencing (NGS) with phenotype-driven bioinformatics analysis, which uses patient genotypic and phenotypic data to prioritize potentially causal variants, in patients with monogenic diabetes. METHODS: We performed targeted NGS of 383 genes associated with monogenic diabetes or common forms of diabetes in 13 Japanese patients with suspected (n = 10) or previously diagnosed (n = 3) monogenic diabetes or severe insulin resistance. We performed in silico structural analysis and phenotype-driven bioinformatics analysis of candidate variants from NGS data. RESULTS: Among the patients suspected of having monogenic diabetes or insulin resistance, we diagnosed 3 patients with subtypes of monogenic diabetes due to disease-associated variants of INSR, LMNA, and HNF1B. Additionally, in 3 other patients, we detected rare variants with potential phenotypic effects. Notably, we identified a novel missense variant in TBC1D4 and an MC4R variant, which together may cause a mixed phenotype of severe insulin resistance. CONCLUSIONS: This comprehensive approach could assist in the early diagnosis of patients with monogenic diabetes and facilitate the provision of tailored therapy.
|
12
|
Quantification of multicellular colonization in tumor metastasis using exome-sequencing data. Int J Cancer 2020; 146:2488-2497. [PMID: 32020592 PMCID: PMC7079087 DOI: 10.1002/ijc.32910] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/14/2019] [Revised: 12/18/2019] [Accepted: 01/14/2020] [Indexed: 11/10/2022]
Abstract
Metastasis is a major cause of cancer-related mortality, and it is essential to understand how metastasis occurs in order to overcome it. One relevant question is the origin of a metastatic tumor cell population. Although the hypothesis of a single-cell origin for metastasis from a primary tumor has long been prevalent, several recent studies using mouse models have supported a multicellular origin of metastasis. Human bulk whole-exome sequencing (WES) studies have also demonstrated a multiple "clonal" origin of metastasis, with different mutational compositions. However, there has not yet been strong research to determine how many founder cells colonize a metastatic tumor. To address this question, under the metastatic model of "single bottleneck followed by rapid growth," we developed a method to quantify the "founder cell population size" in a metastasis using paired WES data from primary and metachronous metastatic tumors. Simulation studies demonstrated that the proposed method gives unbiased results with sufficient accuracy in the range of realistic settings. When the proposed method was applied to real WES data from four colorectal cancer patients, all samples supported a multicellular origin of metastasis, and the founder size was quantified as ranging from 3 to 17 cells. Such a wide range of founder sizes estimated by the proposed method suggests that there are large variations in genetic similarity between primary and metastatic tumors in the same subjects, which may explain the observed (dis)similarity of drug responses between tumors.
|
13
|
Abstract
Cancer is driven by genetic change, and the advent of massively parallel sequencing has enabled systematic documentation of this variation at the whole-genome scale [1-3]. Here we report the integrative analysis of 2,658 whole-cancer genomes and their matching normal tissues across 38 tumour types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). We describe the generation of the PCAWG resource, facilitated by international data sharing using compute clouds. On average, cancer genomes contained 4-5 driver mutations when combining coding and non-coding genomic elements; however, in around 5% of cases no drivers were identified, suggesting that cancer driver discovery is not yet complete. Chromothripsis, in which many clustered structural variants arise in a single catastrophic event, is frequently an early event in tumour evolution; in acral melanoma, for example, these events precede most somatic point mutations and affect several cancer-associated genes simultaneously. Cancers with abnormal telomere maintenance often originate from tissues with low replicative activity and show several mechanisms of preventing telomere attrition to critical levels. Common and rare germline variants affect patterns of somatic mutation, including point mutations, structural variants and somatic retrotransposition. A collection of papers from the PCAWG Consortium describes non-coding mutations that drive cancer beyond those in the TERT promoter [4]; identifies new signatures of mutational processes that cause base substitutions, small insertions and deletions and structural variation [5,6]; analyses timings and patterns of tumour evolution [7]; describes the diverse transcriptional consequences of somatic mutation on splicing, expression levels, fusion genes and promoter activity [8,9]; and evaluates a range of more-specialized features of cancer genomes [8,10-18].
|
14
|
Abstract
The discovery of drivers of cancer has traditionally focused on protein-coding genes [1-4]. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium [5] of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers [6,7], raise doubts about others and identify novel candidates, including point mutations in the 5' region of TP53, in the 3' untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.
|
15
|
A comparison of machine learning classifiers for dementia with Lewy bodies using miRNA expression data. BMC Med Genomics 2019; 12:150. [PMID: 31666070 PMCID: PMC6822471 DOI: 10.1186/s12920-019-0607-3] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 05/06/2019] [Accepted: 10/18/2019] [Indexed: 12/21/2022] Open
Abstract
Background: Dementia with Lewy bodies (DLB) is the second most common subtype of neurodegenerative dementia in humans following Alzheimer’s disease (AD). Present clinical diagnosis of DLB has high specificity and low sensitivity, and finding potential biomarkers of prodromal DLB is still challenging. MicroRNAs (miRNAs) have recently received a lot of attention as a source of novel biomarkers. Methods: In this study, using serum miRNA expression of 478 Japanese individuals, we investigated potential miRNA biomarkers and constructed an optimal risk prediction model based on several machine learning methods: penalized regression, random forest, support vector machine, and gradient boosting decision tree. Results: The final risk prediction model, constructed via a gradient boosting decision tree using 180 miRNAs and two clinical features, achieved an accuracy of 0.829 on an independent test set. We further predicted candidate target genes from the miRNAs. Gene set enrichment analysis of the miRNA target genes revealed 6 functional genes included in the DHA signaling pathway associated with DLB pathology. Two of them were further supported by gene-based association studies using a large number of single nucleotide polymorphism markers (BCL2L1: P = 0.012, PIK3R2: P = 0.021). Conclusions: Our proposed prediction model provides an effective tool for DLB classification. Also, a gene-based association test of rare variants revealed that BCL2L1 and PIK3R2 were statistically significantly associated with DLB.
|
16
|
Risk prediction models for dementia constructed by supervised principal component analysis using miRNA expression data. Commun Biol 2019; 2:77. [PMID: 30820472 PMCID: PMC6389908 DOI: 10.1038/s42003-019-0324-7] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Received: 05/21/2018] [Accepted: 01/24/2019] [Indexed: 02/07/2023] Open
Abstract
Alzheimer's disease (AD) is the most common subtype of dementia, followed by vascular dementia (VaD) and dementia with Lewy bodies (DLB). Recently, microRNAs (miRNAs) have received a lot of attention as novel biomarkers for dementia. Here, using serum miRNA expression of 1,601 Japanese individuals, we investigated potential miRNA biomarkers and constructed risk prediction models, based on a supervised principal component analysis (PCA) logistic regression method, according to the subtype of dementia. On a validation cohort, the final risk prediction models achieved an accuracy of 0.873 in AD using 78 miRNAs, 0.836 in VaD using 86 miRNAs, and 0.825 in DLB using 110 miRNAs. To our knowledge, this is the first report applying miRNA-based risk prediction models to a dementia prospective cohort. Our study demonstrates our models to be effective in prospective disease risk prediction, and with further improvement they may contribute to practical clinical use in dementia.
|
17
|
An integrative machine learning approach for prediction of toxicity-related drug safety. Life Sci Alliance 2018; 1:e201800098. [PMID: 30515477 PMCID: PMC6262234 DOI: 10.26508/lsa.201800098] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 05/30/2018] [Revised: 11/20/2018] [Accepted: 11/20/2018] [Indexed: 01/28/2023] Open
Abstract
Recent trends in drug development have been marked by diminishing returns caused by escalating costs and falling rates of new drug approval. Unacceptable drug toxicity is a substantial cause of drug failure during clinical trials and the leading cause of drug withdrawals after release to the market. Computational methods capable of predicting these failures can reduce the waste of resources and time devoted to the investigation of compounds that ultimately fail. We propose an original machine learning method that leverages the identity of drug targets and off-targets, functional impact scores computed from Gene Ontology annotations, and biological network data to predict drug toxicity. We demonstrate that our method (TargeTox) can distinguish potentially idiosyncratically toxic drugs from safe drugs and is also suitable for speculative evaluation of different target sets to support the design of optimal low-toxicity combinations.
|
18
|
Integrated analysis of human genetic association study and mouse transcriptome suggests LBH and SHF genes as novel susceptible genes for amyloid-β accumulation in Alzheimer's disease. Hum Genet 2018; 137:521-533. [PMID: 30006735 PMCID: PMC6061045 DOI: 10.1007/s00439-018-1906-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 03/06/2018] [Accepted: 07/06/2018] [Indexed: 12/04/2022]
Abstract
Alzheimer's disease (AD) is a common neurological disease that causes dementia in humans. Although reports of associated pathological genes have been increasing, the molecular mechanism leading to the accumulation of amyloid-β (Aβ) in the human brain is still not well understood. To identify novel genes that cause accumulation of Aβ in AD patients, we conducted an integrative analysis by combining a human genetic association study and transcriptome analysis in mouse brain. First, we examined genome-wide gene expression levels in the hippocampus, comparing them to Aβ levels in mice with mixed genetic backgrounds. Next, based on GWAS statistics obtained by a previous study with human AD subjects, we obtained gene-based statistics from the SNP-based statistics. We combined p values from the two types of analysis across orthologous gene pairs in human and mouse into one p value for each gene to evaluate AD susceptibility. As a result, we found five genes with significant p values in this integrated analysis among the 373 genes analyzed. We also examined the gene expression level of these five genes in the hippocampus of independent human AD cases and control subjects. Two genes, LBH and SHF, showed lower expression levels in AD cases than in control subjects. This is consistent with the expression levels of both genes in mouse, which were negatively correlated with Aβ accumulation. These results, obtained from the integrative approach, suggest that LBH and SHF are associated with AD pathogenesis.
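The abstract does not say which rule was used to merge the human and mouse p values into one p value per gene. As an illustration only, the sketch below uses Fisher's method, a common choice for combining independent p values; it should not be read as the authors' procedure. The function name and example values are hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine independent p values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the survival function has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!, so no external library is needed.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half_stat / i
        series += term
    return math.exp(-half_stat) * series


# Hypothetical example: one gene with a human-derived and a mouse-derived p value.
combined = fisher_combine([0.05, 0.05])
```

With two inputs the formula reduces to p1 * p2 * (1 - ln(p1 * p2)), so two p values of 0.05 combine to roughly 0.017. Fisher's method assumes the inputs are independent, which is plausible for separate human and mouse datasets but is an assumption of this sketch.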
|
19
|
Empirical Bayes Estimation of Semi-parametric Hierarchical Mixture Models for Unbiased Characterization of Polygenic Disease Architectures. Front Genet 2018; 9:115. [PMID: 29740473 PMCID: PMC5928254 DOI: 10.3389/fgene.2018.00115] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 09/06/2017] [Accepted: 03/22/2018] [Indexed: 12/29/2022] Open
Abstract
Genome-wide association studies (GWAS) suggest that the genetic architecture of complex diseases consists of unexpectedly numerous variants with small effect sizes. However, the polygenic architectures of many diseases have not been well characterized due to the lack of simple and fast methods for unbiased estimation of the underlying proportion of disease-associated variants and their effect-size distribution. Applying empirical Bayes estimation of semi-parametric hierarchical mixture models to GWAS summary statistics, we confirmed that schizophrenia was extremely polygenic [~40% of independent genome-wide SNPs are risk variants, most within odds ratio (OR = 1.03)], whereas rheumatoid arthritis was less polygenic (~4 to 8% risk variants, significant portion reaching OR = 1.05 to 1.1). For rheumatoid arthritis, stratified estimations revealed that expression quantitative trait loci in blood explained large genetic variance, and low- and high-frequency derived alleles were prone to be risk and protective, respectively, suggesting a predominance of deleterious-risk and advantageous-protective mutations. Despite genetic correlation, effect-size distributions for schizophrenia and bipolar disorder differed across allele frequency. These analyses distinguished disease polygenic architectures and provided clues for etiological differences in complex diseases.
|
20
|
Structural Basis and Genotype-Phenotype Correlations of INSR Mutations Causing Severe Insulin Resistance. Diabetes 2017; 66:2713-2723. [PMID: 28765322 DOI: 10.2337/db17-0301] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 03/09/2017] [Accepted: 07/24/2017] [Indexed: 11/13/2022]
Abstract
The insulin receptor (INSR) gene was analyzed in four patients with severe insulin resistance, revealing five novel mutations and a deletion that removed exon 2. A patient with Donohue syndrome (DS) had a novel p.V657F mutation in the second fibronectin type III domain (FnIII-2), which contains the α-β cleavage site and part of the insulin-binding site. The mutant INSR was expressed in Chinese hamster ovary cells, revealing that it reduced insulin proreceptor processing and impaired activation of downstream signaling cascades. Using online databases, we analyzed 82 INSR missense mutations and demonstrated that mutations causing DS were more frequently located in the FnIII domains than those causing the milder type A insulin resistance (P = 0.016). In silico structural analysis revealed that missense mutations predicted to severely impair hydrophobic core formation and stability of the FnIII domains all caused DS, whereas those predicted to produce localized destabilization and to not affect folding of the FnIII domains all caused the less severe Rabson-Mendenhall syndrome. These results suggest the importance of the FnIII domains, provide insight into the molecular mechanism of severe insulin resistance, will aid early diagnosis, and will provide potential novel targets for treating extreme insulin resistance.
|
21
|
A novel genetic syndrome with STARD9 mutation and abnormal spindle morphology. Am J Med Genet A 2017; 173:2690-2696. [PMID: 28777490 DOI: 10.1002/ajmg.a.38391] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 02/04/2017] [Revised: 05/06/2017] [Accepted: 07/14/2017] [Indexed: 11/10/2022]
Abstract
Intellectual disability (ID) is one of the neurodevelopmental disorders characterized by serious defects in both intelligence and adaptive behavior. Although it has been suggested that genetic aberrations associated with the process of cell division underlie ID, cytological evidence for mitotic defects in actual patients' cells is rarely reported. Here, we report a novel mutation in the STARD9 (also known as KIF16A) gene found in a patient with severe ID, characteristic features, epilepsy, acquired microcephaly, and blindness. Using whole-exome sequence analysis, we sequenced potential candidate genes in the patient. We identified a homozygous single-nucleotide deletion creating a premature stop codon in the STARD9 gene. STARD9 encodes a 4,700 amino acid protein belonging to the kinesin superfamily. Depletion of STARD9 or overexpression of C-terminally truncated STARD9 mutants is known to induce spindle assembly defects in cultured human cells. To determine cytological features in the patient's cells, we isolated lymphoblast cells from the patient and performed immunofluorescence analysis. Remarkably, mitotic defects, including multipolar spindle formation, fragmentation of pericentriolar material, and centrosome amplification, were observed in the cells. Taken together, our findings raise the possibility that controlled expression of full-length STARD9 is necessary for proper spindle assembly in cell division during human development. We propose that mutations in STARD9 result in abnormal spindle morphology and cause a novel genetic syndrome with ID.
|
22
|
The prediction models for postoperative overall survival and disease-free survival in patients with breast cancer. Cancer Med 2017; 6:1627-1638. [PMID: 28544536 PMCID: PMC5504310 DOI: 10.1002/cam4.1092] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 10/31/2016] [Revised: 03/17/2017] [Accepted: 04/12/2017] [Indexed: 12/18/2022]
Abstract
The goal of this study is to establish a method for predicting overall survival (OS) and disease‐free survival (DFS) in breast cancer patients after surgical operation. The gene expression profiles of cancer tissues from the patients, who underwent complete surgical resection of breast cancer and were subsequently monitored for postoperative survival, were analyzed using cDNA microarrays. We detected seven and three probes/genes associated with the postoperative OS and DFS, respectively, from our discovery cohort data. By incorporating these genes associated with the postoperative survival into MammaPrint genes, often used to predict prognosis of patients with early‐stage breast cancer, we constructed postoperative OS and DFS prediction models from the discovery cohort data using a Cox proportional hazard model. The predictive ability of the models was evaluated in another independent cohort using Kaplan–Meier (KM) curves and the area under the receiver operating characteristic curve (AUC). The KM curves showed a statistically significant difference between the predicted high‐ and low‐risk groups in both OS (log‐rank trend test P = 0.0033) and DFS (log‐rank trend test P = 0.00030). The models also achieved high AUC scores of 0.71 in OS and of 0.60 in DFS. Furthermore, our models had improved KM curves when compared to the models using MammaPrint genes (OS: P = 0.0058, DFS: P = 0.00054). Similar results were observed when our model was tested in publicly available datasets. These observations indicate that there is still room for improvement in the current methods of predicting postoperative OS and DFS in breast cancer.
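The evaluation described above relies on Kaplan–Meier curves for the predicted risk groups. As a minimal sketch of how such a curve is estimated, the following implements the standard product-limit estimator in plain Python. The function name and the toy follow-up data are hypothetical and are not taken from the study.

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate of the survival function.

    times  : follow-up duration for each patient
    events : 1 if the event (e.g. death or relapse) was observed,
             0 if the patient was censored at that time
    Returns a list of (time, survival_probability) pairs, one per
    distinct time at which at least one event occurred.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Events and censored patients both leave the risk set.
        at_risk -= removed
    return curve


# Hypothetical toy cohort: four patients, one censored at time 2.
toy_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

In practice one curve would be computed per predicted risk group and the groups compared with a log-rank test, which this sketch does not implement.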
|
23
|
Stepwise iterative maximum likelihood clustering approach. BMC Bioinformatics 2016; 17:319. [PMID: 27553625 PMCID: PMC4995791 DOI: 10.1186/s12859-016-1184-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 11/27/2015] [Accepted: 08/12/2016] [Indexed: 11/24/2022]
Abstract
Background Biological/genetic data is a complex mix of various forms or topologies, which makes it quite difficult to analyze. An abundance of such data in this modern era requires the development of sophisticated statistical methods to analyze it in a reasonable amount of time. In many biological/genetic analyses, such as genome-wide association study (GWAS) analysis or multi-omics data analysis, it is necessary to cluster the plethora of data into sub-categories to understand the subtypes of populations, cancers or any other diseases. Traditionally, the k-means clustering algorithm is a dominant clustering method. This is due to its simplicity and reasonable level of accuracy. Many other clustering methods, including support vector clustering, have been developed in the past, but do not perform well with biological data, either due to computational reasons or failure to identify clusters. Results The proposed SIML clustering algorithm has been tested on microarray datasets and SNP datasets. It has been compared with a number of clustering algorithms. On the MLL datasets, SIML achieved the highest clustering accuracy and Rand score in 4/9 cases; similarly, on the SRBCT dataset, it did so in 3/5 cases; on the ALL subtype dataset, it achieved the highest clustering accuracy in 5/7 cases and the highest Rand score in 4/7 cases. In addition, SIML clustering accuracies on a 3-cluster problem using SNP data were 97.3, 94.7 and 100 %, respectively, for each of the clusters. Conclusions In this paper, considering the nature of biological data, we proposed a maximum likelihood clustering approach using a stepwise iterative procedure. The advantage of this proposed method is that it not only uses the distance information, but also incorporates variance information for clustering. This method is able to cluster when data appear in overlapping and complex forms. The experimental results illustrate its performance and usefulness over other clustering methods.
A MATLAB package of this method (SIML) is provided at the web link http://www.riken.jp/en/research/labs/ims/med_sci_math/.
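The results above are reported in terms of clustering accuracy and Rand score. As a minimal sketch of the second metric, assuming the plain (unadjusted) Rand index is meant, the score is the fraction of sample pairs on which two labelings agree about being in the same or in different clusters. The function name and example labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Unadjusted Rand index between two labelings of the same samples."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if n < 2:
        raise ValueError("at least two samples are required")
    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(n), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        # A pair agrees when both labelings group it together,
        # or both labelings keep it apart.
        agreements += same_true == same_pred
        total_pairs += 1
    return agreements / total_pairs


# Hypothetical labels: cluster names differ but the partition is identical.
score = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

Because the index compares pairs rather than label values, it is invariant to how the clusters are numbered, which is why it suits comparisons between clustering algorithms.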
|
24
|
Erratum: Whole-genome mutational landscape and characterization of noncoding and structural mutations in liver cancer. Nat Genet 2016; 48:700. [PMID: 27230686 DOI: 10.1038/ng0616-700a] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 11/09/2022]
|
25
|
Abstract
OBJECTIVE In this paper, we focused on developing a clustering approach for biological data. In many biological analyses, such as multiomics data analysis and genome-wide association studies analysis, it is crucial to find groups of data belonging to subtypes of diseases or tumors. METHODS Conventionally, the k-means clustering algorithm is overwhelmingly applied in many areas including biological sciences. There are, however, several alternative clustering algorithms that can be applied, including support vector clustering. In this paper, taking into consideration the nature of biological data, we propose a maximum likelihood clustering scheme based on a hierarchical framework. RESULTS This method can perform clustering even when the data belonging to different groups overlap. It can also perform clustering when the number of samples is lower than the data dimensionality. CONCLUSION The proposed scheme is free from selecting initial settings to begin the search process. In addition, it does not require the computation of the first and second derivative of likelihood functions, as is required by many other maximum likelihood-based methods. SIGNIFICANCE This algorithm uses distribution and centroid information to cluster a sample and was applied to biological data. A MATLAB implementation of this method can be downloaded from the web link http://www.riken.jp/en/research/labs/ims/med_sci_math/.
|
26
|
Gene expression profiling of DBA/2J mice cochleae treated with l-methionine and valproic acid. Genomics Data 2015; 5:323-5. [PMID: 26484279 PMCID: PMC4583681 DOI: 10.1016/j.gdata.2015.06.022] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/08/2015] [Revised: 06/12/2015] [Accepted: 06/16/2015] [Indexed: 10/31/2022]
Abstract
DBA/2J mice, which have homozygous mutations in Cdh23 and Fscn2, are characterized by early-onset hearing loss as early as three weeks of age (Noben-Trauth et al., 2003 [1]) and are an animal model for progressive hearing loss research. Recently, it has been reported that epigenetic regulatory pathways likely play an important role in hearing loss (Provenzano and Domann, 2007 [2]; Mutai et al., 2009 [3]; Waldhaus et al., 2012 [4]). We previously reported that DBA/2J mice injected subcutaneously with a combination of epigenetic modifying reagents, l-methionine (MET) as a methyl donor and valproic acid (VPA) as a pan-histone deacetylase (Hdac) inhibitor, showed a significant attenuation of progressive hearing loss, as measured by their auditory brainstem response (ABR) thresholds (Mutai et al., 2015 [5]). Here we present genome-wide expression profiling of the DBA/2J mice cochleae, with and without MET and VPA treatment, to identify the genes involved in the reduction of progressive hearing loss. The raw and normalized data were deposited in NCBI's Gene Expression Omnibus (GEO ID: GSE62173) for ease of reproducibility and reanalysis.
|
27
|
Performance comparison of four commercial human whole-exome capture platforms. Sci Rep 2015; 5:12742. [PMID: 26235669 PMCID: PMC4522667 DOI: 10.1038/srep12742] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Received: 03/03/2015] [Accepted: 07/08/2015] [Indexed: 12/16/2022]
Abstract
Whole exome sequencing (WXS) is widely used to identify causative genetic mutations of diseases. However, not only have several commercial human exome capture platforms been developed, but substantial updates have been released in the past few years. We report a performance comparison for the latest release of four commercial platforms, Roche/NimbleGen’s SeqCap EZ Human Exome Library v3.0, Illumina’s Nextera Rapid Capture Exome (v1.2), Agilent’s SureSelect XT Human All Exon v5 and Agilent’s SureSelect QXT, using the same DNA samples. Agilent XT showed the highest target enrichment efficiency and the best SNV and short indel detection sensitivity in coding regions with the least amount of sequencing. Agilent QXT had slightly inferior target enrichment than Agilent XT. Illumina, with additional sequencing, detected SNVs and short indels at the same quality as Agilent XT, and showed the best performance in coverage of medically interesting mutations. NimbleGen detected more SNVs and indels in untranslated regions than the others. We also found that the platforms, which enzymatically fragment the genomic DNA (gDNA), detected more homozygous SNVs than those using sonicated gDNA. We believe that our analysis will help investigators when selecting a suitable exome capture platform for their particular research.
|
28
|
The construction of risk prediction models using GWAS data and its application to a type 2 diabetes prospective cohort. PLoS One 2014; 9:e92549. [PMID: 24651836 PMCID: PMC3961382 DOI: 10.1371/journal.pone.0092549] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 06/13/2013] [Accepted: 02/24/2014] [Indexed: 02/07/2023]
Abstract
Recent genome-wide association studies (GWAS) have identified several novel single nucleotide polymorphisms (SNPs) associated with type 2 diabetes (T2D). Various models using clinical and/or genetic risk factors have been developed for T2D risk prediction. However, analysis considering algorithms for genetic risk factor detection and regression methods for model construction in combination with interactions of risk factors has not been investigated. Here, using genotype data of 7,360 Japanese individuals, we investigated risk prediction models, considering the algorithms, regression methods and interactions. The best model identified was based on a Bayes factor approach and the lasso method. Using nine SNPs and clinical factors, this method achieved an area under a receiver operating characteristic curve (AUC) of 0.8057 on an independent test set. With the addition of a pair of interaction factors, the model was further improved (p-value 0.0011, AUC 0.8085). Application of our model to prospective cohort data showed significantly better outcome in disease-free survival, according to the log-rank trend test comparing Kaplan-Meier survival curves (p-value 2.09 × 10^-11). While the major contribution was from clinical factors rather than the genetic factors, consideration of genetic risk factors contributed to an observable, though small, increase in predictive ability. This is the first report to apply risk prediction models constructed from GWAS data to a T2D prospective cohort. Our study shows our model to be effective in prospective prediction and has the potential to contribute to practical clinical use in T2D.
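The models above are compared by AUC on an independent test set. As a minimal sketch of what that number measures, the AUC equals the probability that a randomly chosen case receives a higher risk score than a randomly chosen control, with ties counted as one half. The function name and the toy scores below are hypothetical and unrelated to the study data.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores : predicted risk score per individual
    labels : 1 for cases, 0 for controls
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0   # case ranked above control
            elif c == k:
                wins += 0.5   # tie counts as half
    return wins / (len(cases) * len(controls))


# Hypothetical scores for two cases followed by two controls.
auc = roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

The pairwise form is quadratic in sample size, so it is only meant to illustrate the definition; a rank-based computation would be preferable for cohorts of thousands of individuals.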
|
29
|
Whole-genome sequencing of liver cancers identifies etiological influences on mutation patterns and recurrent mutations in chromatin regulators. Nat Genet 2012. [PMID: 22634756 DOI: 10.1038/ng.2291] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
|
30
|
Whole-genome sequencing of liver cancers identifies etiological influences on mutation patterns and recurrent mutations in chromatin regulators. Nat Genet 2012; 44:760-4. [PMID: 22634756 DOI: 10.1038/ng.2291] [Citation(s) in RCA: 681] [Impact Index Per Article: 56.8] [Received: 01/17/2012] [Accepted: 04/30/2012] [Indexed: 12/12/2022]
Abstract
Hepatocellular carcinoma (HCC) is the third leading cause of cancer-related death worldwide. We sequenced and analyzed the whole genomes of 27 HCCs, 25 of which were associated with hepatitis B or C virus infections, including two sets of multicentric tumors. Although no common somatic mutations were identified in the multicentric tumor pairs, their whole-genome substitution patterns were similar, suggesting that these tumors developed from independent mutations, although their shared etiological backgrounds may have strongly influenced their somatic mutation patterns. Statistical and functional analyses yielded a list of recurrently mutated genes. Multiple chromatin regulators, including ARID1A, ARID1B, ARID2, MLL and MLL3, were mutated in ∼50% of the tumors. Hepatitis B virus genome integration in the TERT locus was frequently observed in a high clonal proportion. Our whole-genome sequencing analysis of HCCs identified the influence of etiological background on somatic mutation patterns and subsequent carcinogenesis, as well as recurrent mutations in chromatin regulators in HCCs.
|
31
|
Comparative genomics identifies candidate genes for infectious salmon anemia (ISA) resistance in Atlantic salmon (Salmo salar). Marine Biotechnology (New York, N.Y.) 2011; 13:232-41. [PMID: 20396924 PMCID: PMC3084937 DOI: 10.1007/s10126-010-9284-0] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Received: 12/19/2009] [Accepted: 03/04/2010] [Indexed: 05/25/2023]
Abstract
Infectious salmon anemia (ISA) has been described as the hoof and mouth disease of salmon farming. ISA is caused by a lethal and highly communicable virus, which can have a major impact on salmon aquaculture, as demonstrated by an outbreak in Chile in 2007. A quantitative trait locus (QTL) for ISA resistance has been mapped to three microsatellite markers on linkage group (LG) 8 (Chr 15) on the Atlantic salmon genetic map. We identified bacterial artificial chromosome (BAC) clones and three fingerprint contigs from the Atlantic salmon physical map that contain these markers. We made use of the extensive BAC end sequence database to extend these contigs by chromosome walking and identified two additional markers in this region. The BAC end sequences were used to search for conserved synteny between this segment of LG8 and the fish genomes that have been sequenced. An examination of the genes in the syntenic segments of the Tetraodon and medaka genomes identified candidates for association with ISA resistance in Atlantic salmon based on differential expression profiles from ISA challenges or on the putative biological functions of the proteins they encode. One gene in particular, HIV-EP2/MBP-2, caught our attention as it may influence the expression of several genes that have been implicated in the response to infection by infectious salmon anemia virus (ISAV). Therefore, we suggest that HIV-EP2/MBP-2 is a very strong candidate for the gene associated with the ISAV resistance QTL in Atlantic salmon and is worthy of further study.
|
32
|
Genomic organization and evolution of the Atlantic salmon hemoglobin repertoire. BMC Genomics 2010; 11:539. [PMID: 20923558 PMCID: PMC3091688 DOI: 10.1186/1471-2164-11-539] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Received: 03/19/2010] [Accepted: 10/05/2010] [Indexed: 02/06/2023]
Abstract
BACKGROUND The genomes of salmonids are considered pseudo-tetraploid undergoing reversion to a stable diploid state. Given the genome duplication and extensive biological data available for salmonids, they are excellent model organisms for studying comparative genomics, evolutionary processes, fates of duplicated genes and the genetic and physiological processes associated with complex behavioral phenotypes. The evolution of the tetrapod hemoglobin genes is well studied; however, little is known about the genomic organization and evolution of teleost hemoglobin genes, particularly those of salmonids. The Atlantic salmon serves as a representative salmonid species for genomics studies. Given the well documented role of hemoglobin in adaptation to varied environmental conditions as well as its use as a model protein for evolutionary analyses, an understanding of the genomic structure and organization of the Atlantic salmon α and β hemoglobin genes is of great interest. RESULTS We identified four bacterial artificial chromosomes (BACs) comprising two hemoglobin gene clusters spanning the entire α and β hemoglobin gene repertoire of the Atlantic salmon genome. Their chromosomal locations were established using fluorescence in situ hybridization (FISH) analysis and linkage mapping, demonstrating that the two clusters are located on separate chromosomes. The BACs were sequenced and assembled into scaffolds, which were annotated for putatively functional and pseudogenized hemoglobin-like genes. This revealed that the tail-to-tail organization and alternating pattern of the α and β hemoglobin genes are well conserved in both clusters, as well as that the Atlantic salmon genome houses substantially more hemoglobin genes, including non-Bohr β globin genes, than the genomes of other teleosts that have been sequenced. 
CONCLUSIONS We suggest that the most parsimonious evolutionary path leading to the present organization of the Atlantic salmon hemoglobin genes involves the loss of a single hemoglobin gene cluster after the whole genome duplication (WGD) at the base of the teleost radiation but prior to the salmonid-specific WGD, which then produced the duplicated copies seen today. We also propose that the relatively high number of hemoglobin genes as well as the presence of non-Bohr β hemoglobin genes may be due to the dynamic life history of salmon and the diverse environmental conditions that the species encounters. Data deposition: BACs S0155C07 and S0079J05 (fps135): GenBank GQ898924; BACs S0055H05 and S0014B03 (fps1046): GenBank GQ898925.
|
33
|
Assessing the feasibility of GS FLX Pyrosequencing for sequencing the Atlantic salmon genome. BMC Genomics 2008; 9:404. [PMID: 18755037 PMCID: PMC2532694 DOI: 10.1186/1471-2164-9-404] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.3] [Received: 04/30/2008] [Accepted: 08/28/2008] [Indexed: 11/16/2022]
Abstract
Background With a whole genome duplication event and wealth of biological data, salmonids are excellent model organisms for studying evolutionary processes, fates of duplicated genes and genetic and physiological processes associated with complex behavioral phenotypes. It is surprising therefore, that no salmonid genome has been sequenced. Atlantic salmon (Salmo salar) is a good representative salmonid for sequencing given its importance in aquaculture and the genomic resources available. However, the size and complexity of the genome combined with the lack of a sequenced reference genome from a closely related fish makes assembly challenging. Given the cost and time limitations of Sanger sequencing as well as recent improvements to next generation sequencing technologies, we examined the feasibility of using the Genome Sequencer (GS) FLX pyrosequencing system to obtain the sequence of a salmonid genome. Eight pooled BACs belonging to a minimum tiling path covering ~1 Mb of the Atlantic salmon genome were sequenced by GS FLX shotgun and Long Paired End sequencing and compared with a ninth BAC sequenced by Sanger sequencing of a shotgun library. Results An initial assembly using only GS FLX shotgun sequences (average read length 248.5 bp) with ~30× coverage allowed gene identification, but was incomplete even when 126 Sanger-generated BAC-end sequences (~0.09× coverage) were incorporated. The addition of paired end sequencing reads (additional ~26× coverage) produced a final assembly comprising 175 contigs assembled into four scaffolds with 171 gaps. Sanger sequencing of the ninth BAC (~10.5× coverage) produced nine contigs and two scaffolds. The number of scaffolds produced by the GS FLX assembly was comparable to Sanger-generated sequencing; however, the number of gaps was much higher in the GS FLX assembly. Conclusion These results represent the first use of GS FLX paired end reads for de novo sequence assembly. 
Our data demonstrated that this improved the GS FLX assemblies; however, with respect to de novo sequencing of complex genomes, the GS FLX technology is limited to gene mining and establishing a set of ordered sequence contigs. Currently, for a salmonid reference sequence, it appears that a substantial portion of sequencing should be done using Sanger technology.
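The coverage figures quoted above (for example ~30× over ~1 Mb with an average read length of 248.5 bp) follow the usual definition of sequencing depth, coverage = number of reads × mean read length / target size. The sketch below only illustrates that arithmetic; the function names are hypothetical, and the read count it yields is a back-of-the-envelope estimate, not a figure reported in the abstract.

```python
def coverage_depth(n_reads, mean_read_length_bp, target_size_bp):
    """Average sequencing depth: total sequenced bases / target size."""
    if target_size_bp <= 0:
        raise ValueError("target size must be positive")
    return n_reads * mean_read_length_bp / target_size_bp


def reads_needed(coverage, target_size_bp, mean_read_length_bp):
    """Number of reads implied by a given depth over a given target."""
    if mean_read_length_bp <= 0:
        raise ValueError("read length must be positive")
    return coverage * target_size_bp / mean_read_length_bp


# Inputs taken from the abstract: ~30x coverage, ~1 Mb target,
# 248.5 bp average read length. The result is an estimate only.
estimated_reads = reads_needed(30, 1_000_000, 248.5)
```

Under these assumptions the shotgun run corresponds to roughly 1.2 × 10^5 reads, which helps put the much lower ~0.09× contribution of the 126 BAC-end sequences in perspective.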
|
34
|
Piecing together a ciliome. Trends Genet 2006; 22:491-500. [PMID: 16860433 DOI: 10.1016/j.tig.2006.07.006] [Citation(s) in RCA: 155] [Impact Index Per Article: 8.6] [Received: 02/13/2006] [Revised: 05/15/2006] [Accepted: 07/06/2006] [Indexed: 10/24/2022]
Abstract
Cilia are slender microtubule-based appendages that emanate from the surfaces of a large proportion of eukaryotic cells. The motile and non-motile forms of cilia represent bona fide organelles comprising distinct repertoires of proteins that serve specific roles in locomotion or fluid movement, and sense chemical or physical extracellular cues. Owing in part to the growing number of genes associated with ciliary disorders, such as polycystic kidney disease and Bardet-Biedl syndrome, there has been a recent profusion of studies aimed at unveiling the protein makeup of cilia. The approaches used are complementary, involving several different organisms and spanning the fields of bioinformatics, genomics and proteomics. Here we review these studies and assess the various data sets to help define a comprehensive ciliary proteome, or 'ciliome'. We have compiled a cilia protein database that includes known cilia-associated proteins and numerous putative ciliary proteins including RAB-like small GTPases, which might be implicated in vesicular trafficking, and the microtubule-binding protein MIP-T3, some of which might be associated with ciliopathies.
|
35
|
Functional genomics of the cilium, a sensory organelle. Curr Biol 2005; 15:935-41. [PMID: 15916950 DOI: 10.1016/j.cub.2005.04.059] [Citation(s) in RCA: 217] [Impact Index Per Article: 11.4] [Received: 01/03/2005] [Revised: 04/13/2005] [Accepted: 04/18/2005] [Indexed: 11/30/2022]
Abstract
Cilia and flagella play important roles in many physiological processes, including cell and fluid movement, sensory perception, and development. The biogenesis and maintenance of cilia depend on intraflagellar transport (IFT), a motility process that operates bidirectionally along the ciliary axoneme. Disruption in IFT and cilia function causes several human disorders, including polycystic kidneys, retinal dystrophy, neurosensory impairment, and Bardet-Biedl syndrome (BBS). To uncover new ciliary components, including IFT proteins, we compared C. elegans ciliated neuronal and nonciliated cells through serial analysis of gene expression (SAGE) and screened for genes potentially regulated by the ciliogenic transcription factor, DAF-19. Using these complementary approaches, we identified numerous candidate ciliary genes and confirmed the ciliated-cell-specific expression of 14 novel genes. One of these, C27H5.7a, encodes a ciliary protein that undergoes IFT. As with other IFT proteins, its ciliary localization and transport is disrupted by mutations in IFT and bbs genes. Furthermore, we demonstrate that the ciliary structural defect of C. elegans dyf-13(mn396) mutants is caused by a mutation in C27H5.7a. Together, our findings help define a ciliary transcriptome and suggest that DYF-13, an evolutionarily conserved protein, is a novel core IFT component required for cilia function.
|