1
|
Interrogating the Human Diplome: Computational Methods, Emerging Applications, and Challenges. Methods Mol Biol 2023; 2590:1-30. [PMID: 36335489 DOI: 10.1007/978-1-0716-2819-5_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/17/2023]
Abstract
Human DNA sequencing protocols have revolutionized human biology, biomedical science, and clinical practice, but still have very important limitations. One limitation is that most protocols do not separate or assemble (i.e., "phase") the nucleotide content of each of the maternally and paternally derived chromosomal homologs making up the 22 autosomal pairs and the chromosomal pair making up the pseudo-autosomal region of the sex chromosomes. This has led to a dearth of studies and a consequent underappreciation of many phenomena of fundamental importance to basic and clinical genomic science. We discuss a few protocols for obtaining phase information as well as their limitations, including those that could be used in tumor phasing settings. We then describe a number of biological and clinical phenomena that require phase information. These include phenomena that require precise knowledge of the nucleotide sequence in a chromosomal segment from germline or somatic cells, such as DNA binding events, and insight into unique cis vs. trans-acting functionally impactful variant combinations (for example, variants implicated in a phenotype governed by compound heterozygosity). In addition, we comment on the need for reliable and consensus-based diploid-context computational workflows for variant identification as well as the need for laboratory-based functional verification strategies for validating cis vs. trans effects of variant combinations. We also briefly describe available resources, example studies, and areas of further research, and ultimately argue that the science behind the study of human diploidy, referred to as "diplomics," which will be enabled by nucleotide-level resolution of phased genomes, is a logical next step in the analysis of human genome biology.
|
2
|
Technological advances in cancer immunity: from immunogenomics to single-cell analysis and artificial intelligence. Signal Transduct Target Ther 2021; 6:312. [PMID: 34417437 PMCID: PMC8377461 DOI: 10.1038/s41392-021-00729-7] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Received: 02/25/2021] [Revised: 07/06/2021] [Accepted: 07/18/2021] [Indexed: 02/07/2023] Open
Abstract
Immunotherapies play critical roles in cancer treatment. However, given that only a few patients respond to immune checkpoint blockades and other immunotherapeutic strategies, more novel technologies are needed to decipher the complicated interplay between tumor cells and the components of the tumor immune microenvironment (TIME). Tumor immunomics refers to the integrated study of the TIME using immunogenomics, immunoproteomics, immune-bioinformatics, and other multi-omics data reflecting the immune states of tumors, which has relied on the rapid development of next-generation sequencing. High-throughput genomic and transcriptomic data may be utilized for calculating the abundance of immune cells and predicting tumor antigens, referring to immunogenomics. However, as bulk sequencing represents the average characteristics of a heterogeneous cell population, it fails to distinguish distinct cell subtypes. Single-cell-based technologies enable better dissection of the TIME through precise immune cell subpopulation and spatial architecture investigations. In addition, radiomics and digital pathology-based deep learning models largely contribute to research on cancer immunity. These artificial intelligence technologies have performed well in predicting response to immunotherapy, with profound significance in cancer therapy. In this review, we briefly summarize conventional and state-of-the-art technologies in the field of immunogenomics, single-cell and artificial intelligence, and present prospects for future research.
|
3
|
DIVIS: Integrated and Customizable Pipeline for Cancer Genome Sequencing Analysis and Interpretation. Front Oncol 2021; 11:672597. [PMID: 34168993 PMCID: PMC8217664 DOI: 10.3389/fonc.2021.672597] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2021] [Accepted: 04/27/2021] [Indexed: 11/13/2022] Open
Abstract
Next-generation sequencing (NGS) has drastically enhanced human cancer research, but diverse sequencing strategies, complicated open-source software, and the identification of massive numbers of mutations have limited the clinical application of NGS. Here, we first presented GPyFlow, a lightweight tool that flexibly customizes, executes, and shares workflows. We then introduced DIVIS, a customizable pipeline based on GPyFlow that integrates read preprocessing, alignment, variant detection, and annotation of whole-genome sequencing, whole-exome sequencing, and gene-panel sequencing. By default, DIVIS screens variants from multiple callers and generates a standard variant-detection format list containing caller evidence for each sample, which is compatible with advanced analyses. Lastly, DIVIS generates a statistical report, including command lines, parameters, quality-control indicators, and mutation summary. DIVIS substantially facilitates complex cancer genome sequencing analyses by means of a single powerful and easy-to-use command. The DIVIS code is freely available at https://github.com/niu-lab/DIVIS, and the docker image can be downloaded from https://hub.docker.com/repository/docker/sunshinerain/divis.
|
4
|
Identification and validation of a novel eight mutant-derived long non-coding RNAs signature as a prognostic biomarker for genome instability in low-grade glioma. Aging (Albany NY) 2021; 13:15164-15192. [PMID: 34081618 PMCID: PMC8221298 DOI: 10.18632/aging.203079] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 02/18/2021] [Accepted: 05/11/2021] [Indexed: 04/08/2023]
Abstract
Long non-coding RNAs (lncRNAs) comprise an integral part of the eukaryotic transcriptome. Alongside proteins, lncRNAs modulate lncRNA-based gene signatures of unstable transcripts, play a crucial role as antisense lncRNAs to control intracellular homeostasis and are implicated in tumorigenesis. However, the role of genomic instability-associated lncRNAs in low-grade gliomas (LGG) has not been fully explored. In this study, lncRNA expression and somatic mutation profiles in the low-grade glioma genome were used to identify eight novel mutant-derived genomic instability-associated lncRNAs including H19, FLG-AS1, AC091932.1, AC064875.1, AL138767.3, AC010273.2, AC131097.4 and ISX-AS1. Patients from the LGG gene mutagenome atlas were grouped into training and validation sets to test the performance of the signature. The genomic instability-associated lncRNAs signature (GILncSig) was then validated using multiple external cohorts. A total of 59 novel genomic instability-associated lncRNAs in LGG were used for least absolute shrinkage and selection operator (Lasso), single and multifactor Cox regression analysis using the training set. Furthermore, the independent predictive role of risk features in the training and validation sets was evaluated through survival analysis, receiver operating characteristic analysis and construction of a nomogram. Patients with IDH1 mutation status were grouped into two different risk groups based on the GILncSig score. The low-risk group showed a relatively higher rate of IDH1 mutations compared with patients in the high-risk group. Furthermore, patients in the low-risk group had better prognosis compared with patients in the high-risk group. In summary, this study reports a reliable prognostic prediction signature and provides a basis for further investigation of the role of lncRNAs in genomic instability. In addition, lncRNAs in the signature can be used as new targets for treatment of LGG.
|
5
|
Scalability and cost-effectiveness analysis of whole genome-wide association studies on Google Cloud Platform and Amazon Web Services. J Am Med Inform Assoc 2020; 27:1425-1430. [PMID: 32719837 PMCID: PMC7534581 DOI: 10.1093/jamia/ocaa068] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/23/2020] [Revised: 03/20/2020] [Accepted: 04/17/2020] [Indexed: 01/14/2023] Open
Abstract
Objective: Advancements in human genomics have generated a surge of available data, fueling the growth and accessibility of databases for more comprehensive, in-depth genetic studies. Methods: We provide a straightforward and innovative methodology to optimize cloud configuration in order to conduct genome-wide association studies. We utilized Spark clusters on both Google Cloud Platform and Amazon Web Services, as well as Hail (http://doi.org/10.5281/zenodo.2646680) for analysis and exploration of genomic variant datasets. Results: Comparative evaluation of numerous cloud-based cluster configurations demonstrates a successful and unprecedented compromise between speed and cost for performing genome-wide association studies on 4 distinct whole-genome sequencing datasets. Results are consistent across the 2 cloud providers and could be highly useful for accelerating research in genetics. Conclusions: We present a timely piece for one of the most frequently asked questions when moving to the cloud: what is the trade-off between speed and cost?
|
6
|
Comprehensive fundamental somatic variant calling and quality management strategies for human cancer genomes. Brief Bioinform 2020; 22:5854402. [PMID: 32510555 DOI: 10.1093/bib/bbaa083] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 02/04/2020] [Revised: 04/19/2020] [Accepted: 04/21/2020] [Indexed: 12/21/2022] Open
Abstract
Next-generation sequencing (NGS) technology has revolutionised human cancer research, particularly via detection of genomic variants with its ultra-high-throughput sequencing and increasing affordability. However, the inundation of rich cancer genomics data has resulted in significant challenges in its exploration and translation into biological insights. One of the difficulties in cancer genome sequencing is software selection. Currently, multiple tools are widely used to process NGS data in four stages: raw sequence data pre-processing and quality control (QC), sequence alignment, variant calling and annotation and visualisation. However, the differences between these NGS tools, including their installation, merits, drawbacks and application, have not been fully appreciated. Therefore, a systematic review of the functionality and performance of NGS tools is required to provide cancer researchers with guidance on software and strategy selection. Another challenge is the multidimensional QC of sequencing data because QC can not only report varied sequence data characteristics but also reveal deviations in diverse features and is essential for a meaningful and successful study. However, monitoring of QC metrics in specific steps including alignment and variant calling is neglected in certain pipelines such as the 'Best Practices Workflows' in GATK. In this review, we investigated the most widely used software for the fundamental analysis and QC of cancer genome sequencing data and provided instructions for selecting the most appropriate software and pipelines to ensure precise and efficient conclusions. We further discussed the prospects and new research directions for cancer genomics.
|
7
|
DriverDBv3: a multi-omics database for cancer driver gene research. Nucleic Acids Res 2020; 48:D863-D870. [PMID: 31701128 PMCID: PMC7145679 DOI: 10.1093/nar/gkz964] [Citation(s) in RCA: 83] [Impact Index Per Article: 20.8] [Received: 09/14/2019] [Revised: 10/09/2019] [Accepted: 11/06/2019] [Indexed: 12/13/2022] Open
Abstract
An integrative multi-omics database is needed urgently, because focusing only on analysis of one-dimensional data falls far short of providing an understanding of cancer. Previously, we presented DriverDB, a cancer driver gene database that applies published bioinformatics algorithms to identify driver genes/mutations. The updated DriverDBv3 database (http://ngs.ym.edu.tw/driverdb) is designed to interpret cancer omics’ sophisticated information with concise data visualization. To offer diverse insights into molecular dysregulation/dysfunction events, we incorporated computational tools to define CNV and methylation drivers. Further, four new features, CNV, Methylation, Survival, and miRNA, allow users to explore the relations from two perspectives in the ‘Cancer’ and ‘Gene’ sections. The ‘Survival’ panel offers not only significant survival genes, but also the synergistic effects of gene pairs. A new function, ‘Survival Analysis’ in ‘Customized-analysis,’ allows users to investigate the co-occurring events in user-defined gene(s) by mutation status or by expression in a specific patient group. Moreover, we redesigned the web interface and provided interactive figures to interpret cancer omics’ sophisticated information, and also constructed a Summary panel in the ‘Cancer’ and ‘Gene’ sections to visualize the features on multi-omics levels concisely. DriverDBv3 seeks to improve the study of integrative cancer omics data by identifying driver genes and contributes to cancer biology.
|
8
|
Butler enables rapid cloud-based analysis of thousands of human genomes. Nat Biotechnol 2020; 38:288-292. [PMID: 32024987 PMCID: PMC7062635 DOI: 10.1038/s41587-019-0360-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 07/14/2017] [Accepted: 07/05/2018] [Indexed: 11/08/2022]
Abstract
We present Butler, a computational tool that facilitates large-scale genomic analyses on public and academic clouds. Butler includes innovative anomaly detection and self-healing functions that improve the efficiency of data processing and analysis by 43% compared with current approaches. Butler enabled processing of a 725-terabyte cancer genome dataset from the Pan-Cancer Analysis of Whole Genomes (PCAWG) project in a time-efficient and uniform manner.
|
9
|
A Fast and Scalable Workflow for SNPs Detection in Genome Sequences Using Hadoop Map-Reduce. Genes (Basel) 2020; 11:E166. [PMID: 32033366 PMCID: PMC7074349 DOI: 10.3390/genes11020166] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/20/2020] [Revised: 01/31/2020] [Accepted: 02/01/2020] [Indexed: 11/16/2022] Open
Abstract
Next generation sequencing (NGS) technologies produce a huge amount of biological data, which poses various issues such as long processing times and large memory requirements. This research focuses on the detection of single nucleotide polymorphisms (SNPs) in genome sequences. Currently, SNP detection algorithms face several issues, e.g., computational overhead cost, accuracy, and memory requirements. In this research, we propose a fast and scalable workflow that integrates the Bowtie aligner with the Hadoop-based Heap SNP caller to improve SNP detection in genome sequences. The proposed workflow is validated through benchmark datasets obtained from publicly available web portals, e.g., NCBI and DDBJ DRA. Extensive experiments have been performed, and the results obtained are compared with the Bowtie and BWA aligners in the alignment phase, and with GATK, FaSD, SparkGA, Halvade, and Heap in the SNP calling phase. Analysis of the experimental results shows that the proposed workflow outperforms existing frameworks, e.g., GATK, FaSD, Heap integrated with BWA and Bowtie aligners, SparkGA, and Halvade. The proposed framework achieved a 22.46% more efficient F-score and 99.80% consistent accuracy on average. In addition, a comparatively higher mean accuracy, by 0.21%, is achieved. Moreover, SNP mining has also been performed to identify specific regions in genome sequences. All the frameworks are implemented with the default configuration of memory management. The observations show that all workflows have approximately the same memory requirement. In the future, we intend to show the mined SNPs graphically for user-friendly interaction and to analyze and optimize the memory requirements.
|
10
|
Pathogenic Germline Variants in 10,389 Adult Cancers. Cell 2018; 173:355-370.e14. [PMID: 29625052 DOI: 10.1016/j.cell.2018.03.039] [Citation(s) in RCA: 501] [Impact Index Per Article: 100.2] [Received: 12/23/2017] [Revised: 02/24/2018] [Accepted: 03/15/2018] [Indexed: 12/20/2022]
Abstract
We conducted the largest investigation of predisposition variants in cancer to date, discovering 853 pathogenic or likely pathogenic variants in 8% of 10,389 cases from 33 cancer types. Twenty-one genes showed single or cross-cancer associations, including novel associations of SDHA in melanoma and PALB2 in stomach adenocarcinoma. The 659 predisposition variants and 18 additional large deletions in tumor suppressors, including ATM, BRCA1, and NF1, showed low gene expression and frequent (43%) loss of heterozygosity or biallelic two-hit events. We also discovered 33 such variants in oncogenes, including missenses in MET, RET, and PTPN11 associated with high gene expression. We nominated 47 additional predisposition variants from prioritized VUSs supported by multiple evidences involving case-control frequency, loss of heterozygosity, expression effect, and co-localization with mutations and modified residues. Our integrative approach links rare predisposition variants to functional consequences, informing future guidelines of variant classification and germline genetic testing in cancer.
|
11
|
Real-World Evidence In Support Of Precision Medicine: Clinico-Genomic Cancer Data As A Case Study. Health Aff (Millwood) 2018; 37:765-772. [DOI: 10.1377/hlthaff.2017.1579] [Citation(s) in RCA: 61] [Impact Index Per Article: 10.2] [Indexed: 12/28/2022]
|
12
|
Perspective on Oncogenic Processes at the End of the Beginning of Cancer Genomics. Cell 2018; 173:305-320.e10. [PMID: 29625049 PMCID: PMC5916814 DOI: 10.1016/j.cell.2018.03.033] [Citation(s) in RCA: 210] [Impact Index Per Article: 35.0] [Received: 11/17/2017] [Revised: 02/20/2018] [Accepted: 03/13/2018] [Indexed: 12/21/2022]
Abstract
The Cancer Genome Atlas (TCGA) has catalyzed systematic characterization of diverse genomic alterations underlying human cancers. At this historic junction marking the completion of genomic characterization of over 11,000 tumors from 33 cancer types, we present our current understanding of the molecular processes governing oncogenesis. We illustrate our insights into cancer through synthesis of the findings of the TCGA PanCancer Atlas project on three facets of oncogenesis: (1) somatic driver mutations, germline pathogenic variants, and their interactions in the tumor; (2) the influence of the tumor genome and epigenome on transcriptome and proteome; and (3) the relationship between tumor and the microenvironment, including implications for drugs targeting driver events and immunotherapies. These results will anchor future characterization of rare and common tumor types, primary and relapsed tumors, and cancers across ancestry groups and will guide the deployment of clinical genomic sequencing.
|