1
Moffett AS, Falcón-Cortés A, Di Pierro M. Quantifying the influence of genetic context on duplicated mammalian genes. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.04.03.647042. [PMID: 40236061 PMCID: PMC11996522 DOI: 10.1101/2025.04.03.647042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/17/2025]
Abstract
Gene duplication is a fundamental part of evolutionary innovation. While single-gene duplications frequently exhibit asymmetric evolutionary rates between paralogs, the extent to which this applies to multi-gene duplications remains unclear. In this study, we investigate the role of genetic context in shaping evolutionary divergence within multi-gene duplications, leveraging microsynteny to differentiate source and target copies. Using a dataset of 193 mammalian genome assemblies and a bird outgroup, we systematically analyze patterns of sequence divergence between duplicated genes and reference orthologs. We find that target copies, those relocated to new genomic environments, exhibit elevated evolutionary rates compared to source copies in the ancestral location. This asymmetry is influenced by the distance between copies and the size of the target copy. We also demonstrate that the polarization of rate asymmetry in paralogs, the "choice" of the slowly evolving copy, is biased towards collective, block-wise polarization in multi-gene duplications. Our findings highlight the importance of genetic context in modulating post-duplication divergence, where differences in cis-regulatory elements and co-expressed gene clusters between source and target copies may be responsible. This study presents a large-scale test of asymmetric evolution in multi-gene duplications, offering new insight into how genome architecture shapes functional diversification of paralogs.
2
Li X, Nguyen J, Korkut A. Recurrent Composite Markers of Cell Types and States. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2023.07.17.549344. [PMID: 37503180 PMCID: PMC10370072 DOI: 10.1101/2023.07.17.549344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
Biological function is mediated by the hierarchical organization of cell types and states within tissue ecosystems. Identifying interpretable composite marker sets that both define and distinguish hierarchical cell identities is essential for decoding biological complexity, yet remains a major challenge. Here, we present RECOMBINE, an algorithm that identifies recurrent composite marker sets to define hierarchical cell identities. Validation using both simulated and biological datasets demonstrates that RECOMBINE achieves higher accuracy in identifying discriminative markers compared to existing approaches, including differential gene expression analysis. When applied to single-cell data and validated with spatial transcriptomics data from the mouse visual cortex, RECOMBINE identified key cell type markers and generated a robust gene panel for targeted spatial profiling. It also uncovered markers of CD8+ T cell states, including GZMK+HAVCR2- effector memory cells associated with anti-PD-1 therapy response, and revealed a rare intestinal subpopulation with composite markers in mice. Finally, using data from the Tabula Sapiens project, RECOMBINE identified composite marker sets across a broad range of human tissues. Together, these results highlight RECOMBINE as a robust, data-driven framework for optimized marker selection, enabling the discovery and validation of hierarchical cell identities across diverse tissue contexts.
3
Kuluev AR, Matniyazov RT, Kuluev BR, Chemeris DA, Chemeris AV. Complete chloroplast genomes of five Aegilops aucheri Boiss. accessions having different geographical origins. Mitochondrial DNA A DNA Mapp Seq Anal 2025; 35:119-125. [PMID: 40074559 DOI: 10.1080/24701394.2025.2476401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Accepted: 03/03/2025] [Indexed: 03/14/2025]
Abstract
The subject of this study is Aegilops aucheri Boiss. 1844: a member of the section Sitopsis, subsection Truncata. This species is infrequently included in phylogenetic studies and is commonly regarded as a heterotypic synonym of Aegilops speltoides Tausch. The aim of this study was to detect genetic differences between Ae. aucheri and Ae. speltoides using the phylogenetic signal retrieved from chloroplast genomes. Plastomes of five Ae. aucheri accessions from different geographical locations were sequenced, annotated, and subjected to a phylogenetic analysis. Plastome sizes were found to range between 135,666 and 135,668 bp in Ae. aucheri. Comparative analysis of the chloroplast genome sequences from five Ae. aucheri accessions revealed single-nucleotide polymorphisms (SNPs) and insertions/deletions (indels) relative to the Ae. speltoides plastome. To gain a more comprehensive understanding of the genetic divergence within the Truncata subsection, sequencing the nuclear genome of Ae. aucheri and comparing it to that of Ae. speltoides is essential.
Affiliation(s)
- Azat R Kuluev
  - Institute of Biochemistry and Genetics of Ufa, Federal Research Centre of RAS, Ufa, Russia
- Rustam T Matniyazov
  - Institute of Biochemistry and Genetics of Ufa, Federal Research Centre of RAS, Ufa, Russia
- Bulat R Kuluev
  - Institute of Biochemistry and Genetics of Ufa, Federal Research Centre of RAS, Ufa, Russia
- Dmitry A Chemeris
  - Institute of Biochemistry and Genetics of Ufa, Federal Research Centre of RAS, Ufa, Russia
- Alexey V Chemeris
  - Institute of Biochemistry and Genetics of Ufa, Federal Research Centre of RAS, Ufa, Russia
4
Li Y, Xiao P, Boadu F, Goldkamp AK, Nirgude S, Cheng J, Hagen DE, Kalish JM, Rivera RM. Beckwith-Wiedemann syndrome and large offspring syndrome involve alterations in methylome, transcriptome, and chromatin configuration. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2025:2023.12.14.23299981. [PMID: 38168424 PMCID: PMC10760283 DOI: 10.1101/2023.12.14.23299981] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/05/2024]
Abstract
Beckwith-Wiedemann Syndrome (BWS) is the most common epigenetic overgrowth syndrome, caused by epigenetic alterations on chromosome 11p15. In ∼50% of patients with BWS, the imprinted region KvDMR1 (IC2) is hypomethylated. Nearly all children with BWS develop organ overgrowth and up to 28% develop cancer during childhood. The global epigenetic alterations beyond the 11p15 region in BWS are not currently known. Uncovering these alterations at the methylome, transcriptome, and chromatin architecture levels is a necessary step to improve the diagnosis and understanding of patients with BWS. Here we characterized the complete epigenetic profiles of BWS IC2 individuals together with the animal model of BWS, bovine large offspring syndrome (LOS). A novel finding of this research is the identification of two molecular subgroups of BWS IC2 individuals. Genome-wide alterations were detected for DNA methylation, transcript abundance, alternative splicing events of RNA, chromosome compartments, and topologically associating domains (TADs) in BWS and LOS, with shared alterations identified between species. Altered chromosome compartments and TADs were correlated with differentially expressed genes in BWS and LOS. Together, we highlight genes and genomic regions that have the potential to serve as targets for biomarker development to improve current molecular diagnostic methodologies for BWS.
5
Satas G, Myers MA, McPherson A, Shah SP. Inferring active mutational processes in cancer using single cell sequencing and evolutionary constraints. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.02.24.639589. [PMID: 40060559 PMCID: PMC11888314 DOI: 10.1101/2025.02.24.639589] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/17/2025]
Abstract
Ongoing mutagenesis in cancer drives genetic diversity throughout the natural history of cancers. As the activities of mutational processes are dynamic throughout evolution, distinguishing the mutational signatures of 'active' and 'historical' processes has important implications for studying how tumors evolve. This can aid in understanding mutagenic states at the time of presentation, and in associating active mutational processes with therapeutic resistance. As bulk sequencing primarily captures historical mutational processes, we studied whether ultra-low-coverage single-cell whole-genome sequencing (scWGS), which measures the distribution of mutations across hundreds or thousands of individual cells, could enable the distinction between historical and active mutational processes. While technical challenges and data sparsity have limited mutation analysis in scWGS, we show that these data contain valuable information about dynamic mutational processes. To robustly interpret single nucleotide variants (SNVs) in scWGS, we introduce ArtiCull, a method to identify and remove SNV artifacts by leveraging evolutionary constraints, enabling reliable detection of mutations for signature analysis. Applying this approach to scWGS data from pancreatic ductal adenocarcinoma (PDAC), triple-negative breast cancer (TNBC), and high-grade serous ovarian cancer (HGSOC), we uncover temporal and spatial patterns in mutational processes. In PDAC, we observe a temporal increase in mismatch repair deficiency (MMRd). In cisplatin-treated TNBC patient-derived xenografts, we identify therapy-induced mutagenesis and inactivation of APOBEC3 activity. In HGSOC, we show distinct patterns of APOBEC3 mutagenesis, including late tumor-wide activation in one case and clade-specific enrichment in another. Additionally, we detect a clone-specific increase in SBS17 activity, in a clone previously linked to recurrence. Our findings establish ultra-low-coverage scWGS as a powerful approach for studying active mutational processes that may influence ongoing clonal evolution and therapeutic resistance.
Affiliation(s)
- Gryte Satas
  - Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
  - The Halvorsen Center for Computational Oncology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Matthew A. Myers
  - Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
  - The Halvorsen Center for Computational Oncology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Andrew McPherson
  - Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
  - The Halvorsen Center for Computational Oncology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Sohrab P. Shah
  - Computational Oncology, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
  - The Halvorsen Center for Computational Oncology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
6
Kumari P, Friedman RZ, Pi L, Curtis SW, Paraiso K, Visel A, Rhea L, Dunnwald M, Patni AP, Mar D, Bomsztyk K, Mathieu J, Ruohola-Baker H, Leslie EJ, White MA, Cohen BA, Cornell RA. Identification of functional non-coding variants associated with orofacial cleft. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2024.06.01.596914. [PMID: 40027800 PMCID: PMC11870446 DOI: 10.1101/2024.06.01.596914] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2025]
Abstract
Oral facial cleft (OFC) is a multifactorial disorder that can present as a cleft lip with or without cleft palate (CL/P) or a cleft palate only. Genome wide association studies (GWAS) of isolated OFC have identified common single nucleotide polymorphisms (SNPs) at the 1q32/IRF6 locus and many other loci where, like IRF6, the presumed OFC-relevant gene is expressed in embryonic oral epithelium. To identify the functional subset of SNPs at eight such loci we conducted a massively parallel reporter assay in a cell line derived from fetal oral epithelium, revealing SNPs with allele-specific effects on enhancer activity. We filtered these against chromatin-mark evidence of enhancers in relevant cell types or tissues, and then tested a subset in traditional reporter assays, yielding six candidates for functional SNPs in five loci (1q32/IRF6, 3q28/TP63, 6p24.3/TFAP2A, 20q12/MAFB, and 9q22.33/FOXE1). We further tested two SNPs near IRF6 and one near FOXE1 by engineering the genome of induced pluripotent stem cells, differentiating the cells into embryonic oral epithelium, and measuring expression of IRF6 or FOXE1 and binding of transcription factors; the results strongly supported their candidacy. Conditional analyses of a meta-analysis of GWAS suggest that the two functional SNPs near IRF6 account for the majority of risk for CL/P associated with variation at this locus. This study connects genetic variation associated with orofacial cleft to mechanisms of pathogenesis.
7
Sant C, Mucke L, Corces MR. CHOIR improves significance-based detection of cell types and states from single-cell data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2024.01.18.576317. [PMID: 38328105 PMCID: PMC10849522 DOI: 10.1101/2024.01.18.576317] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 02/09/2024]
Abstract
Clustering is a critical step in the analysis of single-cell data, as it enables the discovery and characterization of putative cell types and states. However, most popular clustering tools do not subject clustering results to statistical inference testing, leading to risks of overclustering or underclustering data and often resulting in ineffective identification of cell types with widely differing prevalence. To address these challenges, we present CHOIR (clustering hierarchy optimization by iterative random forests), which applies a framework of random forest classifiers and permutation tests across a hierarchical clustering tree to statistically determine which clusters represent distinct populations. We demonstrate the enhanced performance of CHOIR through extensive benchmarking against 14 existing clustering methods across 100 simulated and 4 real single-cell RNA-seq, ATAC-seq, spatial transcriptomic, and multi-omic datasets. CHOIR can be applied to any single-cell data type and provides a flexible, scalable, and robust solution to the important challenge of identifying biologically relevant cell groupings within heterogeneous single-cell data.
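The idea of testing whether two candidate clusters are statistically distinct can be sketched with a small permutation test. This is only an illustration under stated assumptions, not CHOIR's implementation: a nearest-centroid classifier stands in for the random forest named in the abstract, and the function names (`holdout_accuracy`, `permutation_p_value`), the 50/50 train/test split, and the number of permutations are all hypothetical choices.

```python
import random

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def holdout_accuracy(points, labels, rng):
    """Fit a nearest-centroid classifier on a random half, score on the rest.

    Stand-in for a random forest (assumption made for brevity).
    """
    idx = list(range(len(points)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    cents = {}
    for lab in set(labels):
        members = [points[i] for i in train if labels[i] == lab]
        if not members:
            return 0.5  # degenerate split: treat as chance level for two labels
        cents[lab] = centroid(members)

    def predict(p):
        return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(p, cents[lab])))

    return sum(predict(points[i]) == labels[i] for i in test) / len(test)

def permutation_p_value(points, labels, n_perm=200, seed=0):
    """Share of label shufflings whose accuracy reaches the observed accuracy."""
    rng = random.Random(seed)
    observed = holdout_accuracy(points, labels, rng)
    exceed = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if holdout_accuracy(points, shuffled, rng) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (exceed + 1) / (n_perm + 1)
```

A small p-value suggests the two groups are separable beyond what label shuffling produces, so they would be kept as distinct clusters; a large p-value would argue for merging them.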
Affiliation(s)
- Cathrine Sant
  - Gladstone Institute of Neurological Disease, Gladstone Institutes, San Francisco, CA, USA
  - Neuroscience Graduate Program, University of California, San Francisco, San Francisco, CA 94158, USA
- Lennart Mucke
  - Gladstone Institute of Neurological Disease, Gladstone Institutes, San Francisco, CA, USA
  - Neuroscience Graduate Program, University of California, San Francisco, San Francisco, CA 94158, USA
  - Department of Neurology and Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, USA
- M. Ryan Corces
  - Gladstone Institute of Neurological Disease, Gladstone Institutes, San Francisco, CA, USA
  - Neuroscience Graduate Program, University of California, San Francisco, San Francisco, CA 94158, USA
  - Department of Neurology and Weill Institute for Neurosciences, University of California, San Francisco, San Francisco, CA 94158, USA
8
Czech E, Millar TR, Tyler W, White T, Elsworth B, Guez J, Hancox J, Jeffery B, Karczewski KJ, Miles A, Tallman S, Unneberg P, Wojdyla R, Zabad S, Hammerbacher J, Kelleher J. Analysis-ready VCF at Biobank scale using Zarr. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2024.06.11.598241. [PMID: 38915693 PMCID: PMC11195102 DOI: 10.1101/2024.06.11.598241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/26/2024]
Abstract
Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed.
Results: Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs.
Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows.
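The contrast between row-wise and column-wise access drawn in this abstract can be illustrated with plain Python containers. This is a toy sketch only: it does not use the Zarr library or the VCF Zarr schema, and the field names (`pos`, `qual`, `genotypes`) and helper functions are hypothetical.

```python
# Toy variant records; field names are illustrative, not the VCF Zarr schema.
rows = [
    {"pos": 101, "qual": 50.0, "genotypes": [0, 1, 1]},
    {"pos": 205, "qual": 12.5, "genotypes": [0, 0, 1]},
    {"pos": 318, "qual": 99.0, "genotypes": [1, 1, 1]},
]

def to_columns(rows):
    """Pivot row-wise records into one list per field (column-wise layout)."""
    return {field: [r[field] for r in rows] for field in rows[0]}

def mean_qual_row_wise(rows):
    # Every record is visited (and, on disk, would be decoded) to reach one field.
    return sum(r["qual"] for r in rows) / len(rows)

def mean_qual_column_wise(columns):
    # Only the "qual" column is read; the other fields are never touched.
    q = columns["qual"]
    return sum(q) / len(q)

def sample_genotypes(columns, sample_index):
    """Per-sample access: one sample's calls across all variants."""
    return [g[sample_index] for g in columns["genotypes"]]
```

Both layouts give the same answers; the difference is how much unrelated data must be read to get them, which is the cost the abstract attributes to row-encoded VCF at biobank scale.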
Affiliation(s)
- Eric Czech
  - Open Athena AI Foundation, Lincoln, New Zealand
  - Related Sciences, Lincoln, New Zealand
- Timothy R. Millar
  - The New Zealand Institute for Plant & Food Research Ltd, Lincoln, New Zealand
  - Department of Biochemistry, School of Biomedical Sciences, University of Otago, Dunedin, New Zealand
- Tom White
  - Tom White Consulting Ltd., Manchester, UK
- Jérémy Guez
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
- Ben Jeffery
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
- Konrad J. Karczewski
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
  - Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Alistair Miles
  - Wellcome Sanger Institute, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Sam Tallman
  - Genomics England, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Per Unneberg
  - Department of Cell and Molecular Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Shadi Zabad
  - School of Computer Science, McGill University, Montreal, QC, Canada
- Jeff Hammerbacher
  - Open Athena AI Foundation, Lincoln, New Zealand
  - Related Sciences, Lincoln, New Zealand
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, UK
9
Koyyalagunta D, Ganesh K, Morris Q. Inferring cancer type-specific patterns of metastatic spread using Metient. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2024.07.09.602790. [PMID: 39282311 PMCID: PMC11398359 DOI: 10.1101/2024.07.09.602790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/22/2024]
Abstract
Cancers differ in how they establish metastases. These differences can be studied by reconstructing the metastatic spread of a cancer from sequencing data of multiple tumors. Current methods to do so are limited by computational scalability and rely on technical assumptions that do not reflect current clinical knowledge. Metient overcomes these limitations using gradient-based, multi-objective optimization to generate multiple hypotheses of metastatic spread and rescores these hypotheses using independent data on genetic distance and organotropism. Unlike current methods, Metient can be used with both clinical sequencing data and barcode-based lineage tracing in preclinical models, enhancing its translatability across systems. In a reanalysis of metastasis in 169 patients and 490 tumors, Metient automatically identifies cancer type-specific trends of metastatic dissemination in melanoma, high-risk neuroblastoma, and non-small cell lung cancer. Its reconstructions often align with expert analyses but frequently reveal more plausible migration histories, including those with more metastasis-to-metastasis seeding and higher polyclonal seeding, offering new avenues for targeting metastatic cells. Metient's findings challenge existing assumptions about metastatic spread, enhance our understanding of cancer type-specific metastasis, and offer insights that inform future clinical treatment strategies of metastasis.
Affiliation(s)
- Divya Koyyalagunta
  - Tri-Institutional Graduate Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, NY 10065, USA
  - Computational and Systems Biology Program, Sloan Kettering Institute, New York, NY 10065, USA
- Karuna Ganesh
  - Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY, USA
  - Molecular Pharmacology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Quaid Morris
  - Tri-Institutional Graduate Program in Computational Biology and Medicine, Weill Cornell Medicine, New York, NY 10065, USA
  - Computational and Systems Biology Program, Sloan Kettering Institute, New York, NY 10065, USA
10
Li Q, Nichols C, Welner RS, Chen JY, Ku WS, Yue Z. Toden-E: Topology-Based and Density-Based Ensembled Clustering for the Development of Super-PAG in Functional Genomics using PAG Network and LLM. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.10.20.619308. [PMID: 39484450 PMCID: PMC11526983 DOI: 10.1101/2024.10.20.619308] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2024]
Abstract
The integrative analysis of gene sets, networks, and pathways is pivotal for deciphering omics data in translational biomedical research. To significantly increase gene coverage and enhance the utility of pathways, annotated gene lists, and gene signatures from diverse sources, we introduced pathways, annotated gene lists, and gene signatures (PAGs) enriched with metadata to represent biological functions. Furthermore, we established PAG-PAG networks by leveraging gene member similarity and gene regulations. However, in practice, high similarity in functional descriptions or gene membership often leads to redundant PAGs, hindering the interpretation from a fuzzy enriched PAG list. In this study, we developed todenE (topology-based and density-based ensemble) clustering, pioneering in integrating topology-based and density-based clustering methods to detect PAG communities leveraging the PAG network and Large Language Models (LLM). In computational genomics annotation, the genes can be grouped/clustered through the gene relationships and gene functions via guilt by association. Similarly, PAGs can be grouped into higher-level clusters, forming concise functional representations called Super-PAGs. TodenE captures PAG-PAG similarity and encapsulates functional information through LLM, in characterizing network-based functional Super-PAGs. In synthetic data, we introduced a metric called the Disparity Index (DI), measuring the connectivity of gene neighbors to gauge clusterability. We compared multiple clustering algorithms to identify the best method for generating performance-driven clusters. In non-simulated data (Gene Ontology), by leveraging transfer learning and LLM, we formed a language-based similarity embedding. TodenE utilizes this embedding together with the topology-based embedding to generate putative Super-PAGs with superior performance in semantic and gene member inclusiveness.
11
Turner TC, Pittman FS, Zhang H, Hymel LA, Zheng T, Behara M, Anderson SE, Harrer JA, Link KA, Ahammed MA, Maner-Smith K, Liu X, Yin X, Lim HS, Spite M, Qiu P, García AJ, Mortensen LJ, Jang YC, Willett NJ, Botchwey EA. Improving Functional Muscle Regeneration in Volumetric Muscle Loss Injuries by Shifting the Balance of Inflammatory and Pro-Resolving Lipid Mediators. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.09.06.611741. [PMID: 39314313 PMCID: PMC11418947 DOI: 10.1101/2024.09.06.611741] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/25/2024]
Abstract
Severe tissue loss resulting from extremity trauma, such as volumetric muscle loss (VML), poses significant clinical challenges for both general and military populations. VML disrupts the endogenous tissue repair mechanisms, resulting in acute and unresolved chronic inflammation and immune cell presence, impaired muscle healing, scar tissue formation, persistent pain, and permanent functional deficits. The aberrant healing response is preceded by acute inflammation and immune cell infiltration which does not resolve. We analyzed the biosynthesis of inflammatory and specialized pro-resolving lipid mediators (SPMs) after VML injury in two different models; muscle with critical-sized defects had a decreased capacity to biosynthesize SPMs, leading to dysregulated and persistent inflammation. We developed a modular poly(ethylene glycol)-maleimide hydrogel platform to locally release a stable isomer of Resolvin D1 (AT-RvD1) and promote endogenous pathways of inflammation resolution in the two muscle models. The local delivery of AT-RvD1 enhanced muscle regeneration, improved muscle function, and reduced pain sensitivity after VML by promoting molecular and cellular resolution of inflammation. These findings provide new insights into the pathogenesis of VML and establish a pro-resolving hydrogel therapeutic as a promising strategy for promoting functional muscle regeneration after traumatic injury.
12
Sampaio IW, Tassi E, Bellani M, Benedetti F, Nenadic I, Phillips M, Piras F, Yatham L, Bianchi AM, Brambilla P, Maggioni E. A generalizable normative deep autoencoder for brain morphological anomaly detection: application to the multi-site StratiBip dataset on bipolar disorder in an external validation framework. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.09.04.611239. [PMID: 39282436 PMCID: PMC11398360 DOI: 10.1101/2024.09.04.611239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2024]
Abstract
The heterogeneity of psychiatric disorders makes researching disorder-specific neurobiological markers an ill-posed problem. Here, we address the need for disease stratification models by presenting a generalizable multivariate normative modelling framework for characterizing brain morphology, applied to bipolar disorder (BD). We employed deep autoencoders in an anomaly detection framework, combined with a confounder removal step integrating training and external validation. The model was trained with healthy control (HC) data from the human connectome project and applied to multi-site external data of HC and BD individuals. We found that brain deviation scores were greater, more heterogeneous, and had more extreme values in the BD group, with volumes prominently from the basal ganglia, hippocampus and adjacent regions emerging as significantly deviating. Similarly, individual brain deviation maps based on modified z scores showed a higher occurrence of abnormalities, but their overall spatial overlap was lower compared to HCs. Our generalizable framework enabled the identification of subject- and group-level brain normative-deviating patterns, a step forward towards the development of more effective and personalized clinical decision support systems and patient stratification in psychiatry.
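The abstract mentions deviation maps based on modified z scores without giving a formula. A minimal sketch follows, assuming the common median and median-absolute-deviation (MAD) definition with the 0.6745 scaling constant; the function name `modified_z_score` and the use of a healthy-control reference sample for a single feature are illustrative assumptions, not the authors' confirmed method.

```python
from statistics import median

def modified_z_score(value, reference):
    """Robust deviation of `value` from a reference sample.

    Assumed definition: 0.6745 * (value - median) / MAD, where MAD is the
    median absolute deviation of the reference sample.
    """
    med = median(reference)
    mad = median(abs(v - med) for v in reference)
    if mad == 0:
        raise ValueError("MAD is zero; the modified z score is undefined")
    return 0.6745 * (value - med) / mad
```

Because the median and MAD are insensitive to a few extreme reference values, a score of this kind stays interpretable when the reference sample itself contains outliers, which is the usual reason to prefer it over a mean and standard deviation.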
Affiliation(s)
- Inês Won Sampaio
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
- Emma Tassi
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
  - Department of Neurosciences and Mental Health, Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milan, Italy
- Marcella Bellani
  - Department of Neurosciences, Biomedicine and Movement Sciences, Section of Psychiatry, University of Verona, Verona, Italy
- Francesco Benedetti
  - Division of Neuroscience, Unit of Psychiatry and Clinical Psychobiology, IRCCS Ospedale San Raffaele, Milan, Italy
- Igor Nenadic
  - Department of Psychiatry and Psychotherapy, Philipps-University Marburg, Marburg, Germany
- Mary Phillips
  - Department of Psychiatry, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Lakshmi Yatham
  - Department of Psychiatry, University of British Columbia, Vancouver, BC, Canada
- Anna Maria Bianchi
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
- Paolo Brambilla
  - Department of Neurosciences and Mental Health, Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico, Milan, Italy
  - Department of Pathophysiology and Transplantation, University of Milan, Milan, Italy
- Eleonora Maggioni
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy
13
Lehle JD, Lin YH, Gomez A, Chavez L, McCarrey JR. Endocrine disruptor-induced epimutagenesis in vitro: Insight into molecular mechanisms. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.05.574355. [PMID: 38746310 PMCID: PMC11092511 DOI: 10.1101/2024.01.05.574355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]
Abstract
Endocrine disrupting chemicals (EDCs) such as bisphenol S (BPS) are xenobiotic compounds that can disrupt endocrine signaling following exposure due to steric similarities to endogenous hormones within the body. EDCs have been shown to induce disruptions in normal epigenetic programming (epimutations) that accompany dysregulation of normal gene expression patterns that appear to predispose disease states. Most interestingly, the prevalence of epimutations following exposure to many different EDCs often persists over multiple subsequent generations, even with no further exposure to the causative EDC. Many previous studies have described both the direct and prolonged effects of EDC exposure in animal models, but many questions remain about molecular mechanisms by which EDCs initially induce epimutations or contribute to the propagation of EDC-induced epimutations either within the exposed generation or to subsequent generations. Additional questions remain regarding the extent to which there may be differences in cell-type specific susceptibilities to various EDCs, and whether this susceptibility is correlative with expression of relevant hormone receptors and/or the location of relevant hormone response elements (HREs) in the genome. To address these questions, we exposed cultured mouse pluripotent (induced pluripotent stem [iPS]), somatic (Sertoli and granulosa), and germ (primordial germ cell like [PGCLC]) cells to BPS and measured changes in DNA methylation levels at the epigenomic level and gene expression at the transcriptomic level. We found that there was indeed a difference in cell-type specific susceptibility to EDC-induced epimutagenesis and that this susceptibility correlated with differential expression of relevant hormone receptors and, in many cases, tended to generate epimutations near relevant HREs within the genome. 
However, we also found that BPS can induce epimutations in a cell type that does not express relevant receptors and in genomic regions that do not contain relevant HREs, suggesting that both canonical and non-canonical signaling mechanisms can be disrupted by BPS exposure. Most interestingly, we found that when iPS cells were exposed to BPS and then induced to differentiate into PGCLCs, the prevalence of epimutations and differentially expressed genes (DEGs) initially induced in the iPSCs was largely retained in the resulting PGCLCs; however, >90% of the specific epimutations and DEGs were not conserved but were instead replaced by novel epimutations and DEGs following the iPSC-to-PGCLC transition. These results are consistent with a unique concept that many EDC-induced epimutations may normally be corrected by germline and/or embryonic epigenetic reprogramming but that, due to disruption of the underlying chromatin architecture induced by the EDC exposure, many novel epimutations may also emerge during the reprogramming process. Thus, it appears that following exposure to a disruptive agent such as an EDC, a prevalence of epimutations may persist through epigenetic reprogramming even though most individual epimutations are not conserved during this process.
|
14
|
Liu Y, Carbonetto P, Willwerscheid J, Oakes SA, Macleod KF, Stephens M. Dissecting tumor transcriptional heterogeneity from single-cell RNA-seq data by generalized binary covariance decomposition. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.08.15.553436. [PMID: 37645713 PMCID: PMC10462040 DOI: 10.1101/2023.08.15.553436] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/31/2023]
Abstract
Profiling tumors with single-cell RNA sequencing (scRNA-seq) has the potential to identify recurrent patterns of transcriptional variation related to cancer progression, and to produce new therapeutically relevant insights. However, the presence of strong inter-tumor heterogeneity often obscures more subtle patterns that are shared across tumors, some of which may characterize clinically relevant disease subtypes. Here we introduce a new statistical method, generalized binary covariance decomposition (GBCD), to address this problem. We show that GBCD can help decompose transcriptional heterogeneity into interpretable components - including patient-specific, dataset-specific and shared components relevant to disease subtypes - and that, in the presence of strong inter-tumor heterogeneity, it can produce more interpretable results than existing methods. Applied to data from three studies on pancreatic ductal adenocarcinoma (PDAC), GBCD produces a refined characterization of existing tumor subtypes (e.g., classical vs. basal), and identifies a new gene expression program (GEP) that is prognostic of poor survival independent of established prognostic factors such as tumor stage and subtype. The new GEP is enriched for genes involved in a variety of stress responses, and suggests a potentially important role for the integrated stress response in PDAC development and prognosis.
|
15
|
Belica CA, Carpenter MA, Chen Y, Brown WL, Moeller NH, Boylan IT, Harris RS, Aihara H. A real-time biochemical assay for quantitative analyses of APOBEC-catalyzed DNA deamination. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.11.593688. [PMID: 38766133 PMCID: PMC11100776 DOI: 10.1101/2024.05.11.593688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]
Abstract
Over the past decade, the connection between APOBEC3 cytosine deaminases and cancer mutagenesis has become increasingly apparent. This growing awareness has created a need for biochemical tools that can be used to identify and characterize potential inhibitors of this enzyme family. In response to this challenge, we have developed a Real-time APOBEC3-mediated DNA Deamination (RADD) assay. This assay offers a single-step set-up and real-time fluorescent read-out, and it is capable of providing insights into enzyme kinetics and also offering a high-sensitivity and easily scalable method for identifying APOBEC3 inhibitors. This assay serves as a crucial addition to the existing APOBEC3 biochemical and cellular toolkit and possesses the versatility to be readily adapted into a high-throughput format for inhibitor discovery.
Affiliation(s)
- Christopher A. Belica
  - Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, Minnesota, 55455, USA
  - Institute for Molecular Virology, University of Minnesota, Minneapolis, Minnesota, 55455, USA
  - Masonic Cancer Center, University of Minnesota, Minneapolis, Minnesota, 55455, USA
- Michael A. Carpenter
  - Department of Biochemistry and Structural Biology, University of Texas Health San Antonio, San Antonio, Texas, 78229, USA
  - Howard Hughes Medical Institute, University of Texas Health San Antonio, San Antonio, Texas, 78229, USA
- Yanjun Chen
  - Department of Biochemistry and Structural Biology, University of Texas Health San Antonio, San Antonio, Texas, 78229, USA
- William L. Brown
  - Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, Minnesota, 55455, USA
  - Institute for Molecular Virology, University of Minnesota, Minneapolis, Minnesota, 55455, USA
  - Masonic Cancer Center, University of Minnesota, Minneapolis, Minnesota, 55455, USA
- Nicholas H. Moeller
  - Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, Minnesota, 55455, USA
  - Institute for Molecular Virology, University of Minnesota, Minneapolis, Minnesota, 55455, USA
  - Masonic Cancer Center, University of Minnesota, Minneapolis, Minnesota, 55455, USA
- Ian T. Boylan
  - Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, Minnesota, 55455, USA
  - Masonic Cancer Center, University of Minnesota, Minneapolis, Minnesota, 55455, USA
- Reuben S. Harris
  - Department of Biochemistry and Structural Biology, University of Texas Health San Antonio, San Antonio, Texas, 78229, USA
  - Howard Hughes Medical Institute, University of Texas Health San Antonio, San Antonio, Texas, 78229, USA
- Hideki Aihara
  - Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, Minnesota, 55455, USA
  - Institute for Molecular Virology, University of Minnesota, Minneapolis, Minnesota, 55455, USA
  - Masonic Cancer Center, University of Minnesota, Minneapolis, Minnesota, 55455, USA
|
16
|
Tagami D, Bisschop G, Kelleher J. tstrait: a quantitative trait simulator for ancestral recombination graphs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.03.13.584790. [PMID: 38559118 PMCID: PMC10980058 DOI: 10.1101/2024.03.13.584790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Summary: Ancestral recombination graphs (ARGs) encode the ensemble of correlated genealogical trees arising from recombination in a compact and efficient structure, and are of fundamental importance in population and statistical genetics. Recent breakthroughs have made it possible to simulate and infer ARGs at biobank scale, and there is now intense interest in using ARG-based methods across a broad range of applications, particularly in genome-wide association studies (GWAS). Sophisticated methods exist to simulate ARGs using population genetics models, but there is currently no software to simulate quantitative traits directly from these ARGs. To apply existing quantitative trait simulators users must export genotype data, losing important information about ancestral processes and producing prohibitively large files when applied to the biobank-scale datasets currently of interest in GWAS. We present tstrait, an open-source Python library to simulate quantitative traits on ARGs, and show how this user-friendly software can quickly simulate phenotypes for biobank-scale datasets on a laptop computer. Availability and Implementation: tstrait is available for download on the Python Package Index. Full documentation with examples and workflow templates is available on https://tskit.dev/tstrait/docs/, and the development version is maintained on GitHub (https://github.com/tskit-dev/tstrait). Contact: daiki.tagami@hertford.ox.ac.uk.
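To make the idea of quantitative trait simulation concrete, the sketch below illustrates the standard additive model in plain Python: a genetic value is the sum of genotype dosages times effect sizes, and environmental noise is scaled so that the genetic variance makes up a target narrow-sense heritability. This is an assumption-laden illustration on an exported genotype matrix, not tstrait's API and not how tstrait operates on ARGs; the function name, genotype matrix, and effect sizes are all hypothetical.

```python
import random


def simulate_phenotypes(genotypes, effect_sizes, h2, seed=0):
    """Simulate phenotypes under an additive model (illustrative only).

    genotypes: one row per individual, one dosage (0, 1 or 2) per causal site.
    effect_sizes: one effect size per causal site.
    h2: target narrow-sense heritability, 0 < h2 <= 1.
    Returns (phenotypes, genetic_values).
    """
    if not 0 < h2 <= 1:
        raise ValueError("h2 must lie in (0, 1]")
    rng = random.Random(seed)

    # Genetic value: dosage-weighted sum of effect sizes for each individual.
    genetic = [sum(g * b for g, b in zip(row, effect_sizes)) for row in genotypes]

    # Choose the noise variance so that Var(G) / (Var(G) + Var(E)) equals h2.
    n = len(genetic)
    mean_g = sum(genetic) / n
    var_g = sum((g - mean_g) ** 2 for g in genetic) / n
    sd_e = (var_g * (1 - h2) / h2) ** 0.5

    phenotypes = [g + rng.gauss(0.0, sd_e) for g in genetic]
    return phenotypes, genetic


# Hypothetical data: three individuals, three causal sites.
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1]]
effects = [0.5, -0.2, 0.1]
phenotypes, genetic_values = simulate_phenotypes(genotypes, effects, h2=0.5, seed=1)
print(genetic_values)
```

The abstract's point is that building such a genotype matrix is exactly the export step that becomes prohibitive at biobank scale, which is why simulating directly on the ARG is attractive.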
Affiliation(s)
- Daiki Tagami
  - Department of Statistics, University of Oxford, 24-29 St Giles’, Oxford OX1 3LB, United Kingdom
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
- Gertjan Bisschop
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
- Jerome Kelleher
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Old Road Campus, Oxford OX3 7LF, United Kingdom
|
17
|
Zeng X, Ding Y, Zhang Y, Uddin MR, Dabouei A, Xu M. DUAL: deep unsupervised simultaneous simulation and denoising for cryo-electron tomography. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.03.02.583135. [PMID: 38496657 PMCID: PMC10942334 DOI: 10.1101/2024.03.02.583135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2024]
Abstract
Recent biotechnological developments in cryo-electron tomography (cryo-ET) allow direct visualization of native sub-cellular structures in unprecedented detail and provide essential information on protein functions/dysfunctions. Denoising can enhance the visualization of protein structures and distributions. Automatic annotation via data simulation can ameliorate the time-consuming manual labeling of large-scale datasets. Here, we combine these two major cryo-ET tasks in DUAL, using a specific cyclic generative adversarial network with novel noise disentanglement. This enables end-to-end unsupervised learning that requires no labeled data for training. The denoising branch outperforms existing works and substantially improves downstream particle picking accuracy on benchmark datasets. The simulation branch provides learning-based cryo-ET simulation for the first time and generates synthetic tomograms indistinguishable from experimental ones. Through comprehensive evaluations, we showcase the effectiveness of DUAL in detecting macromolecular complexes across a wide range of molecular weights in experimental datasets. The versatility of DUAL is expected to empower cryo-ET researchers by improving visual interpretability, enhancing structural detection accuracy, expediting annotation processes, facilitating cross-domain model adaptability, and compensating for missing wedge artifacts. Our work represents a significant advancement in the unsupervised mining of protein structures in cryo-ET, offering a multifaceted tool that facilitates cryo-ET research.
Affiliation(s)
- Xiangrui Zeng
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
- Yizhe Ding
  - Department of Statistics, The Pennsylvania State University, University Park, PA, 16802, USA
- Yueqian Zhang
  - School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, Singapore
- Mostofa Rafid Uddin
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
- Ali Dabouei
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
- Min Xu
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
|
18
|
Nwizu C, Hughes M, Ramseier ML, Navia AW, Shalek AK, Fusi N, Raghavan S, Winter PS, Amini AP, Crawford L. Scalable nonparametric clustering with unified marker gene selection for single-cell RNA-seq data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.11.579839. [PMID: 38405697 PMCID: PMC10888887 DOI: 10.1101/2024.02.11.579839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/27/2024]
Abstract
Clustering is commonly used in single-cell RNA-sequencing (scRNA-seq) pipelines to characterize cellular heterogeneity. However, current methods face two main limitations. First, they require user-specified heuristics which add time and complexity to bioinformatic workflows; second, they rely on post-selective differential expression analyses to identify marker genes driving cluster differences, which has been shown to be subject to inflated false discovery rates. We address these challenges by introducing nonparametric clustering of single-cell populations (NCLUSION): an infinite mixture model that leverages Bayesian sparse priors to identify marker genes while simultaneously performing clustering on single-cell expression data. NCLUSION uses a scalable variational inference algorithm to perform these analyses on datasets with up to millions of cells. By analyzing publicly available scRNA-seq studies, we demonstrate that NCLUSION (i) matches the performance of other state-of-the-art clustering techniques with significantly reduced runtime and (ii) provides statistically robust and biologically relevant transcriptomic signatures for each of the clusters it identifies. Overall, NCLUSION represents a reliable hypothesis-generating tool for understanding patterns of expression variation present in single-cell populations.
Affiliation(s)
- Chibuikem Nwizu
  - Center for Computational Molecular Biology, Brown University, Providence, RI, USA
  - Warren Alpert Medical School of Brown University, Providence, RI, USA
- Michelle L. Ramseier
  - Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Andrew W. Navia
  - Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA
- Alex K. Shalek
  - Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Harvard Medical School, Boston, MA, USA
  - Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, USA
- Srivatsan Raghavan
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
  - Department of Medicine, Brigham and Women’s Hospital, Boston, MA, USA
- Peter S. Winter
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA
- Lorin Crawford
  - Center for Computational Molecular Biology, Brown University, Providence, RI, USA
  - Microsoft Research, Cambridge, MA, USA
  - Department of Biostatistics, Brown University, Providence, RI, USA
|
19
|
Wang Z, Zhan Q, Yang S, Mu S, Chen J, Garai S, Orzechowski P, Wagenaar J, Shen L. QOT: Efficient Computation of Sample Level Distance Matrix from Single-Cell Omics Data through Quantized Optimal Transport. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.06.578032. [PMID: 38370767 PMCID: PMC10871252 DOI: 10.1101/2024.02.06.578032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
Single-cell technologies have emerged as a transformative tool, enabling high-dimensional characterization of cell populations at an unprecedented scale. The data's innate complexity and voluminous nature pose significant computational and analytical challenges, especially in comparative studies delineating cellular architectures across various biological conditions (i.e., generation of sample level distance matrices). Optimal Transport (OT) is a mathematical tool that captures the intrinsic structure of data geometrically and has been applied to many bioinformatics tasks. In this paper, we propose QOT (Quantized Optimal Transport), a new method that enables efficient computation of a sample level distance matrix from large-scale single-cell omics data through a quantization step. We apply our algorithm to real-world single-cell genomics and pathomics datasets, aiming to extrapolate cell-level insights to inform sample level categorizations. Our empirical study shows that QOT outperforms OT-based algorithms in terms of accuracy and robustness when obtaining a distance matrix at the sample level from high throughput single-cell measures. Moreover, the sample level distance matrix could be used in downstream analysis (e.g., to uncover the trajectory of disease progression), highlighting its usefulness in biomedical informatics and data science.
Affiliation(s)
- Zexuan Wang
  - Graduate Group in Applied Mathematics and Computational Science, University of Pennsylvania
- Qipeng Zhan
  - Graduate Group in Applied Mathematics and Computational Science, University of Pennsylvania
- Shu Yang
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
- Shizhuo Mu
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
- Jiong Chen
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
- Sumita Garai
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
- Patryk Orzechowski
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
  - AGH University of Science and Technology, Poland
- Joost Wagenaar
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
- Li Shen
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
|
20
|
Muller E, Shiryan I, Borenstein E. Multi-omic integration of microbiome data for identifying disease-associated modules. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.07.03.547607. [PMID: 37461534 PMCID: PMC10349976 DOI: 10.1101/2023.07.03.547607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/27/2023]
Abstract
The human gut microbiome is a complex ecosystem with profound implications for health and disease. This recognition has led to a surge in multi-omic microbiome studies, employing various molecular assays to elucidate the microbiome's role in diseases across multiple functional layers. However, despite the clear value of these multi-omic datasets, rigorous integrative analysis of such data poses significant challenges, hindering a comprehensive understanding of microbiome-disease interactions. Perhaps most notably, multiple approaches, including univariate and multivariate analyses, as well as machine learning, have been applied to such data to identify disease-associated markers, namely, specific features (e.g., species, pathways, metabolites) that are significantly altered in disease state. These methods, however, often yield extensive lists of features associated with the disease without effectively capturing the multi-layered structure of multi-omic data or offering clear, interpretable hypotheses about underlying microbiome-disease mechanisms. Here, we address this challenge by introducing MintTea - an intermediate integration-based method for analyzing multi-omic microbiome data. MintTea combines a canonical correlation analysis (CCA) extension, consensus analysis, and an evaluation protocol to robustly identify disease-associated multi-omic modules. Each such module consists of a set of features from the various omics that both shift in concord, and collectively associate with the disease. Applying MintTea to diverse case-control cohorts with multi-omic data, we show that this framework is able to capture modules with high predictive power for disease, significant cross-omic correlations, and alignment with known microbiome-disease associations. 
For example, analyzing samples from a metabolic syndrome (MS) study, we found an MS-associated module comprising a highly correlated cluster of serum glutamate- and TCA cycle-related metabolites, as well as bacterial species previously implicated in insulin resistance. In another cohort, we identified a module associated with late-stage colorectal cancer, featuring Peptostreptococcus and Gemella species and several fecal amino acids, in agreement with these species' reported role in the metabolism of these amino acids and their coordinated increase in abundance during disease development. Finally, comparing modules identified in different datasets, we detected multiple significant overlaps, suggesting common interactions between microbiome features. Combined, this work serves as a proof of concept for the potential benefits of advanced integration methods in generating integrated multi-omic hypotheses underlying microbiome-disease interactions and a promising avenue for researchers seeking systems-level insights into coherent mechanisms governing microbiome-related diseases.
|
21
|
Adelus ML, Ding J, Tran BT, Conklin AC, Golebiewski AK, Stolze LK, Whalen MB, Cusanovich DA, Romanoski CE. Single cell 'omic profiles of human aortic endothelial cells in vitro and human atherosclerotic lesions ex vivo reveals heterogeneity of endothelial subtype and response to activating perturbations. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.04.03.535495. [PMID: 37066416 PMCID: PMC10104082 DOI: 10.1101/2023.04.03.535495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/18/2023]
Abstract
Objective Endothelial cells (ECs), macrophages, and vascular smooth muscle cells (VSMCs) are major cell types in atherosclerosis progression, and heterogeneity in EC sub-phenotypes is becoming increasingly appreciated. Still, studies quantifying EC heterogeneity across whole transcriptomes and epigenomes in both in vitro and in vivo models are lacking. Approach and Results To create an in vitro dataset to study human EC heterogeneity, multiomic profiling concurrently measuring transcriptomes and accessible chromatin in the same single cells was performed on six distinct primary cultures of human aortic ECs (HAECs). To model pro-inflammatory and activating environments characteristic of the atherosclerotic microenvironment in vitro, HAECs from at least three donors were exposed to three distinct perturbations with their respective controls: transforming growth factor beta-2 (TGFB2), interleukin-1 beta (IL1B), and siRNA-mediated knock-down of the endothelial transcription factor ERG (siERG). To form a comprehensive in vivo/ex vivo dataset of human atherosclerotic cell types, meta-analysis of single cell transcriptomes across 17 human arterial specimens was performed. Two computational approaches quantitatively evaluated the similarity in molecular profiles between heterogeneous in vitro and in vivo cell profiles. HAEC cultures were reproducibly populated by 4 major clusters with distinct pathway enrichment profiles: EC1-angiogenic, EC2-proliferative, EC3-activated/mesenchymal-like, and EC4-mesenchymal. Exposure to siERG, IL1B or TGFB2 elicited mostly distinct transcriptional and accessible chromatin responses. EC1 and EC2, the most canonically 'healthy' EC populations, were affected predominantly by siERG; the activated cluster EC3 was most responsive to IL1B; and the mesenchymal population EC4 was most affected by TGFB2. 
Quantitative comparisons between in vitro and in vivo transcriptomes confirmed EC1 and EC2 as most canonically EC-like, and EC4 as most mesenchymal with minimal effects elicited by siERG and IL1B. Lastly, accessible chromatin regions unique to EC2 and EC4 were most enriched for coronary artery disease (CAD)-associated SNPs from GWAS, suggesting these cell phenotypes harbor CAD-modulating mechanisms. Conclusion Primary EC cultures contain markedly heterogeneous cell subtypes defined by their molecular profiles. Surprisingly, the perturbations used here, which have been reported by others to be involved in the pathogenesis of atherosclerosis as well as to induce endothelial-to-mesenchymal transition (EndMT), only modestly shifted cells between subpopulations, suggesting relatively stable molecular phenotypes in culture. Identifying consistently heterogeneous EC subpopulations between in vitro and in vivo models should pave the way for improving in vitro systems while enabling study of the mechanisms governing heterogeneous cell state decisions.
Affiliation(s)
- Maria L. Adelus
  - The Department of Cellular and Molecular Medicine, The University of Arizona, Tucson, AZ 85721, USA
  - The Clinical Translational Sciences Graduate Program, The University of Arizona, Tucson, AZ, 85721, USA
- Jiacheng Ding
  - The Department of Cellular and Molecular Medicine, The University of Arizona, Tucson, AZ 85721, USA
- Binh T. Tran
  - The Department of Cellular and Molecular Medicine, The University of Arizona, Tucson, AZ 85721, USA
- Austin C. Conklin
  - The Department of Cellular and Molecular Medicine, The University of Arizona, Tucson, AZ 85721, USA
- Anna K. Golebiewski
  - The Department of Cellular and Molecular Medicine, The University of Arizona, Tucson, AZ 85721, USA
- Lindsey K. Stolze
  - The Department of Cellular and Molecular Medicine, The University of Arizona, Tucson, AZ 85721, USA
- Michael B. Whalen
  - The Department of Cellular and Molecular Medicine, The University of Arizona, Tucson, AZ 85721, USA
- Darren A. Cusanovich
  - The Department of Cellular and Molecular Medicine, The University of Arizona, Tucson, AZ 85721, USA
  - Asthma and Airway Disease Research Center, The University of Arizona, Tucson, AZ, 85721, USA
- Casey E. Romanoski
  - The Department of Cellular and Molecular Medicine, The University of Arizona, Tucson, AZ 85721, USA
  - The Clinical Translational Sciences Graduate Program, The University of Arizona, Tucson, AZ, 85721, USA
  - Asthma and Airway Disease Research Center, The University of Arizona, Tucson, AZ, 85721, USA
|
22
|
Wieder C, Cooke J, Frainay C, Poupin N, Bowler R, Jourdan F, Kechris KJ, Lai RP, Ebbels T. PathIntegrate: Multivariate modelling approaches for pathway-based multi-omics data integration. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.09.574780. [PMID: 38260498 PMCID: PMC10802464 DOI: 10.1101/2024.01.09.574780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
As terabytes of multi-omics data are being generated, there is an ever-increasing need for methods facilitating the integration and interpretation of such data. Current multi-omics integration methods typically output lists, clusters, or subnetworks of molecules related to an outcome. Even with expert domain knowledge, discerning the biological processes involved is a time-consuming activity. Here we propose PathIntegrate, a method for integrating multi-omics datasets based on pathways, designed to exploit knowledge of biological systems and thus provide interpretable models for such studies. PathIntegrate employs single-sample pathway analysis to transform multi-omics datasets from the molecular to the pathway-level, and applies a predictive single-view or multi-view model to integrate the data. Model outputs include multi-omics pathways ranked by their contribution to the outcome prediction, the contribution of each omics layer, and the importance of each molecule in a pathway. Using semi-synthetic data we demonstrate the benefit of grouping molecules into pathways to detect signals in low signal-to-noise scenarios, as well as the ability of PathIntegrate to precisely identify important pathways at low effect sizes. Finally, using COPD and COVID-19 data we showcase how PathIntegrate enables convenient integration and interpretation of complex high-dimensional multi-omics datasets. The PathIntegrate Python package is available at https://github.com/cwieder/PathIntegrate.
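The molecular-to-pathway transformation described above can be illustrated with a deliberately simple scoring rule. The sketch below standardizes each molecule across samples and averages the z-scores of a pathway's members, giving one score per pathway per sample. This is an assumption made for illustration: the abstract does not say which single-sample pathway analysis method is used, and the function name, molecule names, and pathway definitions here are hypothetical rather than part of the PathIntegrate package.

```python
from statistics import mean, pstdev


def pathway_scores(abundance, pathways):
    """Collapse molecule-level data to pathway-level scores (mean z-score).

    abundance: {molecule: [value for each sample]}
    pathways: {pathway name: [member molecules]}
    Returns {pathway name: [score for each sample]}.
    """
    # Standardize each molecule across samples; constant molecules score 0.
    z = {}
    for molecule, values in abundance.items():
        mu, sd = mean(values), pstdev(values)
        z[molecule] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]

    n_samples = len(next(iter(abundance.values())))
    scores = {}
    for name, members in pathways.items():
        measured = [m for m in members if m in z]
        if not measured:
            continue  # skip pathways with no measured members
        scores[name] = [
            mean(z[m][i] for m in measured) for i in range(n_samples)
        ]
    return scores


# Hypothetical data: three molecules measured in three samples.
abundance = {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0], "c": [5.0, 5.0, 5.0]}
pathways = {"P1": ["a", "b"], "P2": ["c", "unmeasured"], "P3": ["unmeasured"]}
scores = pathway_scores(abundance, pathways)
print(scores["P1"])
```

The resulting sample-by-pathway table is what a downstream predictive model would consume, and averaging over members is one way that weak but concordant molecular signals can add up at the pathway level.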
Affiliation(s)
- Cecilia Wieder
  - Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion, and Reproduction, Faculty of Medicine, Imperial College London, London, United Kingdom
- Juliette Cooke
  - Toxalim (Research Centre in Food Toxicology), Université de Toulouse, INRAE, ENVT, INP-Purpan, UPS, Toulouse, France
- Clement Frainay
  - Toxalim (Research Centre in Food Toxicology), Université de Toulouse, INRAE, ENVT, INP-Purpan, UPS, Toulouse, France
- Nathalie Poupin
  - Toxalim (Research Centre in Food Toxicology), Université de Toulouse, INRAE, ENVT, INP-Purpan, UPS, Toulouse, France
- Russell Bowler
  - National Jewish Health, 1400 Jackson Street, Denver, CO, 80206, USA
- Fabien Jourdan
  - MetaboHUB-Metatoul, National Infrastructure of Metabolomics and Fluxomics, Toulouse, France
- Katerina J Kechris
  - Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, United States of America
- Rachel Pj Lai
  - Department of Infectious Disease, Faculty of Medicine, Imperial College London, London, United Kingdom
- Timothy Ebbels
  - Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion, and Reproduction, Faculty of Medicine, Imperial College London, London, United Kingdom
|
23
|
Shen Y, Yu L, Qiu Y, Zhang T, Kingsford C. Improving Hi-C contact matrices using genome graphs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.08.566275. [PMID: 37986943 PMCID: PMC10659349 DOI: 10.1101/2023.11.08.566275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2023]
Abstract
Three-dimensional chromosome structure plays an important role in fundamental genomic functions. Hi-C, a high-throughput, sequencing-based technique, has drastically expanded our comprehension of 3D chromosome structures. The first step of the Hi-C analysis pipeline involves mapping sequencing reads from Hi-C to linear reference genomes. However, the linear reference genome does not incorporate genetic variation information, which can lead to incorrect read alignments, especially when analyzing samples with substantial genomic differences from the reference, such as cancer samples. Using genome graphs as the reference facilitates more accurate mapping of reads. However, new algorithms are required for inferring linear genomes from Hi-C reads mapped on genome graphs and for constructing the corresponding Hi-C contact matrices, which is a prerequisite for subsequent steps of Hi-C analysis such as identifying topologically associated domains and calling chromatin loops. We introduce the problem of genome sequence inference from Hi-C data mediated by genome graphs. We formalize this problem, show the hardness of solving it, and introduce a novel heuristic algorithm specifically tailored to it. We provide a theoretical analysis to evaluate the efficacy of our algorithm. Finally, our empirical experiments indicate that the linear genomes inferred by our method lead to improved Hi-C contact matrices. These enhanced matrices show a reduction in erroneous patterns caused by structural variations and capture the structures of topologically associated domains more accurately.
Affiliation(s)
- Yihang Shen
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA
- Lingge Yu
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA
- Yutong Qiu
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA
- Tianyu Zhang
- Department of Statistics and Data Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA
- Carl Kingsford
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA
|
24
|
Dhakal A, Gyawali R, Wang L, Cheng J. CryoTransformer: A Transformer Model for Picking Protein Particles from Cryo-EM Micrographs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.10.19.563155. [PMID: 37961171 PMCID: PMC10634673 DOI: 10.1101/2023.10.19.563155] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/15/2023]
Abstract
Cryo-electron microscopy (cryo-EM) is a powerful technique for determining the structures of large protein complexes. Picking single protein particles from cryo-EM micrographs (images) is a crucial step in reconstructing protein structures from them. However, the widely used template-based particle picking process requires some manual particle picking and is labor-intensive and time-consuming. Though machine learning and artificial intelligence (AI) can potentially automate particle picking, the current AI methods pick particles with low precision or low recall. The erroneously picked particles can severely reduce the quality of reconstructed protein structures, especially for micrographs with low signal-to-noise ratios (SNR). To address these shortcomings, we devised CryoTransformer, based on transformers, residual networks, and image processing techniques, to accurately pick protein particles from cryo-EM micrographs. CryoTransformer was trained and tested on CryoPPP, the largest labelled cryo-EM protein particle dataset. It outperforms the current state-of-the-art machine learning methods of particle picking in terms of the resolution of 3D density maps reconstructed from the picked particles as well as F1-score, and is poised to facilitate the automation of cryo-EM protein particle picking.
Affiliation(s)
- Ashwin Dhakal
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
- NextGen Precision Health, University of Missouri, Columbia, MO 65211, USA
- Rajan Gyawali
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
- NextGen Precision Health, University of Missouri, Columbia, MO 65211, USA
- Liguo Wang
- Laboratory for BioMolecular Structure (LBMS), Brookhaven National Laboratory, Upton, NY 11973, USA
- Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO 65211, USA
- NextGen Precision Health, University of Missouri, Columbia, MO 65211, USA
|
25
|
Hristov BH, Noble WS, Bertero A. Systematic identification of inter-chromosomal interaction networks supports the existence of RNA factories. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.09.21.558852. [PMID: 37790381 PMCID: PMC10542540 DOI: 10.1101/2023.09.21.558852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2023]
Abstract
Most studies of genome organization have focused on intra-chromosomal (cis) contacts because they harbor key features such as DNA loops and topologically associating domains. Inter-chromosomal (trans) contacts have received much less attention, and tools for interrogating potential biologically relevant trans structures are lacking. Here, we develop a computational framework to identify sets of loci that jointly interact in trans from Hi-C data. This method, trans-C, initiates probabilistic random walks with restarts from a set of seed loci to traverse an input Hi-C contact network, thereby identifying sets of trans-contacting loci. We validate trans-C in three increasingly complex models of established trans contacts: the Plasmodium falciparum var genes, the mouse olfactory receptor "Greek islands", and the human RBM20 cardiac splicing factory. We then apply trans-C to systematically test the hypothesis that genes co-regulated by the same trans-acting element (i.e., a transcription or splicing factor) co-localize in three dimensions to form "RNA factories" that maximize the efficiency and accuracy of RNA biogenesis. We find that many loci with multiple binding sites of the same transcription factor interact with one another in trans, especially those bound by transcription factors with intrinsically disordered domains. Similarly, clustered binding of a subset of RNA binding proteins correlates with trans interaction of the encoding loci. These findings support the existence of trans interacting chromatin domains (TIDs) driven by RNA biogenesis. Trans-C provides an efficient computational framework for studying these and other types of trans interactions, empowering studies of a poorly understood aspect of genome architecture.
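The random walk with restart that the abstract describes can be illustrated on a toy contact matrix. The sketch below is a generic textbook formulation, not the trans-C implementation: the function name, the restart probability of 0.5, and the 4-locus matrix are all assumptions chosen for illustration.

```python
def random_walk_with_restart(contacts, seeds, restart=0.5, tol=1e-12, max_iter=10000):
    """Return stationary visiting probabilities of a walk that restarts at the seeds.

    contacts: square list-of-lists of non-negative contact weights (hypothetical input).
    seeds:    list of distinct locus indices where the walk (re)starts.
    """
    n = len(contacts)
    # Column sums are used to turn raw contact weights into transition probabilities.
    col_sums = [sum(contacts[i][j] for i in range(n)) for j in range(n)]
    # Restart distribution: uniform over the seed loci.
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = []
        for i in range(n):
            # Probability mass flowing into locus i from every neighbour j.
            flow = sum(
                contacts[i][j] / col_sums[j] * p[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            nxt.append((1.0 - restart) * flow + restart * p0[i])
        # Stop once the distribution no longer changes (L1 distance below tol).
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy symmetric contact matrix: loci 0 and 1 contact each other strongly.
toy_contacts = [
    [0, 5, 1, 0],
    [5, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(toy_contacts, seeds=[0])
ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
```

Loci that rank highly in `ranked` are those the walker visits most often when it keeps returning to the seed, which is the intuition behind using such scores to nominate sets of jointly interacting loci.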
Affiliation(s)
- William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle, USA
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
- Alessandro Bertero
- Molecular Biotechnology Center “Guido Tarone”, Dept. of Molecular Biotechnology and Health Sciences, University of Turin, Torino, Italy
|
26
|
Alston JJ, Soranno A, Holehouse AS. Conserved molecular recognition by an intrinsically disordered region in the absence of sequence conservation. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.06.552128. [PMID: 37609146 PMCID: PMC10441348 DOI: 10.1101/2023.08.06.552128] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 08/24/2023]
Abstract
Intrinsically disordered regions (IDRs) are critical for cellular function, yet often appear to lack sequence conservation when assessed by multiple sequence alignments. This raises the question of if and how function can be encoded and preserved in these regions despite massive sequence variation. To address this question, we have applied coarse-grained molecular dynamics simulations to investigate non-specific RNA binding of coronavirus nucleocapsid proteins. Coronavirus nucleocapsid proteins consist of multiple interspersed disordered and folded domains that bind RNA. We focussed here on the first two domains of coronavirus nucleocapsid proteins, the disordered N-terminal domain (NTD) followed by the folded RNA binding domain (RBD). While the NTD is highly variable across evolution, the RBD is structurally conserved. This combination makes the NTD-RBD a convenient model system to explore the interplay between an IDR adjacent to a folded domain, and how changes in IDR sequence can influence molecular recognition of a partner. Our results reveal a surprising degree of sequence-specificity encoded by both the composition and the precise order of the amino acids in the NTD. The presence of an NTD can - depending on the sequence - either suppress or enhance RNA binding. Despite this sensitivity, large-scale variation in NTD sequences is possible while certain sequence features are retained. Consequently, a conformationally-conserved fuzzy RNA:protein complex is found across nucleocapsid protein orthologs, despite large-scale changes in both NTD sequence and RBD surface chemistry. Taken together, these insights shed light on the ability of disordered regions to preserve functional characteristics despite their sequence variability.
|
27
|
Hope J, Beckerle T, Cheng PH, Viavattine Z, Feldkamp M, Fausner S, Saxena K, Ko E, Hryb I, Carter R, Ebner T, Kodandaramaiah S. Brain-wide neural recordings in mice navigating physical spaces enabled by a cranial exoskeleton. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.06.04.543578. [PMID: 37333228 PMCID: PMC10274744 DOI: 10.1101/2023.06.04.543578] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/20/2023]
Abstract
Complex behaviors are mediated by neural computations occurring throughout the brain. In recent years, tremendous progress has been made in developing technologies that can record neural activity at cellular resolution at multiple spatial and temporal scales. However, these technologies are primarily designed for studying the mammalian brain during head fixation - wherein the behavior of the animal is highly constrained. Miniaturized devices for studying neural activity in freely behaving animals are largely confined to recording from small brain regions owing to performance limitations. We present a cranial exoskeleton that assists mice in maneuvering neural recording headstages that are orders of magnitude larger and heavier than the mice, while they navigate physical behavioral environments. Force sensors embedded within the headstage are used to detect the mouse's milli-Newton scale cranial forces which then control the x, y, and yaw motion of the exoskeleton via an admittance controller. We discovered optimal controller tuning parameters that enable mice to locomote at physiologically realistic velocities and accelerations while maintaining natural walking gait. Mice maneuvering headstages weighing up to 1.5 kg can make turns, navigate 2D arenas, and perform a navigational decision-making task with the same performance as when freely behaving. We designed an imaging headstage and an electrophysiology headstage for the cranial exoskeleton to record brain-wide neural activity in mice navigating 2D arenas. The imaging headstage enabled recordings of Ca2+ activity of 1000s of neurons distributed across the dorsal cortex. The electrophysiology headstage supported independent control of up to 4 silicon probes, enabling simultaneous recordings from 100s of neurons across multiple brain regions and multiple days. 
Cranial exoskeletons provide flexible platforms for large-scale neural recording during the exploration of physical spaces, a critical new paradigm for unraveling the brain-wide neural mechanisms that control complex behavior.
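The admittance controller mentioned in the abstract maps a measured force to a commanded velocity through a virtual mass and damper. The sketch below shows that general idea for a single axis only; it is not the authors' controller, and the virtual mass, damping, time step, and 10 mN input are hypothetical values picked for illustration.

```python
def admittance_step(velocity, force, mass, damping, dt):
    """Advance the commanded velocity by one time step.

    Implements mass * dv/dt + damping * v = force with explicit Euler integration.
    """
    accel = (force - damping * velocity) / mass
    return velocity + accel * dt


def simulate_constant_force(force, mass=0.05, damping=0.5, dt=0.001, steps=2000):
    """Return the commanded velocity after applying a constant force from rest.

    mass (kg), damping (N*s/m), and dt (s) are assumed tuning values, not
    parameters reported in the source.
    """
    v = 0.0
    for _ in range(steps):
        v = admittance_step(v, force, mass, damping, dt)
    return v


# A constant 10 mN push settles at force / damping once the transient has decayed.
final_velocity = simulate_constant_force(0.01)
```

Lower virtual damping gives a higher steady-state velocity for the same force, and lower virtual mass shortens the time to reach it, which is why tuning these two parameters shapes how natural the resulting motion feels.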
Affiliation(s)
- James Hope
- Department of Mechanical Engineering, University of Minnesota, Twin Cities
- Travis Beckerle
- Department of Mechanical Engineering, University of Minnesota, Twin Cities
- Pin-Hao Cheng
- Department of Mechanical Engineering, University of Minnesota, Twin Cities
- Zoey Viavattine
- Department of Mechanical Engineering, University of Minnesota, Twin Cities
- Michael Feldkamp
- Department of Mechanical Engineering, University of Minnesota, Twin Cities
- Skylar Fausner
- Department of Mechanical Engineering, University of Minnesota, Twin Cities
- Kapil Saxena
- Department of Mechanical Engineering, University of Minnesota, Twin Cities
- Eunsong Ko
- Department of Mechanical Engineering, University of Minnesota, Twin Cities
- Ihor Hryb
- Department of Mechanical Engineering, University of Minnesota, Twin Cities
- Department of Neuroscience, University of Minnesota, Twin Cities
- Russell Carter
- Department of Biomedical Engineering, University of Minnesota, Twin Cities
- Timothy Ebner
- Department of Biomedical Engineering, University of Minnesota, Twin Cities
- Suhasa Kodandaramaiah
- Department of Mechanical Engineering, University of Minnesota, Twin Cities
- Department of Biomedical Engineering, University of Minnesota, Twin Cities
- Department of Neuroscience, University of Minnesota, Twin Cities
|
28
|
Xiao Y, Hou Y, Zhou H, Diallo G, Fiszman M, Wolfson J, Kilicoglu H, Chen Y, Su C, Xu H, Mantyh WG, Zhang R. Repurposing Non-pharmacological Interventions for Alzheimer's Diseases through Link Prediction on Biomedical Literature. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2023:2023.05.15.23290002. [PMID: 37292731 PMCID: PMC10246059 DOI: 10.1101/2023.05.15.23290002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Recently, computational drug repurposing has emerged as a promising method for identifying new pharmaceutical interventions (PI) for Alzheimer's Disease (AD). Non-pharmaceutical interventions (NPI), such as vitamin E and music therapy, have great potential to improve cognitive function and slow the progression of AD, but have largely been unexplored. This study predicts novel NPIs for AD through link prediction on our developed biomedical knowledge graph. We constructed a comprehensive knowledge graph containing AD concepts and various potential interventions, called ADInt, by integrating a dietary supplement domain knowledge graph, SuppKG, with semantic relations from the SemMedDB database. Four knowledge graph embedding models (TransE, RotatE, DistMult and ComplEx) and two graph convolutional network models (R-GCN and CompGCN) were compared to learn the representation of ADInt. R-GCN outperformed the other models on the time slice test set and the clinical trial test set and was used to generate the score tables of the link prediction task. Discovery patterns were applied to generate mechanism pathways for high-scoring triples. ADInt had 162,213 nodes and 1,017,319 edges. The graph convolutional network model, R-GCN, performed best in both the Time Slicing test set (MR = 7.099, MRR = 0.5007, Hits@1 = 0.4112, Hits@3 = 0.5058, Hits@10 = 0.6804) and the Clinical Trials test set (MR = 1.731, MRR = 0.8582, Hits@1 = 0.7906, Hits@3 = 0.9033, Hits@10 = 0.9848). Among high-scoring triples in the link prediction results, we found the plausible mechanism pathways of (Photodynamic therapy, PREVENTS, Alzheimer's Disease) and (Choerospondias axillaris, PREVENTS, Alzheimer's Disease) by discovery patterns and discussed them further. In conclusion, we presented a novel methodology to extend an existing knowledge graph and discover NPIs (dietary supplements (DS) and complementary and integrative health (CIH)) for AD.
We used discovery patterns to find mechanisms for predicted triples to solve the poor interpretability of artificial neural networks. Our method can potentially be applied to other clinical problems, such as discovering drug adverse reactions and drug-drug interactions.
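The rank-based metrics reported in the abstract (MR, MRR, Hits@k) follow standard definitions for link prediction. The sketch below computes them from a list of ranks, where each rank is the 1-based position of the true entity among all scored candidates; the function name and the example ranks are hypothetical and are not taken from the study.

```python
def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute mean rank, mean reciprocal rank, and Hits@k from 1-based ranks."""
    n = len(ranks)
    metrics = {
        # Mean rank: average position of the correct entity (lower is better).
        "MR": sum(ranks) / n,
        # Mean reciprocal rank: average of 1/rank (higher is better, at most 1).
        "MRR": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        # Hits@k: fraction of test triples whose correct entity is ranked in the top k.
        metrics[f"Hits@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics


# Hypothetical ranks for four test triples.
example = ranking_metrics([1, 2, 4, 20])
```

MR is sensitive to a few badly ranked triples, whereas MRR and Hits@k emphasise the top of the ranking, which is why the three are usually reported together.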
|
29
|
Dhakal A, Gyawali R, Wang L, Cheng J. CryoPPP: A Large Expert-Labelled Cryo-EM Image Dataset for Machine Learning Protein Particle Picking. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.02.21.529443. [PMID: 36865277 PMCID: PMC9980126 DOI: 10.1101/2023.02.21.529443] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 02/25/2023]
Abstract
Cryo-electron microscopy (cryo-EM) is currently the most powerful technique for determining the structures of large protein complexes and assemblies. Picking single-protein particles from cryo-EM micrographs (images) is a key step in reconstructing protein structures. However, the widely used template-based particle picking process is labor-intensive and time-consuming. Though the emerging machine learning-based particle picking can potentially automate the process, its development is severely hindered by lack of large, high-quality, manually labelled training data. Here, we present CryoPPP, a large, diverse, expert-curated cryo-EM image dataset for single protein particle picking and analysis to address this bottleneck. It consists of manually labelled cryo-EM micrographs of 32 non-redundant, representative protein datasets selected from the Electron Microscopy Public Image Archive (EMPIAR). It includes 9,089 diverse, high-resolution micrographs (∼300 cryo-EM images per EMPIAR dataset) in which the coordinates of protein particles were labelled by human experts. The protein particle labelling process was rigorously validated by both 2D particle class validation and 3D density map validation with the gold standard. The dataset is expected to greatly facilitate the development of machine learning and artificial intelligence methods for automated cryo-EM protein particle picking. The dataset and data processing scripts are available at https://github.com/BioinfoMachineLearning/cryoppp.
Affiliation(s)
- Ashwin Dhakal
- Department of Electrical Engineering and Computer Science, NextGen Precision Health, University of Missouri, Columbia, MO 65211, USA. Fax: 573-882-8318
- Rajan Gyawali
- Department of Electrical Engineering and Computer Science, NextGen Precision Health, University of Missouri, Columbia, MO 65211, USA. Fax: 573-882-8318
- Liguo Wang
- Laboratory for BioMolecular Structure (LBMS), Brookhaven National Laboratory, Upton, NY 11973, USA
- Jianlin Cheng
- Department of Electrical Engineering and Computer Science, NextGen Precision Health, University of Missouri, Columbia, MO 65211, USA. Fax: 573-882-8318
|
30
|
Zhang J, Singh R. Investigating the Complexity of Gene Co-expression Estimation for Single-cell Data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.24.525447. [PMID: 36747724 PMCID: PMC9900775 DOI: 10.1101/2023.01.24.525447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2023]
Abstract
With the rapid advance of single-cell RNA sequencing (scRNA-seq) technology, understanding biological processes at a more refined single-cell level is becoming possible. Gene co-expression estimation is an essential step in this direction. It can annotate functionalities of unknown genes or construct the basis of gene regulatory network inference. This study thoroughly tests the existing gene co-expression estimation methods on simulation datasets with known ground truth co-expression networks. We generate these novel datasets using two simulation processes that use the parameters learned from the experimental data. We demonstrate that these simulations better capture the underlying properties of the real-world single-cell datasets than previously tested simulations for the task. Our performance results on tens of simulated and eight experimental datasets show that all methods produce estimations with a high false discovery rate potentially caused by high-sparsity levels in the data. Finally, we find that commonly used pre-processing approaches, such as normalization and imputation, do not improve the co-expression estimation. Overall, our benchmark setup contributes to the co-expression estimator development, and our study provides valuable insights for the community of single-cell data analyses.
Affiliation(s)
- Jiaqi Zhang
- Department of Computer Science, Brown University
- Ritambhara Singh
- Department of Computer Science, Center for Computational Molecular Biology, Brown University
|
31
|
Kinnaman MD, Zaccaria S, Makohon-Moore A, Arnold B, Levine M, Gundem G, Ossa JEA, Glodzik D, Rodríguez-Sánchez MI, Bouvier N, Li S, Stockfisch E, Dunigan M, Cobbs C, Bhanot U, You D, Mullen K, Melchor J, Ortiz MV, O'Donohue T, Slotkin E, Wexler LH, Dela Cruz FS, Hameed M, Glade Bender JL, Tap WD, Meyers PA, Papaemmanuil E, Kung AL, Iacobuzio-Donahue CA. Subclonal somatic copy number alterations emerge and dominate in recurrent osteosarcoma. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.05.522765. [PMID: 36711976 PMCID: PMC9881990 DOI: 10.1101/2023.01.05.522765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Multiple large-scale tumor genomic profiling efforts have been undertaken in osteosarcoma; however, little is known about the spatial and temporal intratumor heterogeneity and how it may drive treatment resistance. We performed whole-genome sequencing of 37 tumor samples from eight patients with relapsed or refractory osteosarcoma. Each patient had at least one sample from a primary site and a metastatic or relapse site. We identified subclonal copy number alterations in all but one patient. We observed that in five patients, a subclonal copy number clone from the primary tumor emerged and dominated at subsequent relapses. MYC gain/amplification was enriched in the treatment-resistant clone in 6 out of 7 patients with more than one clone. Amplifications in other potential driver genes, such as CCNE1, RAD21, VEGFA, and IGF1R, were also observed in the resistant copy number clones. Our study sheds light on intratumor heterogeneity and the potential drivers of treatment resistance in osteosarcoma.
Affiliation(s)
- Michael D Kinnaman
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Simone Zaccaria
- Cancer Research UK Lung Cancer Centre of Excellence, University College London Cancer Institute, London, UK
- Computational Cancer Genomics Research Group, University College London Cancer Institute, London, UK
- Alvin Makohon-Moore
- Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- David M. Rubenstein Center for Pancreatic Cancer Research, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York, USA
- Hackensack Meridian Health Center for Discovery and Innovation, Nutley, NJ, USA (current affiliation)
- Georgetown University Lombardi Comprehensive Cancer Center, Washington, DC, USA (current affiliation)
- Brian Arnold
- Department of Computer Science, Princeton University, Princeton, NJ, USA
- Center for Statistics and Machine Learning, Princeton University, Princeton, NJ, USA
- Max Levine
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Department of Epidemiology & Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Isabl, New York, NY, USA (current affiliation)
- Gunes Gundem
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Department of Epidemiology & Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Juan E Arango Ossa
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Department of Epidemiology & Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Dominik Glodzik
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Department of Epidemiology & Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA (current affiliation)
- M Irene Rodríguez-Sánchez
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Wunderman Thompson Health, New York, NY, USA (current affiliation)
- Nancy Bouvier
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- IT and Digital Initiatives, Memorial Sloan Kettering Cancer Center, New York, NY, USA (current affiliation)
- Shanita Li
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Emily Stockfisch
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Marisa Dunigan
- Integrated Genomics Operation Core, Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Cassidy Cobbs
- Integrated Genomics Operation Core, Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Umesh Bhanot
- Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Precision Pathology Biobanking Center, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Daoqi You
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Katelyn Mullen
- Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Gerstner Sloan Kettering Graduate School of Biomedical Sciences, New York, NY, USA
- Jerry Melchor
- Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- David M. Rubenstein Center for Pancreatic Cancer Research, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Michael V Ortiz
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Tara O'Donohue
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Emily Slotkin
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Leonard H Wexler
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Filemon S Dela Cruz
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Meera Hameed
- Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Julia L Glade Bender
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- William D Tap
- Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Paul A Meyers
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Elli Papaemmanuil
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Department of Epidemiology & Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Andrew L Kung
- Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Christine A Iacobuzio-Donahue
- Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- David M. Rubenstein Center for Pancreatic Cancer Research, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York, USA
|
32
|
Sashittal P, Zhang H, Iacobuzio-Donahue CA, Raphael BJ. ConDoR: Tumor phylogeny inference with a copy-number constrained mutation loss model. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.05.522408. [PMID: 36711528 PMCID: PMC9882003 DOI: 10.1101/2023.01.05.522408] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/09/2023]
Abstract
Tumors consist of subpopulations of cells that harbor distinct collections of somatic mutations. These mutations range in scale from single nucleotide variants (SNVs) to large-scale copy-number aberrations (CNAs). While many approaches infer tumor phylogenies using SNVs as phylogenetic markers, CNAs that overlap SNVs may lead to erroneous phylogenetic inference. Specifically, an SNV may be lost in a cell due to a deletion of the genomic segment containing the SNV. Unfortunately, no current single-cell DNA sequencing (scDNA-seq) technology produces accurate measurements of both SNVs and CNAs. For instance, recent targeted scDNA-seq technologies, such as Mission Bio Tapestri, measure SNVs with high fidelity in individual cells, but yield much less reliable measurements of CNAs. We introduce a new evolutionary model, the constrained k-Dollo model, that uses SNVs as phylogenetic markers and partial information about CNAs in the form of clustering of cells with similar copy-number profiles. This copy-number clustering constrains where loss of SNVs can occur in the phylogeny. We develop ConDoR (Constrained Dollo Reconstruction), an algorithm to infer tumor phylogenies from targeted scDNA-seq data using the constrained k-Dollo model. We show that ConDoR outperforms existing methods on simulated data. We use ConDoR to analyze a new multi-region targeted scDNA-seq dataset of 2153 cells from a pancreatic ductal adenocarcinoma (PDAC) tumor and produce a phylogeny that is more plausible than those from existing methods and that conforms to histological results for the tumor from a previous study. We also analyze a metastatic colorectal cancer dataset, deriving a more parsimonious phylogeny than previously published analyses and with a simpler monoclonal origin of metastasis compared to the original study. Code availability: Software is available at https://github.com/raphael-group/constrained-Dollo.
Affiliation(s)
- Haochen Zhang
- Gerstner Sloan Kettering Graduate School of Biomedical Sciences, Memorial Sloan Kettering Cancer Center, NY, USA
- Christine A. Iacobuzio-Donahue
- Human Oncology and Pathogenesis Program, Memorial Sloan Kettering Cancer Center, NY, USA
- David M. Rubenstein Center for Pancreatic Cancer Research, Memorial Sloan Kettering Cancer Center, NY, USA
- Department of Pathology and Laboratory Medicine, Memorial Sloan Kettering Cancer Center, NY, USA
|
33
|
Giorgashvili E, Reichel K, Caswara C, Kerimov V, Borsch T, Gruenstaeudl M. Software Choice and Sequencing Coverage Can Impact Plastid Genome Assembly-A Case Study in the Narrow Endemic Calligonum bakuense. FRONTIERS IN PLANT SCIENCE 2022; 13:779830. [PMID: 35874012 PMCID: PMC9296850 DOI: 10.3389/fpls.2022.779830] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/19/2021] [Accepted: 06/13/2022] [Indexed: 06/15/2023]
Abstract
Most plastid genome sequences are assembled from short-read whole-genome sequencing data, yet the impact that sequencing coverage and the choice of assembly software can have on the accuracy of the resulting assemblies is poorly understood. In this study, we test the impact of both factors on plastid genome assembly in the threatened and rare endemic shrub Calligonum bakuense. We aim to characterize the differences across plastid genome assemblies generated by different assembly software tools and levels of sequencing coverage and to determine if these differences are large enough to affect the phylogenetic position inferred for C. bakuense compared to congeners. Four assembly software tools (FastPlast, GetOrganelle, IOGA, and NOVOPlasty) and seven levels of sequencing coverage across the plastid genome (original sequencing depth, 2,000x, 1,000x, 500x, 250x, 100x, and 50x) are compared in our analyses. The resulting assemblies are evaluated with regard to reproducibility, contig number, gene complement, inverted repeat length, and computation time; the impact of sequence differences on phylogenetic reconstruction is assessed. Our results show that software choice can have a considerable impact on the accuracy and reproducibility of plastid genome assembly and that GetOrganelle produces the most consistent assemblies for C. bakuense. Moreover, we demonstrate that a sequencing coverage between 500x and 100x can reduce both the sequence variability across assembly contigs and computation time. When comparing the most reliable plastid genome assemblies of C. bakuense, a sequence difference in only three nucleotide positions is detected, which is less than the difference potentially introduced through software choice.
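Coverage levels such as 500x or 100x are typically reached by randomly subsampling reads. The sketch below shows the arithmetic only, assuming mean coverage is estimated as total sequenced bases divided by genome length; the read count, read length, and genome length are made-up illustration values, not figures from the study.

```python
def mean_coverage(n_reads, read_len, genome_len):
    """Estimate mean coverage as total sequenced bases / genome length."""
    return n_reads * read_len / genome_len


def subsample_fraction(current_cov, target_cov):
    """Fraction of reads to keep to reach target_cov (capped at 1.0)."""
    return min(1.0, target_cov / current_cov)


# Hypothetical input: 1,000,000 plastid reads of 150 bp on a 150,000 bp genome.
cov = mean_coverage(n_reads=1_000_000, read_len=150, genome_len=150_000)
fractions = {t: subsample_fraction(cov, t) for t in (2000, 1000, 500, 250, 100, 50)}
```

A target above the available coverage simply keeps every read, which is why the fraction is capped at 1.0.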
Affiliation(s)
- Eka Giorgashvili
  - Systematische Botanik und Pflanzengeographie, Institut für Biologie, Freie Universität Berlin, Berlin, Germany
- Katja Reichel
  - Systematische Botanik und Pflanzengeographie, Institut für Biologie, Freie Universität Berlin, Berlin, Germany
- Calvinna Caswara
  - Systematische Botanik und Pflanzengeographie, Institut für Biologie, Freie Universität Berlin, Berlin, Germany
- Vuqar Kerimov
  - Institute of Botany, Azerbaijan National Academy of Sciences (ANAS), Baku, Azerbaijan
- Thomas Borsch
  - Systematische Botanik und Pflanzengeographie, Institut für Biologie, Freie Universität Berlin, Berlin, Germany
  - Botanischer Garten und Botanisches Museum Berlin, Freie Universität Berlin, Berlin, Germany
- Michael Gruenstaeudl
  - Systematische Botanik und Pflanzengeographie, Institut für Biologie, Freie Universität Berlin, Berlin, Germany
|
34
|
Characterization of the complete chloroplast genome of Zephyranthes phycelloides (Amaryllidaceae, tribe Hippeastreae) from Atacama region of Chile. Saudi J Biol Sci 2022; 29:650-659. [PMID: 35002462 PMCID: PMC8716934 DOI: 10.1016/j.sjbs.2021.10.035] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/03/2021] [Revised: 10/13/2021] [Accepted: 10/14/2021] [Indexed: 11/21/2022] Open
Abstract
Sporadic rains in the Atacama Desert reveal a high biodiversity of plant species that only occur there. One of these rare species is the “Red añañuca” (Zephyranthes phycelloides), formerly known as Rhodophiala phycelloides. Many species of Zephyranthes in the Atacama Desert are severely threatened due to massive extraction of bulbs and cutting of flowers. Therefore, studies of the biodiversity of these endemic species, which are essential for their conservation, should be conducted sooner rather than later. Some chloroplast genomes are available for Amaryllidaceae species; however, there is no complete chloroplast genome available for any species of Zephyranthes subgenus Myostemma. The aim of the present work was to characterize and analyze the chloroplast genome of Z. phycelloides by NGS sequencing. The chloroplast genome of Z. phycelloides consists of 158,107 bp, with a typical quadripartite structure: a large single copy (LSC, 86,129 bp), a small single copy (SSC, 18,352 bp), and two inverted repeats (IR, 26,813 bp each). One hundred thirty-seven genes were identified: 87 coding genes, 8 rRNA, 38 tRNA and 4 pseudogenes. The number of SSRs was 64 in Z. phycelloides and a total of 43 repeats were detected. The phylogenetic analysis of Z. phycelloides shows a distinct subclade with respect to Z. mesochloa. The average nucleotide variability (Pi) between Z. phycelloides and Z. mesochloa was 0.02000, and seven loci with high variability were identified: psbA, trnS(GCU)-trnG(UCC), trnD(GUC)-trnY(GUA), trnL(UAA)-trnF(GAA), rbcL, psbE-petL and ndhG-ndhI. The differences between the species are further confirmed by the high number of SNPs between them. Here, we report for the first time the complete cp genome of one species of Zephyranthes subgenus Myostemma, which can be used for phylogenetic and population genomic studies.
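The reported region lengths and gene counts can be cross-checked with simple arithmetic. The short sketch below assumes the stated IR length of 26,813 bp applies to each of the two inverted repeat copies, which is the reading under which the parts sum to the reported genome size; the variable names are illustrative only.

```python
# Region lengths as reported in the abstract (bp).
lsc = 86_129   # large single copy
ssc = 18_352   # small single copy
ir = 26_813    # one inverted repeat copy (assumed to be the per-copy length)

# A quadripartite plastome is LSC + SSC + two IR copies.
genome_size = lsc + ssc + 2 * ir

# Gene categories as reported: coding, rRNA, tRNA, pseudogenes.
gene_total = 87 + 8 + 38 + 4
```

Both sums agree with the abstract: 158,107 bp for the genome and 137 genes in total.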
|
35
|
Pascual-Díaz JP, Garcia S, Vitales D. Plastome Diversity and Phylogenomic Relationships in Asteraceae. PLANTS (BASEL, SWITZERLAND) 2021; 10:plants10122699. [PMID: 34961169 PMCID: PMC8705268 DOI: 10.3390/plants10122699] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 11/05/2021] [Revised: 12/01/2021] [Accepted: 12/04/2021] [Indexed: 06/14/2023]
Abstract
Plastid genomes are in general highly conserved given their slow evolutionary rate, and thus large changes in their structure are unusual. However, when specific rearrangements are present, they are often phylogenetically informative. Asteraceae is a highly diverse family whose evolution has long been driven by polyploidy (up to 48x) and hybridization, both processes usually complicating systematic inferences. In this study, we generated one of the most comprehensive plastome-based phylogenies of family Asteraceae, providing information about the structure, genetic diversity and repeat composition of these sequences. By comparing the whole-plastome sequences obtained, we confirmed the double inversion located in the long single-copy region for most of the species analyzed (with the exception of basal tribes), a well-known feature of Asteraceae plastomes. We also showed that genome size, gene order and gene content are highly conserved across the family. However, species representative of the basal subfamily Barnadesioideae, as well as of the sister family Calyceraceae, lack the pseudogene rps19 located in one inverted repeat. The phylogenomic analysis conducted here, based on 63 protein-coding genes, 30 transfer RNA genes and 21 ribosomal RNA genes from 36 species of Asteraceae, was overall consistent with the general consensus for the family's phylogeny while resolving the position of tribe Senecioneae and revealing some incongruences at tribe level between reconstructions based on nuclear and plastid DNA data.
Affiliation(s)
- Joan Pere Pascual-Díaz
  - Institut Botànic de Barcelona (IBB-CSIC), Passeig del Migdia s/n, 08038 Barcelona, Spain
- Sònia Garcia
  - Institut Botànic de Barcelona (IBB-CSIC), Passeig del Migdia s/n, 08038 Barcelona, Spain
- Daniel Vitales
  - Institut Botànic de Barcelona (IBB-CSIC), Passeig del Migdia s/n, 08038 Barcelona, Spain
  - Laboratori de Botànica–Unitat Associada CSIC, Facultat de Farmàcia i Ciències de l’Alimentació, Universitat de Barcelona, Av. Joan XXIII 27-31, 08028 Barcelona, Spain
|
36
|
Lam MTY, Duttke SH, Odish MF, Le HD, Hansen EA, Nguyen CT, Trescott S, Kim R, Deota S, Chang MW, Patel A, Hepokoski M, Alotaibi M, Rolfsen M, Perofsky K, Warden AS, Foley J, Ramirez SI, Dan JM, Abbott RK, Crotty S, Crotty Alexander LE, Malhotra A, Panda S, Benner CW, Coufal NG. Profiling Transcription Initiation in Peripheral Leukocytes Reveals Severity-Associated Cis-Regulatory Elements in Critical COVID-19. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2021:2021.08.24.457187. [PMID: 34462742 PMCID: PMC8404884 DOI: 10.1101/2021.08.24.457187] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 01/07/2023]
Abstract
The contribution of transcription factors (TFs) and gene regulatory programs to the immune response to COVID-19, and their relationship to disease outcome, is not fully understood. Analysis of genome-wide changes in transcription at both promoter-proximal and distal cis-regulatory DNA elements, collectively termed the 'active cistrome,' offers an unbiased assessment of TF activity, identifying key pathways regulated in homeostasis or disease. Here, we profiled the active cistrome from peripheral leukocytes of critically ill COVID-19 patients to identify major regulatory programs and their dynamics during SARS-CoV-2-associated acute respiratory distress syndrome (ARDS). We identified TF motifs that track the severity of COVID-19 lung injury, disease resolution, and outcome. We used unbiased clustering to reveal distinct cistrome subsets delineating the regulation of pathways, cell types, and the combinatorial activity of TFs. We found critical roles for regulatory networks driven by stimulus- and lineage-determining TFs, showing that STAT and E2F/MYB regulatory programs targeting myeloid cells are activated in patients with poor disease outcomes and associated with single nucleotide genetic variants implicated in COVID-19 susceptibility. Integration with single-cell RNA-seq found that STAT and E2F/MYB activation converged in a specific neutrophil subset found in patients with severe disease. Collectively, we demonstrate that cistrome analysis facilitates insight into disease mechanisms and provides an unbiased approach to evaluate global changes in transcription factor activity and stratify patient disease severity.
Affiliation(s)
- Michael Tun Yin Lam
  - Division of Pulmonary, Critical Care, and Sleep Medicine, Department of Medicine, University of California, San Diego, CA, USA
  - Laboratory of Regulatory Biology, Salk Institute of Biological Studies, La Jolla, CA, USA
- Sascha H. Duttke
  - Division of Endocrinology and Metabolism, Department of Medicine, University of California, San Diego, CA, USA
- Mazen F. Odish
  - Division of Pulmonary, Critical Care, and Sleep Medicine, Department of Medicine, University of California, San Diego, CA, USA
- Hiep D. Le
  - Laboratory of Regulatory Biology, Salk Institute of Biological Studies, La Jolla, CA, USA
- Emily A. Hansen
  - Sanford Consortium for Regenerative Medicine, La Jolla, CA, USA
  - Department of Pediatrics, University of California, San Diego, CA, USA
- Samantha Trescott
  - Sanford Consortium for Regenerative Medicine, La Jolla, CA, USA
  - Department of Pediatrics, University of California, San Diego, CA, USA
- Roy Kim
  - Sanford Consortium for Regenerative Medicine, La Jolla, CA, USA
  - Department of Pediatrics, University of California, San Diego, CA, USA
- Shaunak Deota
  - Laboratory of Regulatory Biology, Salk Institute of Biological Studies, La Jolla, CA, USA
- Max W. Chang
  - Division of Endocrinology and Metabolism, Department of Medicine, University of California, San Diego, CA, USA
- Arjun Patel
  - Division of Pulmonary, Critical Care, and Sleep Medicine, Department of Medicine, University of California, San Diego, CA, USA
- Mark Hepokoski
  - Division of Pulmonary, Critical Care, and Sleep Medicine, Department of Medicine, University of California, San Diego, CA, USA
- Mona Alotaibi
  - Division of Pulmonary, Critical Care, and Sleep Medicine, Department of Medicine, University of California, San Diego, CA, USA
- Mark Rolfsen
  - Internal Medicine Residency Program, Department of Medicine, UC San Diego, CA, USA
- Katherine Perofsky
  - Department of Pediatrics, University of California, San Diego, CA, USA
  - Rady Children’s Hospital, San Diego, CA
- Anna S. Warden
  - Division of Endocrinology and Metabolism, Department of Medicine, University of California, San Diego, CA, USA
- Sydney I Ramirez
  - Division of Infectious Diseases, Department of Medicine, University of California, San Diego
  - Center for Infectious Diseases and Vaccine Research, La Jolla Institute for Immunology (LJI), La Jolla, CA
- Jennifer M. Dan
  - Division of Infectious Diseases, Department of Medicine, University of California, San Diego
  - Center for Infectious Diseases and Vaccine Research, La Jolla Institute for Immunology (LJI), La Jolla, CA
- Robert K Abbott
  - Center for Infectious Diseases and Vaccine Research, La Jolla Institute for Immunology (LJI), La Jolla, CA
  - Consortium for HIV/AIDS Vaccine Development (CHVAD), The Scripps Research Institute, La Jolla, CA, USA
- Shane Crotty
  - Center for Infectious Diseases and Vaccine Research, La Jolla Institute for Immunology (LJI), La Jolla, CA
- Laura E Crotty Alexander
  - Division of Pulmonary, Critical Care, and Sleep Medicine, Department of Medicine, University of California, San Diego, CA, USA
- Atul Malhotra
  - Division of Pulmonary, Critical Care, and Sleep Medicine, Department of Medicine, University of California, San Diego, CA, USA
- Satchidananda Panda
  - Laboratory of Regulatory Biology, Salk Institute of Biological Studies, La Jolla, CA, USA
- Christopher W. Benner
  - Division of Endocrinology and Metabolism, Department of Medicine, University of California, San Diego, CA, USA
- Nicole G. Coufal
  - Sanford Consortium for Regenerative Medicine, La Jolla, CA, USA
  - Department of Pediatrics, University of California, San Diego, CA, USA
  - Rady Children’s Hospital, San Diego, CA
|
37
|
Mehl T, Gruenstaeudl M. airpg: automatically accessing the inverted repeats of archived plastid genomes. BMC Bioinformatics 2021; 22:413. [PMID: 34418956 PMCID: PMC8379869 DOI: 10.1186/s12859-021-04309-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/16/2021] [Accepted: 07/26/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: In most flowering plants, the plastid genome exhibits a quadripartite genome structure, comprising a large and a small single copy as well as two inverted repeat regions. Thousands of plastid genomes have been sequenced and submitted to public sequence repositories in recent years. The quality of sequence annotations in many of these submissions is known to be problematic, especially regarding annotations that specify the length and location of the inverted repeats: such annotations are either missing or portray the length or location of the repeats incorrectly. However, many biological investigations employ publicly available plastid genomes at face value and implicitly assume the correctness of their sequence annotations. RESULTS: We introduce airpg, a Python package that automatically assesses the frequency of incomplete or incorrect annotations of the inverted repeats among publicly available plastid genomes. Specifically, the tool automatically retrieves plastid genomes from NCBI Nucleotide under variable search parameters, surveys them for length and location specifications of inverted repeats, and confirms any inverted repeat annotations through self-comparisons of the genome sequences. The package also includes functionality for automatic identification and removal of duplicate genome records and accounts for taxa that genuinely lack inverted repeats. A survey of the presence of inverted repeat annotations among all plastid genomes of flowering plants submitted to NCBI Nucleotide until the end of 2020 using airpg, followed by a statistical analysis of potential associations with record metadata, highlights that release year and publication status of the genome records have a significant effect on the frequency of complete and equal-length inverted repeat annotations. CONCLUSION: The number of plastid genomes on NCBI Nucleotide has increased dramatically in recent years, and many more genomes will likely be submitted over the next decade. airpg enables researchers to automatically access and evaluate the inverted repeats of these plastid genomes as well as their sequence annotations and, thus, contributes to increasing the reliability of publicly available plastid genomes. The software is freely available via the Python package index at http://pypi.python.org/pypi/airpg.
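Confirming an inverted repeat by comparing a genome sequence against itself can be pictured with a toy example. The sketch below is not airpg's code or API: it is a brute-force search, suitable only for short sequences, for the longest segment whose reverse complement occurs again downstream. The function names and the toy sequence are hypothetical.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_inverted_repeat(seq, min_len=4):
    """Find the longest segment whose reverse complement occurs downstream.

    Returns (start_a, start_b, length) or None. Brute force, toy-sized input only.
    """
    n = len(seq)
    for length in range(n // 2, min_len - 1, -1):
        for i in range(n - 2 * length + 1):
            rc = reverse_complement(seq[i:i + length])
            # Search only past the first copy so the two copies do not overlap.
            j = seq.find(rc, i + length)
            if j != -1:
                return i, j, length
    return None


# Toy "genome": a 7-bp segment and its reverse complement separated by a spacer.
toy = "TTTT" + "ACGGTCA" + "CCCCC" + reverse_complement("ACGGTCA") + "GG"
hit = longest_inverted_repeat(toy)
```

Real plastid inverted repeats span tens of kilobases, so a practical tool would rely on an alignment-based self-comparison rather than this exhaustive scan.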
Affiliation(s)
- Tilman Mehl
  - Institut für Bioinformatik, Freie Universität Berlin, 14195 Berlin, Germany
|
38
|
Mohanta TK, Mishra AK, Khan A, Hashem A, Abd_Allah EF, Al-Harrasi A. Gene Loss and Evolution of the Plastome. Genes (Basel) 2020; 11:E1133. [PMID: 32992972 PMCID: PMC7650654 DOI: 10.3390/genes11101133] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 07/16/2020] [Revised: 09/07/2020] [Accepted: 09/14/2020] [Indexed: 12/13/2022] Open
Abstract
Chloroplasts are unique organelles within plant cells and are responsible for sustaining life on Earth due to their ability to conduct photosynthesis. Multiple functional genes within the chloroplast are responsible for a variety of metabolic processes that occur in the chloroplast. Considering its fundamental role in sustaining life on Earth, it is important to identify the level of diversity present in the chloroplast genome, what genes and genomic content have been lost, what genes have been transferred to the nuclear genome, duplication events, and the overall origin and evolution of the chloroplast genome. Our analysis of 2511 chloroplast genomes indicated that the genome size and number of coding DNA sequences (CDS) in the chloroplast genomes of algae are higher relative to other lineages. Approximately 10.31% of the examined species, spanning all lineages, have lost the inverted repeats (IR) in the chloroplast genome. Genome-wide analyses revealed that the loss of the Rbcl gene in parasitic and heterotrophic plants occurred approximately 56 Ma ago. PsaM, Psb30, ChlB, ChlL, ChlN, and Rpl21 were found to be characteristic signature genes of the chloroplast genome of algae, bryophytes, pteridophytes, and gymnosperms; however, none of these genes were found in the angiosperm or magnoliid lineage, which appeared to have lost them approximately 203-156 Ma ago. A variety of chloroplast-encoded genes were lost across different species lineages throughout the evolutionary process. The Rpl20 gene, however, was found to be the most stable and intact gene in the chloroplast genome and was not lost in any of the analyzed species, suggesting that it is a signature gene of the plastome. Our evolutionary analysis indicated that chloroplast genomes evolved from multiple common ancestors ~1293 Ma ago and have undergone vivid recombination events across different taxonomic lineages.
Collapse
Affiliation(s)
- Tapan Kumar Mohanta
- Biotech and Omics Laboratory, Natural and Medical Sciences Research Centre, University of Nizwa, Nizwa 616, Oman;
| | | | - Adil Khan
- Biotech and Omics Laboratory, Natural and Medical Sciences Research Centre, University of Nizwa, Nizwa 616, Oman;
| | - Abeer Hashem
- Botany and Microbiology Department, College of Science, King Saud University, Riyadh 11451, Saudi Arabia;
- Mycology and Plant Disease Survey Department, Plant Pathology Research Institute, Giza 12511, Egypt
| | - Elsayed Fathi Abd_Allah
- Plant Production Department, College of Food and Agricultural Sciences, King Saud University, P.O. Box. 2460, Riyadh 11451, Saudi Arabia;
| | - Ahmed Al-Harrasi
- Natural Product Laboratory, Natural and Medical Sciences Research Centre, University of Nizwa, Nizwa 616, Oman
| |
Collapse
|