1
A global survey of prokaryotic genomes reveals the eco-evolutionary pressures driving horizontal gene transfer. Nat Ecol Evol 2024; 8:986-998. [PMID: 38443606 DOI: 10.1038/s41559-024-02357-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2023] [Accepted: 02/05/2024] [Indexed: 03/07/2024]
Abstract
Horizontal gene transfer, the exchange of genetic material through means other than reproduction, is a fundamental force in prokaryotic genome evolution. Genomic persistence of horizontally transferred genes has been shown to be influenced by both ecological and evolutionary factors. However, there is limited availability of ecological information about species other than the habitats from which they were isolated, which has prevented a deeper exploration of ecological contributions to horizontal gene transfer. Here we focus on transfers detected through comparison of individual gene trees to the species tree, assessing the distribution of gene-exchanging prokaryotes across over a million environmental sequencing samples. By analysing detected horizontal gene transfer events, we show distinct functional profiles for recent versus old events. Although most genes transferred are part of the accessory genome, genes transferred earlier in evolution tend to be more ubiquitous within present-day species. We find that co-occurring, interacting and high-abundance species tend to exchange more genes. Finally, we show that host-associated specialist species are most likely to exchange genes with other host-associated specialist species, whereas species found across different habitats have similar gene exchange rates irrespective of their preferred habitat. Our study covers an unprecedented scale of integrated horizontal gene transfer and environmental information, highlighting broad eco-evolutionary trends.

2
Enhancing coevolutionary signals in protein-protein interaction prediction through clade-wise alignment integration. Sci Rep 2024; 14:6009. [PMID: 38472223 PMCID: PMC10933411 DOI: 10.1038/s41598-024-55655-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Accepted: 02/26/2024] [Indexed: 03/14/2024]
Abstract
Protein-protein interactions (PPIs) play essential roles in most biological processes. The binding interfaces between interacting proteins impose evolutionary constraints that have successfully been employed to predict PPIs from multiple sequence alignments (MSAs). To construct MSAs, critical choices have to be made: how to ensure the reliable identification of orthologs, and how to optimally balance the need for large alignments versus sufficient alignment quality. Here, we propose a divide-and-conquer strategy for MSA generation: instead of building a single, large alignment for each protein, multiple distinct alignments are constructed under distinct clades in the tree of life. Coevolutionary signals are searched separately within these clades, and are only subsequently integrated using machine learning techniques. We find that this strategy markedly improves overall prediction performance, concomitant with better alignment quality. Using the popular DCA algorithm to systematically search pairs of such alignments, a genome-wide all-against-all interaction scan in a bacterial genome is demonstrated. Given the recent successes of AlphaFold in predicting direct PPIs at atomic detail, a discover-and-refine approach is proposed: our method could provide a fast and accurate strategy for pre-screening the entire genome, submitting to AlphaFold only promising interaction candidates, thus reducing false positives as well as computation time.

3
Identification of HDV-like theta ribozymes involved in tRNA-based recoding of gut bacteriophages. Nat Commun 2024; 15:1559. [PMID: 38378708 PMCID: PMC10879173 DOI: 10.1038/s41467-024-45653-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Accepted: 01/29/2024] [Indexed: 02/22/2024]
Abstract
Trillions of microorganisms, collectively known as the microbiome, inhabit our bodies with the gut microbiome being of particular interest in biomedical research. Bacteriophages, the dominant virome constituents, can utilize suppressor tRNAs to switch to alternative genetic codes (e.g., the UAG stop-codon is reassigned to glutamine) while infecting hosts with the standard bacterial code. However, what triggers this switch and how the bacteriophage manipulates its host is poorly understood. Here, we report the discovery of a subgroup of minimal hepatitis delta virus (HDV)-like ribozymes - theta ribozymes - potentially involved in the code switch leading to the expression of recoded lysis and structural phage genes. We demonstrate their HDV-like self-scission behavior in vitro and find them in an unreported context often located with their cleavage site adjacent to tRNAs, indicating a role in viral tRNA maturation and/or regulation. Every fifth associated tRNA is a suppressor tRNA, further strengthening our hypothesis. The vast abundance of tRNA-associated theta ribozymes - we provide 1753 unique examples - highlights the importance of small ribozymes as an alternative to large enzymes that usually process tRNA 3'-ends. Our discovery expands the short list of biological functions of small HDV-like ribozymes and introduces a previously unknown player likely involved in the code switch of certain recoded gut bacteriophages.

4
The SIB Swiss Institute of Bioinformatics Semantic Web of data. Nucleic Acids Res 2024; 52:D44-D51. [PMID: 37878411 PMCID: PMC10767860 DOI: 10.1093/nar/gkad902] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 10/02/2023] [Accepted: 10/05/2023] [Indexed: 10/27/2023]
Abstract
The SIB Swiss Institute of Bioinformatics (https://www.sib.swiss/) is a federation of bioinformatics research and service groups. The international life science community in academia and industry has been accessing the freely available databases provided by SIB since its inception in 1998. In this paper we present the 11 databases which currently offer semantically enriched data in accordance with the FAIR principles (Findable, Accessible, Interoperable, Reusable), as well as the Swiss Personalized Health Network initiative (SPHN) which also employs this enrichment. The semantic enrichment facilitates the manipulation of large data sets from public databases and private data sets. Examples are provided to illustrate that the data from the SIB databases can not only be queried using precise criteria individually, but also across multiple databases, including a variety of non-SIB databases. Data manipulation, be it exploration, extraction, annotation, combination, or publication, is possible using the SPARQL query language. Providing documentation, tutorials and sample queries makes it easier to navigate this web of semantic data. Through this paper, the reader will discover how the existing SIB knowledge graphs can be leveraged to tackle the complex biological or clinical questions that are being addressed today.

5
PaxDb 5.0: Curated Protein Quantification Data Suggests Adaptive Proteome Changes in Yeasts. Mol Cell Proteomics 2023; 22:100640. [PMID: 37659604 PMCID: PMC10551891 DOI: 10.1016/j.mcpro.2023.100640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2023] [Revised: 08/25/2023] [Accepted: 08/30/2023] [Indexed: 09/04/2023]
Abstract
The "Protein Abundances Across Organisms" database (PaxDb) is an integrative metaresource dedicated to protein abundance levels, in tissue-specific or whole-organism proteomes. PaxDb focuses on computing best-estimate abundances for proteins in normal/healthy contexts and expresses abundance values for each protein in "parts per million" in relation to all other protein molecules in the cell. The uniform data reprocessing, quality scoring, and integrated orthology relations have made PaxDb one of the preferred tools for comparisons between individual datasets, tissues, or organisms. In describing the latest version 5.0 of PaxDb, we particularly emphasize the data integration from various types of raw data and how we expanded the number of organisms and tissue groups as well as the proteome coverage. The current collection of PaxDb includes 831 original datasets from 170 species, including 22 Archaea, 81 Bacteria, and 67 Eukaryota. Apart from detailing the data update, we also present a comparative analysis of the human proteome subset of PaxDb against the two most widely used human proteome data resources: Human Protein Atlas and Genotype-Tissue Expression. Lastly, through our protein abundance data, we reveal an evolutionary trend in the usage of sulfur-containing amino acids in the proteomes of Fungi.
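The "parts per million" unit mentioned in this abstract can be illustrated with a minimal sketch. It assumes that a ppm value is simply each protein's share of the summed abundance, scaled to one million; the function name `to_ppm` and the example protein identifiers and values are hypothetical and are not taken from PaxDb, whose actual pipeline also involves reprocessing and quality scoring not shown here.

```python
def to_ppm(abundances):
    """Rescale raw abundance estimates so that they sum to one million.

    `abundances` maps a protein identifier to a non-negative raw abundance.
    This is an illustrative normalization only, not PaxDb's pipeline.
    """
    total = sum(abundances.values())
    if total <= 0:
        raise ValueError("total abundance must be positive")
    return {protein: value / total * 1_000_000
            for protein, value in abundances.items()}


# Hypothetical raw abundances for three proteins in one dataset.
raw = {"P1": 50.0, "P2": 30.0, "P3": 20.0}
ppm = to_ppm(raw)
```

Because every dataset is rescaled to the same total, values expressed this way can be compared across datasets, tissues, or organisms, which is the use case the abstract describes.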

6
Chemotaxis and autoinducer-2 signalling mediate colonization and contribute to co-existence of Escherichia coli strains in the murine gut. Nat Microbiol 2023; 8:204-217. [PMID: 36624229 DOI: 10.1038/s41564-022-01286-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 06/03/2022] [Accepted: 11/09/2022] [Indexed: 01/11/2023]
Abstract
Bacteria communicate and coordinate their behaviour at the intra- and interspecies levels by producing and sensing diverse extracellular small molecules called autoinducers. Autoinducer 2 (AI-2) is produced and detected by a variety of bacteria and thus plays an important role in interspecies communication and chemotaxis. Although AI-2 is a major autoinducer molecule present in the mammalian gut and can influence the composition of the murine gut microbiota, its role in bacteria-bacteria and bacteria-host interactions during gut colonization remains unclear. Combining competitive infections in C57BL/6 mice with microscopy and bioinformatic approaches, we show that chemotaxis (cheY) and AI-2 signalling (via lsrB) promote gut colonization by Escherichia coli, which is in turn connected to the ability of the bacteria to utilize fructoselysine (frl operon). We further show that the genomic diversity of E. coli strains with respect to AI-2 signalling allows ecological niche segregation and stable co-existence of different E. coli strains in the mammalian gut.

7
proGenomes3: approaching one million accurately and consistently annotated high-quality prokaryotic genomes. Nucleic Acids Res 2023; 51:D760-D766. [PMID: 36408900 PMCID: PMC9825469 DOI: 10.1093/nar/gkac1078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2022] [Revised: 10/15/2022] [Accepted: 11/07/2022] [Indexed: 11/22/2022]
Abstract
The interpretation of genomic, transcriptomic and other microbial 'omics data is highly dependent on the availability of well-annotated genomes. As the number of publicly available microbial genomes continues to increase exponentially, the need for quality control and consistent annotation is becoming critical. We present proGenomes3, a database of 907 388 high-quality genomes containing 4 billion genes that passed stringent criteria and have been consistently annotated using multiple functional and taxonomic databases including mobile genetic elements and biosynthetic gene clusters. proGenomes3 encompasses 41 171 species-level clusters, defined based on universal single copy marker genes, for which pan-genomes and contextual habitat annotations are provided. The database is available at http://progenomes.embl.de/.

8
CanIsoNet: a database to study the functional impact of isoform switching events in diseases. BIOINFORMATICS ADVANCES 2023; 3:vbad050. [PMID: 37123454 PMCID: PMC10133402 DOI: 10.1093/bioadv/vbad050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2022] [Revised: 03/30/2023] [Accepted: 04/13/2023] [Indexed: 05/02/2023]
Abstract
Motivation: Alternative splicing, as an essential regulatory mechanism in normal mammalian cells, is frequently disturbed in cancer and other diseases. Switches in the expression of most dominant alternative isoforms can alter protein interaction networks of associated genes giving rise to disease and disease progression. Here, we present CanIsoNet, a database to view, browse and search isoform switching events in diseases. CanIsoNet is the first webserver that incorporates isoform expression data with STRING interaction networks and ClinVar annotations to predict the pathogenic impact of isoform switching events in various diseases.
Results: Data in CanIsoNet can be browsed by disease or searched by genes or isoforms in annotation-rich data tables. Various annotations for 11 811 isoforms and 14 357 unique isoform switching events across 31 different disease types are available. The network density score for each disease-specific isoform, PFAM domain IDs of disrupted interactions, domain structure visualization of transcripts and expression data of switched isoforms for each sample are given. Additionally, the genes annotated in ClinVar are highlighted in interactive interaction networks.
Availability and implementation: CanIsoNet is freely available at https://www.caniso.net. The source codes can be found under a Creative Commons License at https://github.com/kahramanlab/CanIsoNet_Web.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.

9

10
Author Correction: Genomic basis for RNA alterations in cancer. Nature 2023; 614:E37. [PMID: 36697831 PMCID: PMC9931574 DOI: 10.1038/s41586-022-05596-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2023]

11
Abstract
Biological networks are often used to represent complex biological systems, which can contain several types of entities. Analysis and visualization of such networks is supported by the Cytoscape software tool and its many apps. While earlier versions of stringApp focused on providing intraspecies protein-protein interactions from the STRING database, the new stringApp 2.0 greatly improves the support for heterogeneous networks. Here, we highlight new functionality that makes it possible to create networks that contain proteins and interactions from STRING as well as other biological entities and associations from other sources. We exemplify this by complementing a published SARS-CoV-2 interactome with interactions from STRING. We have also extended stringApp with new data and query functionality for protein-protein interactions between eukaryotic parasites and their hosts. We show how this can be used to retrieve and visualize a cross-species network for a malaria parasite, its host, and its vector. Finally, the latest stringApp version has an improved user interface, allows retrieval of both functional associations and physical interactions, and supports group-wise enrichment analysis of different parts of a network to aid biological interpretation. stringApp is freely available at https://apps.cytoscape.org/apps/stringapp.

12
Author Correction: Pathway and network analysis of more than 2500 whole cancer genomes. Nat Commun 2022; 13:7566. [PMID: 36481610 PMCID: PMC9732045 DOI: 10.1038/s41467-022-32334-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/13/2022]

13
Systematic assessment of pathway databases, based on a diverse collection of user-submitted experiments. Brief Bioinform 2022; 23:6695266. [PMID: 36088548 PMCID: PMC9487593 DOI: 10.1093/bib/bbac355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2022] [Revised: 07/13/2022] [Accepted: 07/30/2022] [Indexed: 11/14/2022]
Abstract
A knowledge-based grouping of genes into pathways or functional units is essential for describing and understanding cellular complexity. However, it is not always clear a priori how and at what level of specificity functionally interconnected genes should be partitioned into pathways, for a given application. Here, we assess and compare nine existing and two conceptually novel functional classification systems, with respect to their discovery power and generality in gene set enrichment testing. We base our assessment on a collection of nearly 2000 functional genomics datasets provided by users of the STRING database. With these real-life and diverse queries, we assess which systems typically provide the most specific and complete enrichment results. We find many structural and performance differences between classification systems. Overall, the well-established, hierarchically organized pathway annotation systems yield the best enrichment performance, despite covering substantial parts of the human genome in general terms only. On the other hand, the more recent unsupervised annotation systems perform strongest in understudied areas and organisms, and in detecting more specific pathways, albeit with less informative labels.

14
Sequence-Specific Features of Short Double-Strand, Blunt-End RNAs Have RIG-I- and Type 1 Interferon-Dependent or -Independent Anti-Viral Effects. Viruses 2022; 14:1407. [PMID: 35891387 PMCID: PMC9322957 DOI: 10.3390/v14071407] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/09/2022] [Revised: 06/17/2022] [Accepted: 06/23/2022] [Indexed: 02/08/2023]
Abstract
Pathogen-associated molecular patterns, including cytoplasmic DNA and double-strand (ds)RNA, trigger the induction of interferon (IFN) and antiviral states protecting cells and organisms from pathogens. Here we discovered that the transfection of human airway cell lines or non-transformed fibroblasts with 24mer dsRNA mimicking the cellular micro-RNA (miR)29b-1* gives strong anti-viral effects against human adenovirus type 5 (AdV-C5), influenza A virus X31 (H3N2), and SARS-CoV-2. These anti-viral effects required blunt-end complementary RNA strands and were not elicited by corresponding single-strand RNAs. dsRNA miR-29b-1* but not randomized miR-29b-1* mimics induced IFN-stimulated gene expression, and downregulated cell adhesion and cell cycle genes, as indicated by transcriptomics and IFN-I responsive Mx1-promoter activity assays. The inhibition of AdV-C5 infection with miR-29b-1* mimic depended on the IFN-alpha receptor 2 (IFNAR2) and the RNA-helicase retinoic acid-inducible gene I (RIG-I) but not cytoplasmic RNA sensors MDA5 and ZNFX1 or MyD88/TRIF adaptors. The antiviral effects of miR-29b-1* were independent of a central AUAU-motif inducing dsRNA bending, as mimics with disrupted AUAU-motif were anti-viral in normal but not RIG-I knock-out (KO) or IFNAR2-KO cells. The screening of a library of scrambled short dsRNA sequences also identified anti-viral mimics functioning independently of RIG-I and IFNAR2, thus exemplifying the diverse anti-viral mechanisms of short blunt-end dsRNAs.

15
Abstract
Acidobacteria occur in a large variety of ecosystems worldwide and are particularly abundant and highly diverse in soils. In spite of their diversity, only a few species have been characterized to date, which makes Acidobacteria one of the most poorly understood phyla among the domain Bacteria. We used a culture-independent niche modeling approach to elucidate ecological adaptations and their evolution for 4,154 operational taxonomic units (OTUs) of Acidobacteria across 150 different, comprehensively characterized grassland soils in Germany. Using the relative abundances of their 16S rRNA gene transcripts, the responses of active OTUs along gradients of 41 environmental variables were modeled using hierarchical logistic regression (HOF), which allowed us to determine values for optimum activity for each variable (niche optima). By linking 16S rRNA transcripts to the phylogeny of full 16S rRNA gene sequences, we could trace the evolution of the different ecological adaptations during the diversification of Acidobacteria. This approach revealed a pronounced ecological diversification even among acidobacterial sister clades. Although the evolution of habitat adaptation was mainly cladogenic, it was disrupted by recurrent events of convergent evolution that resulted in frequent habitat switching within individual clades. Our findings indicate that the high diversity of soil acidobacterial communities is largely sustained by differential habitat adaptation even at the level of closely related species. A comparison of niche optima of individual OTUs with the phenotypic properties of their cultivated representatives showed that our niche modeling approach (1) correctly predicts those physiological properties that have been determined for cultivated species of Acidobacteria but (2) also provides ample information on ecological adaptations that cannot be inferred from standard taxonomic descriptions of bacterial isolates. This novel information on specific adaptations of not-yet-cultivated Acidobacteria can therefore guide future cultivation trials and will likely increase their cultivation success.

16
Probing Isoform Switching Events in Various Cancer Types: Lessons From Pan-Cancer Studies. Front Mol Biosci 2021; 8:726902. [PMID: 34888349 PMCID: PMC8650491 DOI: 10.3389/fmolb.2021.726902] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/17/2021] [Accepted: 11/01/2021] [Indexed: 12/03/2022]
Abstract
Alternative splicing is an essential regulatory mechanism for gene expression in mammalian cells contributing to protein, cellular, and species diversity. In cancer, alternative splicing is frequently disturbed, leading to changes in the expression of alternatively spliced protein isoforms. Advances in sequencing technologies and analysis methods led to new insights into the extent and functional impact of disturbed alternative splicing events. In this review, we give a brief overview of the molecular mechanisms driving alternative splicing, highlight the function of alternative splicing in healthy tissues and describe how alternative splicing is disrupted in cancer. We summarize currently available computational tools for analyzing differential transcript usage, isoform switching events, and the pathogenic impact of cancer-specific splicing events. Finally, the strategies of three recent pan-cancer studies on isoform switching events are compared. Their methodological similarities and discrepancies are highlighted and lessons learned from the comparison are listed. We hope that our assessment will lead to new and more robust methods for cancer-specific transcript detection and help to produce more accurate functional impact predictions of isoform switching events.

17
Correction to 'The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets'. Nucleic Acids Res 2021; 49:10800. [PMID: 34530444 PMCID: PMC8501959 DOI: 10.1093/nar/gkab835] [Citation(s) in RCA: 183] [Impact Index Per Article: 61.0] [Indexed: 11/19/2022]

18
The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res 2021; 49:D605-D612. [PMID: 33237311 PMCID: PMC7779004 DOI: 10.1093/nar/gkaa1074] [Citation(s) in RCA: 3471] [Impact Index Per Article: 1157.0] [Received: 09/14/2020] [Revised: 10/20/2020] [Accepted: 11/23/2020] [Indexed: 12/19/2022]
Abstract
Cellular life depends on a complex web of functional associations between biomolecules. Among these associations, protein–protein interactions are particularly important due to their versatility, specificity and adaptability. The STRING database aims to integrate all known and predicted associations between proteins, including both physical interactions as well as functional associations. To achieve this, STRING collects and scores evidence from a number of sources: (i) automated text mining of the scientific literature, (ii) databases of interaction experiments and annotated complexes/pathways, (iii) computational interaction predictions from co-expression and from conserved genomic context and (iv) systematic transfers of interaction evidence from one organism to another. STRING aims for wide coverage; the upcoming version 11.5 of the resource will contain more than 14 000 organisms. In this update paper, we describe changes to the text-mining system, a new scoring-mode for physical interactions, as well as extensive user interface features for customizing, extending and sharing protein networks. In addition, we describe how to query STRING with genome-wide, experimental data, including the automated detection of enriched functionalities and potential biases in the user's query data. The STRING resource is available online, at https://string-db.org/.

19
Pathogenic impact of transcript isoform switching in 1,209 cancer samples covering 27 cancer types using an isoform-specific interaction network. Sci Rep 2020; 10:14453. [PMID: 32879328 PMCID: PMC7468103 DOI: 10.1038/s41598-020-71221-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 03/27/2020] [Accepted: 07/17/2020] [Indexed: 01/01/2023]
Abstract
Under normal conditions, cells of almost all tissue types express the same predominant canonical transcript isoform at each gene locus. In cancer, however, splicing regulation is often disturbed, leading to cancer-specific switches in the most dominant transcripts (MDT). To address the pathogenic impact of these switches, we have analyzed isoform-specific protein-protein interaction disruptions in 1,209 cancer samples covering 27 different cancer types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) project of the International Cancer Genomics Consortium (ICGC). Our study revealed large variations in the number of cancer-specific MDT (cMDT) with the highest frequency in cancers of female reproductive organs. Interestingly, in contrast to the mutational load, cancers arising from the same primary tissue had a similar number of cMDT. Some cMDT were found in 100% of all samples in a cancer type, making them candidates for diagnostic biomarkers. cMDT tend to be located at densely populated network regions where they disrupt protein interactions in the proximity of pathogenic cancer genes. A gene ontology enrichment analysis showed that these disruptions occurred mostly in protein translation and RNA splicing pathways. Interestingly, samples with mutations in the spliceosomal complex tend to have a higher number of cMDT, while other transcript expressions correlated with mutations in non-coding splice-site and promoter regions of their genes. This work demonstrates for the first time the large extent of cancer-specific alterations in alternative splicing for 27 different cancer types. It highlights distinct and common patterns of cMDT and suggests novel pathogenic transcripts and markers that induce large network disruptions in cancers.
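The idea of a switch in the most dominant transcript can be sketched in a few lines. The sketch below assumes the simplest possible reading, in which the MDT of a gene is the isoform with the highest expression and a switch is any gene whose MDT differs between two conditions. The abstract does not state the study's actual criteria (for example expression thresholds or cohort-level comparisons), so the function names, data layout, and values here are hypothetical.

```python
def most_dominant_transcripts(expression):
    """Return {gene: transcript} for the highest-expressed isoform per gene.

    `expression` maps (gene, transcript) pairs to an expression value.
    Simplified illustration; not the criteria used in the study.
    """
    best = {}
    for (gene, transcript), value in expression.items():
        if gene not in best or value > best[gene][1]:
            best[gene] = (transcript, value)
    return {gene: transcript for gene, (transcript, _) in best.items()}


def switched_genes(normal, tumour):
    """Return {gene: (normal_mdt, tumour_mdt)} where the MDT differs."""
    mdt_normal = most_dominant_transcripts(normal)
    mdt_tumour = most_dominant_transcripts(tumour)
    shared = mdt_normal.keys() & mdt_tumour.keys()
    return {gene: (mdt_normal[gene], mdt_tumour[gene])
            for gene in shared
            if mdt_normal[gene] != mdt_tumour[gene]}


# Hypothetical expression values for two genes with two isoforms each.
normal = {("G1", "T1"): 10.0, ("G1", "T2"): 2.0,
          ("G2", "T3"): 5.0, ("G2", "T4"): 1.0}
tumour = {("G1", "T1"): 1.0, ("G1", "T2"): 8.0,
          ("G2", "T3"): 6.0, ("G2", "T4"): 2.0}
switches = switched_genes(normal, tumour)
```

In this toy input only gene G1 changes its dominant isoform, so it is the only candidate switch; a real analysis would additionally need to decide how large and how consistent a difference must be before it counts.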

20
Fermentation Ability of Gut Microbiota of Wild Japanese Macaques in the Highland and Lowland Yakushima: In Vitro Fermentation Assay and Genetic Analyses. MICROBIAL ECOLOGY 2020; 80:459-474. [PMID: 32328670 DOI: 10.1007/s00248-020-01515-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/18/2020] [Accepted: 04/13/2020] [Indexed: 06/11/2023]
Abstract
Wild Japanese macaques (Macaca fuscata Blyth) living in the highland and lowland areas of Yakushima are known to have different diets, with highland individuals consuming more leaves. We aim to clarify whether and how these differences in diet are also reflected by gut microbial composition and fermentation ability. Therefore, we conduct an in vitro fermentation assay using fresh feces from macaques as inoculum and dry leaf powder of Eurya japonica Thunb. as a substrate. Fermentation activity was higher for feces collected in the highland, as evidenced by higher gas and butyric acid production and lower pH. Genetic analysis indicated separation of highland and lowland in terms of both community structure and function of the gut microbiota. Comparison of feces and suspension after fermentation indicated that the community structure changed during fermentation, and the change was larger for lowland samples. Analysis of the 16S rRNA V3-V4 barcoding region of the gut microbiota showed that community structure was clearly clustered between the two areas. Furthermore, metagenomic analysis indicated separation by gene and pathway abundance patterns. Two pathways (glycogen biosynthesis I and D-galacturonate degradation I) were enriched in lowland samples, possibly related to the fruit-eating lifestyle in the lowland. Overall, we demonstrated that the more leaf-eating highland Japanese macaques harbor gut microbiota with higher leaf fermentation ability compared with the more fruit-eating lowland ones. Broad, non-specific taxonomic and functional gut microbiome differences suggest that this pattern may be driven by a complex interplay between many taxa and pathways rather than single functional traits.

21
eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res 2019; 47:D309-D314. [PMID: 30418610 PMCID: PMC6324079 DOI: 10.1093/nar/gky1085] [Citation(s) in RCA: 1871] [Impact Index Per Article: 467.8] [Received: 09/15/2018] [Accepted: 10/26/2018] [Indexed: 11/25/2022]
Abstract
eggNOG is a public database of orthology relationships, gene evolutionary histories and functional annotations. Here, we present version 5.0, featuring a major update of the underlying genome sets, which have been expanded to 4445 representative bacteria and 168 archaea derived from 25 038 genomes, as well as 477 eukaryotic organisms and 2502 viral proteomes that were selected for diversity and filtered by genome quality. In total, 4.4M orthologous groups (OGs) distributed across 379 taxonomic levels were computed together with their associated sequence alignments, phylogenies, HMM models and functional descriptors. Precomputed evolutionary analysis provides fine-grained resolution of duplication/speciation events within each OG. Our benchmarks show that, despite doubling the number of genomes, the quality of orthology assignments and functional annotations (80% coverage) has persisted without significant changes across this update. Finally, we improved eggNOG online services for fast functional annotation and orthology prediction of custom genomics or metagenomics datasets. All precomputed data are publicly available for downloading or via API queries at http://eggnog.embl.de
|
22
|
Abstract
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
|
23
|
Disentangling the impact of environmental and phylogenetic constraints on prokaryotic within-species diversity. ISME JOURNAL 2020; 14:1247-1259. [PMID: 32047279 PMCID: PMC7174425 DOI: 10.1038/s41396-020-0600-z] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 08/02/2019] [Revised: 01/21/2020] [Accepted: 01/27/2020] [Indexed: 12/04/2022]
Abstract
Microbial organisms inhabit virtually all environments and encompass a vast biological diversity. The pangenome concept aims to facilitate an understanding of diversity within defined phylogenetic groups. Hence, pangenomes are increasingly used to characterize the strain diversity of prokaryotic species. To understand the interdependence of pangenome features (such as the number of core and accessory genes) and to study the impact of environmental and phylogenetic constraints on the evolution of conspecific strains, we computed pangenomes for 155 phylogenetically diverse species (from ten phyla) using 7,000 high-quality genomes to each of which the respective habitats were assigned. Species habitat ubiquity was associated with several pangenome features. In particular, core-genome size was more important for ubiquity than accessory genome size. In general, environmental preferences had a stronger impact on pangenome evolution than phylogenetic inertia. Environmental preferences explained up to 49% of the variance for pangenome features, compared with 18% by phylogenetic inertia. This observation was robust when the dataset was extended to 10,100 species (59 phyla). The importance of environmental preferences was further accentuated by convergent evolution of pangenome features in a given habitat type across different phylogenetic clades. For example, the soil environment promotes expansion of pangenome size, while host-associated habitats lead to its reduction. Taken together, we explored the global principles of pangenome evolution, quantified the influence of habitat, and phylogenetic inertia on the evolution of pangenomes and identified criteria governing species ubiquity and habitat specificity.
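The core/accessory distinction used in the abstract above can be illustrated with a minimal sketch. It assumes each strain is already represented as a set of gene-family identifiers; the function name `split_pangenome`, the `core_fraction` threshold, and the toy strains are hypothetical and are not taken from the paper's pipeline.

```python
from collections import Counter


def split_pangenome(strain_genes, core_fraction=1.0):
    """Split gene families into core and accessory sets.

    strain_genes: dict mapping strain name -> set of gene-family IDs.
    core_fraction: minimum fraction of strains that must carry a family
    for it to count as core (1.0 = strict core; this threshold is an
    illustrative assumption, not the paper's definition).
    """
    n_strains = len(strain_genes)
    if n_strains == 0:
        raise ValueError("need at least one strain")
    # Count in how many strains each gene family occurs.
    counts = Counter(g for genes in strain_genes.values() for g in set(genes))
    core = {g for g, c in counts.items() if c / n_strains >= core_fraction}
    accessory = set(counts) - core
    return core, accessory


# Toy example with three hypothetical strains.
toy = {
    "strainA": {"g1", "g2", "g3"},
    "strainB": {"g1", "g2", "g4"},
    "strainC": {"g1", "g2"},
}
core, accessory = split_pangenome(toy)
print(sorted(core), sorted(accessory))
```

Pangenome size here would simply be `len(core) + len(accessory)`; relaxing `core_fraction` below 1.0 gives a "soft core" that tolerates incomplete genomes.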
|
24
|
Abstract
The catalog of cancer driver mutations in protein-coding genes has greatly expanded in the past decade. However, non-coding cancer driver mutations are less well-characterized and only a handful of recurrent non-coding mutations, most notably TERT promoter mutations, have been reported. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumor types, we perform multi-faceted pathway and network analyses of non-coding mutations across 2583 whole cancer genomes from 27 tumor types compiled by the PCAWG project. This work was motivated by the success of pathway and network analyses in prioritizing rare mutations in protein-coding genes. While few non-coding genomic elements are recurrently mutated in this cohort, we identify 93 genes harboring non-coding mutations that cluster into several modules of interacting proteins. Among these are promoter mutations associated with reduced mRNA expression in TP53, TLE4, and TCF4. We find that biological processes had variable proportions of coding and non-coding mutations, with chromatin remodeling and proliferation pathways altered primarily by coding mutations, while developmental pathways, including Wnt and Notch, are altered by both coding and non-coding mutations. RNA splicing is primarily altered by non-coding mutations in this cohort, and samples containing non-coding mutations in well-known RNA splicing factors exhibit gene expression signatures similar to those of samples with coding mutations in these genes. These analyses contribute a new repertoire of possible cancer genes and mechanisms that are altered by non-coding mutations and offer insights into additional cancer vulnerabilities that can be investigated for potential therapeutic treatments.
|
25
|
Abstract
Cancer is driven by genetic change, and the advent of massively parallel sequencing has enabled systematic documentation of this variation at the whole-genome scale [1-3]. Here we report the integrative analysis of 2,658 whole-cancer genomes and their matching normal tissues across 38 tumour types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). We describe the generation of the PCAWG resource, facilitated by international data sharing using compute clouds. On average, cancer genomes contained 4-5 driver mutations when combining coding and non-coding genomic elements; however, in around 5% of cases no drivers were identified, suggesting that cancer driver discovery is not yet complete. Chromothripsis, in which many clustered structural variants arise in a single catastrophic event, is frequently an early event in tumour evolution; in acral melanoma, for example, these events precede most somatic point mutations and affect several cancer-associated genes simultaneously. Cancers with abnormal telomere maintenance often originate from tissues with low replicative activity and show several mechanisms of preventing telomere attrition to critical levels. Common and rare germline variants affect patterns of somatic mutation, including point mutations, structural variants and somatic retrotransposition. A collection of papers from the PCAWG Consortium describes non-coding mutations that drive cancer beyond those in the TERT promoter [4]; identifies new signatures of mutational processes that cause base substitutions, small insertions and deletions and structural variation [5,6]; analyses timings and patterns of tumour evolution [7]; describes the diverse transcriptional consequences of somatic mutation on splicing, expression levels, fusion genes and promoter activity [8,9]; and evaluates a range of more-specialized features of cancer genomes [8,10-18].
|
26
|
Abstract
The discovery of drivers of cancer has traditionally focused on protein-coding genes [1-4]. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium [5] of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers [6,7], raise doubts about others and identify novel candidates, including point mutations in the 5' region of TP53, in the 3' untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.
|
27
|
Rapid Inference of Direct Interactions in Large-Scale Ecological Networks from Heterogeneous Microbial Sequencing Data. Cell Syst 2019; 9:286-296.e8. [PMID: 31542415 DOI: 10.1016/j.cels.2019.08.002] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 01/04/2019] [Revised: 05/16/2019] [Accepted: 07/31/2019] [Indexed: 12/27/2022]
Abstract
The availability of large-scale metagenomic sequencing data can facilitate the understanding of microbial ecosystems in unprecedented detail. However, current computational methods for predicting ecological interactions are hampered by insufficient statistical resolution and limited computational scalability. They also do not integrate metadata, which can reduce the interpretability of predicted ecological patterns. Here, we present FlashWeave, a computational approach based on a flexible Probabilistic Graphical Model framework that integrates metadata and predicts direct microbial interactions from heterogeneous microbial abundance data sets with hundreds of thousands of samples. FlashWeave outperforms state-of-the-art methods on diverse benchmarking challenges in terms of runtime and accuracy. We use FlashWeave to analyze a cross-study data set of 69,818 publicly available human gut samples and produce, to the best of our knowledge, the largest and most diverse network of predicted, direct gastrointestinal microbial interactions to date. FlashWeave is freely available for download here: https://github.com/meringlab/FlashWeave.jl.
|
28
|
Growth-restricting effects of siRNA transfections: a largely deterministic combination of off-target binding and hybridization-independent competition. Nucleic Acids Res 2019; 46:9309-9320. [PMID: 30215772 PMCID: PMC6182159 DOI: 10.1093/nar/gky798] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 12/20/2017] [Accepted: 09/10/2018] [Indexed: 01/17/2023] Open
Abstract
Perturbation of gene expression by means of synthetic small interfering RNAs (siRNAs) is a powerful way to uncover gene function. However, siRNA technology suffers from sequence-specific off-target effects and from limitations in knock-down efficiency. In this study, we assess a further problem: unintended effects of siRNA transfections on cellular fitness/proliferation. We show that the nucleotide compositions of siRNAs at specific positions have reproducible growth-restricting effects on mammalian cells in culture. This is likely distinct from hybridization-dependent off-target effects, since each nucleotide residue is seen to be acting independently and additively. The effect is robust and reproducible across different siRNA libraries and also across various cell lines, including human and mouse cells. Analyzing the growth inhibition patterns in correlation to the nucleotide sequence of the siRNAs allowed us to build a predictor that can estimate growth-restricting effects for any arbitrary siRNA sequence. Competition experiments with co-transfected siRNAs further suggest that the growth-restricting effects might be linked to an oversaturation of the cellular miRNA machinery, thus disrupting endogenous miRNA functions at large. We caution that competition between siRNA molecules could complicate the interpretation of double-knockdown or epistasis experiments, and potential interactions with endogenous miRNAs can be a factor when assaying cell growth or viability phenotypes.
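The statement that each nucleotide acts independently and additively suggests a position-specific additive scoring model. The sketch below shows that general idea only; the function name `additive_growth_score`, the four-position toy sequence, and all weight values are made up for illustration and are not the published predictor or its fitted parameters.

```python
def additive_growth_score(sirna, weights, baseline=0.0):
    """Score an siRNA as the sum of independent per-position effects.

    sirna: nucleotide string (e.g. "ACGU").
    weights: list with one dict per position, mapping nucleotide -> effect.
             Nucleotides missing from a dict contribute 0.0.
    baseline: constant offset (hypothetical; 0.0 by default).
    """
    if len(sirna) != len(weights):
        raise ValueError("sequence length must match number of weight positions")
    return baseline + sum(
        weights[i].get(nt, 0.0) for i, nt in enumerate(sirna.upper())
    )


# Hypothetical weights for a toy 4-nt sequence (real siRNAs are longer).
toy_weights = [
    {"A": 0.5, "U": -0.5},
    {"C": -0.25},
    {"G": 0.0},
    {"U": 0.125},
]
print(additive_growth_score("ACGU", toy_weights))
```

Because the model is additive, changing the nucleotide at one position shifts the score by that position's weight difference regardless of the rest of the sequence, which is the independence property the abstract describes.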
|
29
|
Cross-Regulation between TDP-43 and Paraspeckles Promotes Pluripotency-Differentiation Transition. Mol Cell 2019; 74:951-965.e13. [PMID: 31047794 PMCID: PMC6561722 DOI: 10.1016/j.molcel.2019.03.041] [Citation(s) in RCA: 69] [Impact Index Per Article: 13.8] [Received: 04/29/2018] [Revised: 02/12/2019] [Accepted: 03/28/2019] [Indexed: 01/22/2023]
Abstract
RNA-binding proteins (RBPs) and long non-coding RNAs (lncRNAs) are key regulators of gene expression, but their joint functions in coordinating cell fate decisions are poorly understood. Here we show that the expression and activity of the RBP TDP-43 and the long isoform of the lncRNA Neat1, the scaffold of the nuclear compartment "paraspeckles," are reciprocal in pluripotent and differentiated cells because of their cross-regulation. In pluripotent cells, TDP-43 represses the formation of paraspeckles by enhancing the polyadenylated short isoform of Neat1. TDP-43 also promotes pluripotency by regulating alternative polyadenylation of transcripts encoding pluripotency factors, including Sox2, which partially protects its 3' UTR from miR-21-mediated degradation. Conversely, paraspeckles sequester TDP-43 and other RBPs from mRNAs and promote exit from pluripotency and embryonic patterning in the mouse. We demonstrate that cross-regulation between TDP-43 and Neat1 is essential for their efficient regulation of a broad network of genes and, therefore, of pluripotency and differentiation.
|
30
|
Analysis of the Human Kinome and Phosphatome by Mass Cytometry Reveals Overexpression-Induced Effects on Cancer-Related Signaling. Mol Cell 2019; 74:1086-1102.e5. [PMID: 31101498 PMCID: PMC6561723 DOI: 10.1016/j.molcel.2019.04.021] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 05/01/2018] [Revised: 02/06/2019] [Accepted: 04/11/2019] [Indexed: 12/24/2022]
Abstract
Kinase and phosphatase overexpression drives tumorigenesis and drug resistance. We previously developed a mass-cytometry-based single-cell proteomics approach that enables quantitative assessment of overexpression effects on cell signaling. Here, we applied this approach in a human kinome- and phosphatome-wide study to assess how 649 individually overexpressed proteins modulated cancer-related signaling in HEK293T cells in an abundance-dependent manner. Based on these data, we expanded the functional classification of human kinases and phosphatases and showed that the overexpression effects include non-catalytic roles. We detected 208 previously unreported signaling relationships. The signaling dynamics analysis indicated that the overexpression of ERK-specific phosphatases sustains proliferative signaling. This suggests a phosphatase-driven mechanism of cancer progression. Moreover, our analysis revealed a drug-resistant mechanism through which overexpression of tyrosine kinases, including SRC, FES, YES1, and BLK, induced MEK-independent ERK activation in melanoma A375 cells. These proteins could predict drug sensitivity to BRAF-MEK concurrent inhibition in cells carrying BRAF mutations.
|
31
|
Protein tyrosine phosphatase non-receptor type 22 modulates colitis in a microbiota-dependent manner. J Clin Invest 2019; 129:2527-2541. [PMID: 31107248 DOI: 10.1172/jci123263] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 06/28/2018] [Accepted: 04/02/2019] [Indexed: 12/16/2022] Open
Abstract
The gut microbiota is crucial for our health, and well-balanced interactions between the host's immune system and the microbiota are essential to prevent chronic intestinal inflammation, as observed in inflammatory bowel diseases (IBD). A variant in protein tyrosine phosphatase non-receptor type 22 (PTPN22) is associated with reduced risk of developing IBD, but promotes the onset of autoimmune disorders. While the role of PTPN22 in modulating molecular pathways involved in IBD pathogenesis is well studied, its impact on shaping the intestinal microbiota has not been addressed in depth. Here, we demonstrate that mice carrying the PTPN22 variant (619W mice) were protected from acute dextran sulfate sodium (DSS) colitis, but suffered from pronounced inflammation upon chronic DSS treatment. The basal microbiota composition was distinct between genotypes, and DSS-induced dysbiosis was milder in 619W mice than in WT littermates. Transfer of microbiota from 619W mice after the first DSS cycle into treatment-naive 619W mice promoted colitis, indicating that changes in microbial composition enhanced chronic colitis in those animals. This indicates that presence of the PTPN22 variant affects intestinal inflammation by modulating the host's response to the intestinal microbiota.
|
32
|
Tree reconciliation combined with subsampling improves large scale inference of orthologous group hierarchies. BMC Bioinformatics 2019; 20:228. [PMID: 31060495 PMCID: PMC6501302 DOI: 10.1186/s12859-019-2828-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/19/2018] [Accepted: 04/17/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND An orthologous group (OG) comprises a set of orthologous and paralogous genes that share a last common ancestor (LCA). OGs are defined with respect to a chosen taxonomic level, which delimits the position of the LCA in time to a specified speciation event. A hierarchy of OGs expands on this notion, connecting more general OGs, distant in time, to more recent, fine-grained OGs, thereby spanning multiple levels of the tree of life. Large scale inference of OG hierarchies with independently computed taxonomic levels can suffer from inconsistencies between successive levels, such as the position in time of a duplication event. This can be due to confounding genetic signal or algorithmic limitations. Importantly, inconsistencies limit the potential use of OGs for functional annotation and third-party applications. RESULTS Here we present a new methodology to ensure hierarchical consistency of OGs across taxonomic levels. To resolve an inconsistency, we subsample the protein space of the OG members and perform gene tree-species tree reconciliation for each sampling. Unlike previous approaches, by subsampling the protein space, we avoid the notoriously difficult task of accurately building and reconciling very large phylogenies. We implement the method into a high-throughput pipeline and apply it to the eggNOG database. We use independent protein domain definitions to validate its performance. CONCLUSION The presented consistency pipeline shows that, contrary to previous limitations, tree reconciliation can be a useful instrument for the construction of OG hierarchies. The key lies in the combination of sampling smaller trees and aggregating their reconciliations for robustness. Results show comparable or greater performance to previous pipelines. The code is available on GitHub at: https://github.com/meringlab/og_consistency_pipeline.
|
33
|
STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res 2019. [PMID: 30476243 DOI: 10.1093/nar/gky1131] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 03/04/2023] Open
Abstract
Proteins and their functional interactions form the backbone of the cellular machinery. Their connectivity network needs to be considered for the full understanding of biological phenomena, but the available information on protein-protein associations is incomplete and exhibits varying levels of annotation granularity and reliability. The STRING database aims to collect, score and integrate all publicly available sources of protein-protein interaction information, and to complement these with computational predictions. Its goal is to achieve a comprehensive and objective global network, including direct (physical) as well as indirect (functional) interactions. The latest version of STRING (11.0) more than doubles the number of organisms it covers, to 5090. The most important new feature is an option to upload entire, genome-wide datasets as input, allowing users to visualize subsets as interaction networks and to perform gene-set enrichment analysis on the entire input. For the enrichment analysis, STRING implements well-known classification systems such as Gene Ontology and KEGG, but also offers additional, new classification systems based on high-throughput text-mining as well as on a hierarchical clustering of the association network itself. The STRING resource is available online at https://string-db.org/.
|
35
|
Ecologically informed microbial biomarkers and accurate classification of mixed and unmixed samples in an extensive cross-study of human body sites. MICROBIOME 2018; 6:192. [PMID: 30355348 PMCID: PMC6201589 DOI: 10.1186/s40168-018-0565-6] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 04/20/2018] [Accepted: 09/28/2018] [Indexed: 06/02/2023]
Abstract
BACKGROUND The identification of body site-specific microbial biomarkers and their use for classification tasks have promising applications in medicine, microbial ecology, and forensics. Previous studies have characterized site-specific microbiota and shown that sample origin can be accurately predicted by microbial content. However, these studies were usually restricted to single datasets with consistent experimental methods and conditions, as well as comparatively small sample numbers. The effects of study-specific biases and statistical power on classification performance and biomarker identification thus remain poorly understood. Furthermore, reliable detection in mixtures of different body sites or with noise from environmental contamination has rarely been investigated thus far. Finally, the impact of ecological associations between microbes on biomarker discovery was usually not considered in previous work. RESULTS Here we present the analysis of one of the largest cross-study sequencing datasets of microbial communities from human body sites (15,082 samples from 57 publicly available studies). We show that training a Random Forest Classifier on this aggregated dataset increases prediction performance for body sites by 35% compared to a single-study classifier. Using simulated datasets, we further demonstrate that the source of different microbial contributions in mixtures of different body sites or with soil can be detected starting at 1% of the total microbial community. We apply a biomarker selection method that excludes indirect environmental associations driven by microbe-microbe associations, yielding a parsimonious set of highly predictive taxa including novel biomarkers and excluding many previously reported taxa. We find a considerable fraction of unclassified biomarkers ("microbial dark matter") and observe that negatively associated taxa have a surprisingly high impact on classification performance. We further detect a significant enrichment of rod-shaped, motile, and sporulating taxa for feces biomarkers, consistent with a highly competitive environment. CONCLUSIONS Our machine learning model shows strong body site classification performance, both in single-source samples and mixtures, making it promising for tasks requiring high accuracy, such as forensic applications. We report a core set of ecologically informed biomarkers, inferred across a wide range of experimental protocols and conditions, providing the most concise, general, and least biased overview of body site-associated microbes to date.
|
36
|
MAPseq: highly efficient k-mer search with confidence estimates, for rRNA sequence analysis. Bioinformatics 2018; 33:3808-3810. [PMID: 28961926 PMCID: PMC5860325 DOI: 10.1093/bioinformatics/btx517] [Citation(s) in RCA: 65] [Impact Index Per Article: 10.8] [Received: 04/13/2017] [Accepted: 08/10/2017] [Indexed: 12/02/2022] Open
Abstract
Motivation: Ribosomal RNA profiling has become crucial to studying microbial communities, but meaningful taxonomic analysis and inter-comparison of such data are still hampered by technical limitations, between-study design variability and inconsistencies between taxonomies used. Results: Here we present MAPseq, a framework for reference-based rRNA sequence analysis that is up to 30% more accurate (F½ score) and up to one hundred times faster than existing solutions, providing in a single run multiple taxonomy classifications and hierarchical operational taxonomic unit mappings, for rRNA sequences in both amplicon and shotgun sequencing strategies, and for datasets of virtually any size. Availability and implementation: Source code and binaries are freely available at https://github.com/jfmrod/mapseq. Supplementary information: Supplementary data are available at Bioinformatics online.
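As a rough illustration of what reference-based k-mer search means in general, the sketch below ranks reference sequences by the fraction of a query's k-mers they contain. It is not MAPseq's algorithm or its confidence estimate; the function names, the choice of `k=4`, and the toy sequences are all assumptions made for this example.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_reference(query, references, k=4):
    """Pick the reference sharing the largest fraction of the query's k-mers.

    references: dict mapping a label (e.g. a taxon name) -> sequence.
    Returns (label, shared_fraction); label is None if nothing is shared.
    The shared fraction is a naive similarity proxy, not a calibrated
    confidence value.
    """
    q = kmers(query, k)
    if not q:
        raise ValueError("query is shorter than k")
    best_name, best_frac = None, 0.0
    for name, ref in references.items():
        frac = len(q & kmers(ref, k)) / len(q)
        if frac > best_frac:
            best_name, best_frac = name, frac
    return best_name, best_frac


# Toy references; real rRNA references are far longer.
refs = {"taxonA": "ACGTACGTGG", "taxonB": "TTTTCCCCAA"}
print(best_reference("ACGTACGT", refs))
```

Production tools typically index the reference k-mers once rather than recomputing them per query, which is where most of the speed-up of k-mer approaches comes from.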
|
37
|
Abstract
A large proportion of biomedical research and the development of therapeutics is focused on a small fraction of the human genome. In a strategic effort to map the knowledge gaps around proteins encoded by the human genome and to promote the exploration of currently understudied, but potentially druggable, proteins, the US National Institutes of Health launched the Illuminating the Druggable Genome (IDG) initiative in 2014. In this article, we discuss how the systematic collection and processing of a wide array of genomic, proteomic, chemical and disease-related resource data by the IDG Knowledge Management Center have enabled the development of evidence-based criteria for tracking the target development level (TDL) of human proteins, which indicates a substantial knowledge deficit for approximately one out of three proteins in the human proteome. We then present spotlights on the TDL categories as well as key drug target classes, including G protein-coupled receptors, protein kinases and ion channels, which illustrate the nature of the unexplored opportunities for biomedical research and therapeutic development.
|
38
|
Abstract
This corrects the article DOI: 10.1038/nrd.2018.14.
|
39
|
High-Resolution RNA Maps Suggest Common Principles of Splicing and Polyadenylation Regulation by TDP-43. Cell Rep 2018; 19:1056-1067. [PMID: 28467899 PMCID: PMC5437728 DOI: 10.1016/j.celrep.2017.04.028] [Citation(s) in RCA: 59] [Impact Index Per Article: 9.8] [Received: 05/16/2016] [Revised: 03/06/2017] [Accepted: 04/06/2017] [Indexed: 11/05/2022] Open
Abstract
Many RNA-binding proteins (RBPs) regulate both alternative exons and poly(A) site selection. To understand their regulatory principles, we developed expressRNA, a web platform encompassing computational tools for integration of iCLIP and RNA motif analyses with RNA-seq and 3′ mRNA sequencing. This reveals at nucleotide resolution the “RNA maps” describing how the RNA binding positions of RBPs relate to their regulatory functions. We use this approach to examine how TDP-43, an RBP involved in several neurodegenerative diseases, binds around its regulated poly(A) sites. Binding close to the poly(A) site generally represses, whereas binding further downstream enhances use of the site, which is similar to TDP-43 binding around regulated exons. Our RNAmotifs2 software also identifies sequence motifs that cluster together with the binding motifs of TDP-43. We conclude that TDP-43 directly regulates diverse types of pre-mRNA processing according to common position-dependent principles. Highlights: TDP-43 regulates competing poly(A) sites in a highly position-dependent manner; expressRNA is a new platform for analysis of alternative polyadenylation and splicing; RNAmotifs2 is a cluster motif analysis platform integrated with expressRNA; regulation of pre-mRNA processing might follow common position-dependent principles.
|
41
|
Cell-wide analysis of protein thermal unfolding reveals determinants of thermostability. Science 2017; 355:eaai7825. [PMID: 28232526 DOI: 10.1126/science.aai7825] [Citation(s) in RCA: 249] [Impact Index Per Article: 35.6] [Received: 08/18/2016] [Accepted: 01/12/2017] [Indexed: 12/14/2022]
Abstract
Temperature-induced cell death is thought to be due to protein denaturation, but the determinants of thermal sensitivity of proteomes remain largely uncharacterized. We developed a structural proteomic strategy to measure protein thermostability on a proteome-wide scale and with domain-level resolution. We applied it to Escherichia coli, Saccharomyces cerevisiae, Thermus thermophilus, and human cells, yielding thermostability data for more than 8000 proteins. Our results (i) indicate that temperature-induced cellular collapse is due to the loss of a subset of proteins with key functions, (ii) shed light on the evolutionary conservation of protein and domain stability, and (iii) suggest that natively disordered proteins in a cell are less prevalent than predicted and (iv) that highly expressed proteins are stable because they are designed to tolerate translational errors that would lead to the accumulation of toxic misfolded species.
|
42
|
Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper. Mol Biol Evol 2017. [PMID: 28460117 DOI: 10.1093/molbev/msx148] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
Abstract
Orthology assignment is ideally suited for functional inference. However, because predicting orthology is computationally intensive at large scale, and most pipelines are relatively inaccessible (e.g., new assignments only available through database updates), less precise homology-based functional transfer is still the default for (meta-)genome annotation. We, therefore, developed eggNOG-mapper, a tool for functional annotation of large sets of sequences based on fast orthology assignments using precomputed clusters and phylogenies from the eggNOG database. To validate our method, we benchmarked Gene Ontology (GO) predictions against two widely used homology-based approaches: BLAST and InterProScan. Orthology filters applied to BLAST results reduced the rate of false positive assignments by 11%, and increased the ratio of experimentally validated terms recovered over all terms assigned per protein by 15%. Compared with InterProScan, eggNOG-mapper achieved similar proteome coverage and precision while predicting, on average, 41 more terms per protein and increasing the rate of experimentally validated terms recovered over total term assignments per protein by 35%. EggNOG-mapper predictions scored within the top-5 methods in the three GO categories using the CAFA2 NK-partial benchmark. Finally, we evaluated eggNOG-mapper for functional annotation of metagenomics data, yielding better performance than InterProScan. eggNOG-mapper runs ∼15× faster than BLAST and at least 2.5× faster than InterProScan. The tool is available standalone and as an online service at http://eggnog-mapper.embl.de.
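The benchmark quantity named in this abstract, the ratio of experimentally validated terms recovered over all terms assigned per protein, can be illustrated with a small sketch. This is not the authors' evaluation code: the function name, the averaging over proteins, the skipping of proteins without predictions, and the placeholder term identifiers are all assumptions made for illustration.

```python
def validated_term_ratio(predicted, validated):
    """Mean, over proteins, of (validated terms recovered) / (terms assigned).

    `predicted` and `validated` map a protein ID to a set of term IDs.
    Assumption: proteins with no assigned terms are skipped, and the
    per-protein ratios are averaged with equal weight.
    """
    ratios = []
    for protein, terms in predicted.items():
        if not terms:
            continue  # no assignments, so the ratio is undefined for this protein
        recovered = terms & validated.get(protein, set())
        ratios.append(len(recovered) / len(terms))
    return sum(ratios) / len(ratios) if ratios else 0.0


# Hypothetical toy data; the IDs are placeholders, not real GO terms.
example_predicted = {"P1": {"term_a", "term_b"}, "P2": {"term_c"}}
example_validated = {"P1": {"term_a"}, "P2": {"term_c", "term_d"}}
example_ratio = validated_term_ratio(example_predicted, example_validated)
```

In the toy data, P1 recovers one validated term out of two assigned and P2 recovers one out of one, so the averaged ratio is 0.75. A higher value means fewer assigned terms lack experimental support.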
|
43
|
Effects of oral antibiotics and isotretinoin on the murine gut microbiota. Int J Antimicrob Agents 2017; 50:342-351. [PMID: 28689869 DOI: 10.1016/j.ijantimicag.2017.03.017] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 09/27/2016] [Revised: 03/09/2017] [Accepted: 03/15/2017] [Indexed: 01/08/2023]
Abstract
Inflammatory bowel disease (IBD) may develop due to an immunogenic response to commensal gut microbiota triggered by environmental factors in the genetically susceptible host. Isotretinoin, applied in the treatment of severe acne, has been variably associated with IBD, but prior treatment with antibiotics, also associated with IBD development, confounds confirmation of this association. This study investigated the effects of doxycycline, metronidazole (frequently used in the treatment of acne and IBD, respectively) and isotretinoin on murine gut (faecal) microbiota after 2 weeks of treatment and after a 4-week recovery period. Faecal microbiota composition was assessed by 16S rRNA gene sequencing on the GS-FLX 454 platform with primers directed against the variable regions V1-V2. Doxycycline had a modest effect on bacterial richness and evenness, but had pronounced persistent and significant effects on the abundance of certain operational taxonomic units compared with the control group. In contrast, metronidazole induced a pronounced reduction in diversity after treatment, but these effects did not persist after the recovery period. This study demonstrates differential effects of antibiotics on the gut microbiota with doxycycline, unlike metronidazole, mediating long-term changes in the murine gut microbiota. Isotretinoin had no significant effect on the faecal microbiota.
|
44
|
Sputum DNA sequencing in cystic fibrosis: non-invasive access to the lung microbiome and to pathogen details. Microbiome 2017; 5:20. [PMID: 28187782 PMCID: PMC5303297 DOI: 10.1186/s40168-017-0234-1] [Citation(s) in RCA: 78] [Impact Index Per Article: 11.1] [Received: 08/01/2016] [Accepted: 01/24/2017] [Indexed: 05/17/2023]
Abstract
BACKGROUND: Cystic fibrosis (CF) is a life-threatening genetic disorder, characterized by chronic microbial lung infections due to abnormally viscous mucus secretions within airways. The clinical management of CF typically involves regular respiratory-tract cultures in order to identify pathogens and to guide treatment. However, culture-based methods can miss atypical or slow-growing microbes. Furthermore, the isolated microbes are often not classified at the strain level due to limited taxonomic resolution. RESULTS: Here, we show that untargeted metagenomic sequencing of sputum DNA can provide valuable information beyond the possibilities of culture-based diagnosis. We sequenced the sputum of six CF patients and eleven control samples (including healthy subjects and chronic obstructive pulmonary disease patients) without prior depletion of human DNA or cell size selection, thus obtaining the most unbiased and comprehensive characterization of CF respiratory tract microbes to date. We present detailed descriptions of the CF and healthy lung microbiome, reconstruct near complete pathogen genomes, and confirm that the CF lungs consistently exhibit reduced microbial diversity. Crucially, the obtained genomic sequences enabled a detailed identification of the exact pathogen strain types, when analyzed in conjunction with existing multi-locus sequence typing databases. We also detected putative pathogenicity islands and indicators of antibiotic resistance, in good agreement with independent clinical tests. CONCLUSIONS: Unbiased sputum metagenomics provides an in-depth profile of the lung pathogen microbiome, which is complementary to and more detailed than standard culture-based reporting. Furthermore, functional and taxonomic features of the dominant pathogens, including antibiotic resistance, can be deduced, supporting accurate and non-invasive clinical diagnosis.
|
45
|
RAIN: RNA-protein Association and Interaction Networks. Database: The Journal of Biological Databases and Curation 2017; 2017:baw167. [PMID: 28077569 PMCID: PMC5225963 DOI: 10.1093/database/baw167] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Revised: 11/18/2016] [Accepted: 12/05/2016] [Indexed: 12/11/2022]
Abstract
Protein association networks can be inferred from a range of resources including experimental data, literature mining and computational predictions. These types of evidence are emerging for non-coding RNAs (ncRNAs) as well. However, integration of ncRNAs into protein association networks is challenging due to data heterogeneity. Here, we present a database of ncRNA-RNA and ncRNA-protein interactions and its integration with the STRING database of protein-protein interactions. These ncRNA associations cover four organisms and have been established from curated examples, experimental data, interaction predictions and automatic literature mining. RAIN uses an integrative scoring scheme to assign a confidence score to each interaction. We demonstrate that RAIN outperforms the underlying microRNA-target predictions in inferring ncRNA interactions. RAIN can be operated through an easily accessible web interface and all interaction data can be downloaded. Database URL: http://rth.dk/resources/rain.
|
46
|
The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res 2016; 45:D362-D368. [PMID: 27924014 PMCID: PMC5210637 DOI: 10.1093/nar/gkw937] [Citation(s) in RCA: 4718] [Impact Index Per Article: 589.8] [Received: 09/15/2016] [Accepted: 10/06/2016] [Indexed: 02/06/2023]
Abstract
A system-wide understanding of cellular function requires knowledge of all functional interactions between the expressed proteins. The STRING database aims to collect and integrate this information, by consolidating known and predicted protein–protein association data for a large number of organisms. The associations in STRING include direct (physical) interactions, as well as indirect (functional) interactions, as long as both are specific and biologically meaningful. Apart from collecting and reassessing available experimental data on protein–protein interactions, and importing known pathways and protein complexes from curated databases, interaction predictions are derived from the following sources: (i) systematic co-expression analysis, (ii) detection of shared selective signals across genomes, (iii) automated text-mining of the scientific literature and (iv) computational transfer of interaction knowledge between organisms based on gene orthology. In the latest version 10.5 of STRING, the biggest changes are concerned with data dissemination: the web frontend has been completely redesigned to reduce dependency on outdated browser technologies, and the database can now also be queried from inside the popular Cytoscape software framework. Further improvements include automated background analysis of user inputs for functional enrichments, and streamlined download options. The STRING resource is available online, at http://string-db.org/.
|
47
|
IFN-γ Hinders Recovery from Mucosal Inflammation during Antibiotic Therapy for Salmonella Gut Infection. Cell Host Microbe 2016; 20:238-49. [PMID: 27453483 DOI: 10.1016/j.chom.2016.06.008] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 02/13/2016] [Revised: 05/08/2016] [Accepted: 06/14/2016] [Indexed: 12/30/2022]
Abstract
Salmonella Typhimurium (S.Tm) causes acute enteropathy resolving after 4-7 days. Strikingly, antibiotic therapy does not accelerate disease resolution. We screened for factors blocking remission using a S.Tm enterocolitis model. The antibiotic ciprofloxacin clears pathogen stool loads within 3-24 hr, while gut pathology resolves more slowly (ψ50: ∼48 hr, remission: 6-9 days). This delayed resolution is mediated by an interferon-γ (IFN-γ)-dependent response that is triggered during acute infection and continues throughout therapy. Specifically, IFN-γ production by mucosal T and NK cells retards disease resolution by maintaining signaling through the transcriptional regulator STAT1 and boosting expression of inflammatory mediators like IL-1β, TNF, and iNOS. Additionally, sustained IFN-γ fosters phagocyte accumulation and hampers antimicrobial defense mediated by IL-22 and the lectin REGIIIβ. These findings reveal a role for IFN-γ in delaying resolution of intestinal inflammation and may inform therapies for acute Salmonella enteropathy, chronic inflammatory bowel diseases, or disease resolution during antibiotic treatment.
|
48
|
Natural Genetic Variation Differentially Affects the Proteome and Transcriptome in Caenorhabditis elegans. Mol Cell Proteomics 2016; 15:1670-80. [PMID: 26944343 DOI: 10.1074/mcp.m115.052548] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 06/10/2015] [Indexed: 11/06/2022]
Abstract
Natural genetic variation is the raw material of evolution and influences disease development and progression. An important question is how this genetic variation translates into variation in protein abundance. To analyze the effects of the genetic background on gene and protein expression in the nematode Caenorhabditis elegans, we quantitatively compared the two genetically highly divergent wild-type strains N2 and CB4856. Gene expression was analyzed by microarray assays, and proteins were quantified using stable isotope labeling by amino acids in cell culture. Among all transcribed genes, we found 1,532 genes to be differentially transcribed between the two wild types. Of the total 3,238 quantified proteins, 129 proteins were significantly differentially expressed between N2 and CB4856. The differentially expressed proteins were enriched for genes that function in insulin-signaling and stress-response pathways, underlining strong divergence of these pathways in nematodes. The protein abundance of the two wild-type strains correlates more strongly than protein abundance versus transcript abundance within each wild type. Our findings indicate that in C. elegans only a fraction of the changes in protein abundance can be explained by the changes in mRNA abundance. These findings corroborate observations made across species.
|
49
|
eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res 2016; 44:D286-93. [PMID: 26582926 PMCID: PMC4702882 DOI: 10.1093/nar/gkv1248] [Citation(s) in RCA: 1338] [Impact Index Per Article: 167.3] [Received: 09/14/2015] [Revised: 10/30/2015] [Accepted: 11/02/2015] [Indexed: 01/19/2023]
Abstract
eggNOG is a public resource that provides Orthologous Groups (OGs) of proteins at different taxonomic levels, each with integrated and summarized functional annotations. Developments since the latest public release include changes to the algorithm for creating OGs across taxonomic levels, making nested groups hierarchically consistent. This allows for a better propagation of functional terms across nested OGs and led to the novel annotation of 95 890 previously uncharacterized OGs, increasing overall annotation coverage from 67% to 72%. The functional annotations of OGs have been expanded to also provide Gene Ontology terms, KEGG pathways and SMART/Pfam domains for each group. Moreover, eggNOG now provides pairwise orthology relationships within OGs based on analysis of phylogenetic trees. We have also incorporated a framework for quickly mapping novel sequences to OGs based on precomputed HMM profiles. Finally, eggNOG version 4.5 incorporates a novel data set spanning 2605 viral OGs, covering 5228 proteins from 352 viral proteomes. All data are accessible for bulk downloading, as a web-service, and through a completely redesigned web interface. The new access points provide faster searches and a number of new browsing and visualization capabilities, facilitating the needs of both experts and less experienced users. eggNOG v4.5 is available at http://eggnog.embl.de.
|
50
|
SVD-phy: improved prediction of protein functional associations through singular value decomposition of phylogenetic profiles. Bioinformatics 2015; 32:1085-7. [PMID: 26614125 PMCID: PMC4896368 DOI: 10.1093/bioinformatics/btv696] [Citation(s) in RCA: 71] [Impact Index Per Article: 7.9] [Received: 06/05/2015] [Accepted: 11/24/2015] [Indexed: 11/15/2022]
Abstract
Summary: A successful approach for predicting functional associations between non-homologous genes is to compare their phylogenetic distributions. We have devised a phylogenetic profiling algorithm, SVD-Phy, which uses truncated singular value decomposition to address the problem of uninformative profiles giving rise to false positive predictions. Benchmarking the algorithm against the KEGG pathway database, we found that it has substantially improved performance over existing phylogenetic profiling methods. Availability and implementation: The software is available under the open-source BSD license at https://bitbucket.org/andrea/svd-phy Contact: lars.juhl.jensen@cpr.ku.dk Supplementary information: Supplementary data are available at Bioinformatics online.
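The idea of comparing phylogenetic profiles after a truncated singular value decomposition can be sketched as follows. This is a toy illustration, not the SVD-Phy implementation: the binary presence/absence profiles, the unit-length row normalisation, the Euclidean distance, and all function names are assumptions. To stay standard-library only, the leading left singular vectors are obtained by power iteration with deflation on the Gram matrix rather than by a library SVD routine.

```python
import math


def leading_left_singular_vectors(matrix, k, iterations=500):
    """Return the k leading left singular vectors of `matrix` (one list each).

    They are the top eigenvectors of the Gram matrix M * M^T, found here by
    power iteration followed by deflation of each converged component.
    """
    n = len(matrix)
    gram = [[sum(a * b for a, b in zip(matrix[i], matrix[j])) for j in range(n)]
            for i in range(n)]
    vectors = []
    for _component in range(k):
        vec = [1.0 / (i + 1) for i in range(n)]  # fixed, deterministic start
        for _step in range(iterations):
            product = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in product))
            if norm == 0.0:
                break  # nothing left to extract
            vec = [x / norm for x in product]
        eigenvalue = sum(vec[i] * gram[i][j] * vec[j]
                         for i in range(n) for j in range(n))
        # Deflate: subtract the converged component before finding the next one.
        gram = [[gram[i][j] - eigenvalue * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
        vectors.append(vec)
    return vectors


def embed_profiles(profiles, k):
    """Map each gene to its unit-normalised row in the truncated basis."""
    genes = sorted(profiles)
    matrix = [[float(x) for x in profiles[gene]] for gene in genes]
    columns = leading_left_singular_vectors(matrix, k)
    embedding = {}
    for i, gene in enumerate(genes):
        row = [column[i] for column in columns]
        length = math.sqrt(sum(x * x for x in row))
        embedding[gene] = [x / length for x in row] if length > 0.0 else row
    return embedding


def profile_distance(embedding, gene_a, gene_b):
    """Euclidean distance between two genes in the truncated space."""
    return math.dist(embedding[gene_a], embedding[gene_b])


# Hypothetical presence/absence profiles across six genomes.
toy_profiles = {
    "g1": [1, 1, 0, 0, 1, 0],
    "g2": [1, 1, 0, 0, 1, 0],
    "g3": [0, 0, 1, 1, 0, 1],
    "g4": [0, 0, 1, 1, 0, 0],
}
toy_embedding = embed_profiles(toy_profiles, k=2)
```

In this toy data, genes with identical or near-identical profiles (g1 with g2, g3 with g4) end up close together in the two-dimensional space, while genes with disjoint profiles (g1 and g3) stay far apart. Keeping only the leading components discards the low-variance directions, which is one way to read the abstract's point about uninformative profiles.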
|