1
|
Zorea A, Pellow D, Levin L, Pilosof S, Friedman J, Shamir R, Mizrahi I. Plasmids in the human gut reveal neutral dispersal and recombination that is overpowered by inflammatory diseases. Nat Commun 2024; 15:3147. [PMID: 38605009 PMCID: PMC11009399 DOI: 10.1038/s41467-024-47272-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Accepted: 03/25/2024] [Indexed: 04/13/2024]
Abstract
Plasmids are pivotal in driving bacterial evolution through horizontal gene transfer. Here, we investigated 3467 human gut microbiome samples across continents and disease states, analyzing 11,086 plasmids. Our analyses reveal that plasmid dispersal is predominantly stochastic, indicating neutral processes as the primary driver of their wide distribution. We find that only 20-25% of plasmid DNA is being selected in various disease states, constraining its distribution across hosts. Selective pressures shape specific plasmid segments with distinct ecological functions, influenced by plasmid mobilization lifestyle, antibiotic usage, and inflammatory gut diseases. Notably, these elements are more commonly shared within groups of individuals with similar health conditions, such as Inflammatory Bowel Disease (IBD), regardless of geographic location across continents. These segments contain essential genes such as iron transport mechanisms, a distinctive gut signature of IBD that impacts the severity of inflammation. Our findings shed light on mechanisms driving plasmid dispersal and selection in the human gut, highlighting their role as carriers of vital gene pools impacting bacterial hosts and ecosystem dynamics.
Affiliation(s)
- Alvah Zorea
- National Institute of Biotechnology in the Negev, Ben-Gurion University of the Negev, 8410501, Be'er Sheva, Israel
- Department of Life Sciences, Ben-Gurion University of the Negev, 8410501, Be'er Sheva, Israel
- The Goldman Sonnenfeldt School of Sustainability and Climate Change, Ben-Gurion University of the Negev, 8410501, Be'er Sheva, Israel
| | - David Pellow
- Blavatnik School of Computer Science, Tel Aviv University, 69978, Tel Aviv, Israel
| | - Liron Levin
- Bioinformatics Core Facility, Ilse Katz Institute for Nanoscale Science and Technology, Ben-Gurion University of the Negev, 8410501, Be'er Sheva, Israel
| | - Shai Pilosof
- Department of Life Sciences, Ben-Gurion University of the Negev, 8410501, Be'er Sheva, Israel
- The Goldman Sonnenfeldt School of Sustainability and Climate Change, Ben-Gurion University of the Negev, 8410501, Be'er Sheva, Israel
| | - Jonathan Friedman
- Institute of Environmental Sciences, Hebrew University, Rehovot, Israel
| | - Ron Shamir
- Blavatnik School of Computer Science, Tel Aviv University, 69978, Tel Aviv, Israel
| | - Itzhak Mizrahi
- National Institute of Biotechnology in the Negev, Ben-Gurion University of the Negev, 8410501, Be'er Sheva, Israel
- Department of Life Sciences, Ben-Gurion University of the Negev, 8410501, Be'er Sheva, Israel
- The Goldman Sonnenfeldt School of Sustainability and Climate Change, Ben-Gurion University of the Negev, 8410501, Be'er Sheva, Israel
|
2
|
Pellow D, Pu L, Ekim B, Kotlar L, Berger B, Shamir R, Orenstein Y. Efficient minimizer orders for large values of k using minimum decycling sets. Genome Res 2023; 33:1154-1161. [PMID: 37558282 PMCID: PMC10538483 DOI: 10.1101/gr.277644.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Accepted: 04/20/2023] [Indexed: 08/11/2023]
Abstract
Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long subsequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers than necessary and therefore provide limited improvement in runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders with fewer selected k-mers. Generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus, they cannot help in the many applications that require minimizer orders for larger k. Here, we close the gap of efficient minimizer orders for large values of k by introducing decycling-set-based minimizer orders: new minimizer orders based on minimum decycling sets. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets and can also scale to a larger k. Furthermore, we developed a method that computes the minimizers in a sequence on the fly without keeping the k-mers of a decycling set in memory. This enables the use of these minimizer orders for any value of k. We expect the new orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.
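As a rough illustration of the selection rule described in this abstract, the following Python sketch picks the minimum k-mer in every L-long window of a sequence. The function name `minimizers`, its `order` parameter, and the toy `priority` set are assumptions made for this example only. Plain lexicographic order is the default, and the sketch does not implement the decycling-set-based orders introduced in the paper.

```python
from typing import Callable, List, Tuple


def minimizers(seq: str, k: int, L: int,
               order: Callable[[str], object] = lambda kmer: kmer
               ) -> List[Tuple[int, str]]:
    """Return the distinct (position, k-mer) pairs chosen as window minimizers.

    Every L-long substring of `seq` is one window. Inside a window the k-mer
    that is smallest under `order` is selected (leftmost on ties).
    """
    if not (0 < k <= L <= len(seq)):
        raise ValueError("need 0 < k <= L <= len(seq)")
    selected: List[Tuple[int, str]] = []
    for start in range(len(seq) - L + 1):
        # Start positions of all k-mers that fit inside the current window.
        positions = range(start, start + L - k + 1)
        best = min(positions, key=lambda i: (order(seq[i:i + k]), i))
        # Selected positions never move left between consecutive windows,
        # so comparing with the last entry is enough to avoid duplicates.
        if not selected or selected[-1][0] != best:
            selected.append((best, seq[best:best + k]))
    return selected


# Lexicographic order (the default).
lexicographic = minimizers("ACGTACGT", k=3, L=5)

# Hypothetical priority set: k-mers in the set always rank before the rest,
# which mimics how a precomputed k-mer set can define an order.
priority = {"GTA"}
prioritized = minimizers("ACGTACGT", k=3, L=5,
                         order=lambda kmer: (kmer not in priority, kmer))

print(lexicographic)  # [(0, 'ACG'), (1, 'CGT'), (4, 'ACG')]
print(prioritized)    # [(2, 'GTA'), (4, 'ACG')]
```

In this toy input the prioritized order selects two k-mers where the lexicographic order selects three, which is the kind of reduction in selected k-mers that the abstract is concerned with. The effect on real data depends on the set used.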
Affiliation(s)
- David Pellow
- Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv 6997801, Israel
| | - Lianrong Pu
- Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv 6997801, Israel
| | - Barış Ekim
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Lior Kotlar
- Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
| | - Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Ron Shamir
- Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv 6997801, Israel
| | - Yaron Orenstein
- Department of Computer Science, Bar-Ilan University, Ramat-Gan 5290002, Israel
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 5290002, Israel
|
3
|
Rempel J, Ray I, Hessl E, Vazin J, Zhou Z, Kim S, Zhang X, Ding C, He Z, Pellow D, Cohen A. The Human Right to Water: A 20-Year Comparative Analysis of Arsenic in Rural and Carceral Drinking Water Systems in California. Environ Health Perspect 2022; 130:97701. [PMID: 36129293 PMCID: PMC9491218 DOI: 10.1289/ehp10758] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/08/2021] [Revised: 08/09/2022] [Accepted: 08/17/2022] [Indexed: 05/22/2023]
Affiliation(s)
- Jenny Rempel
- Energy and Resources Group, Rausser College of Natural Resources, University of California (UC)–Berkeley, Berkeley, California, USA
| | - Isha Ray
- Energy and Resources Group, Rausser College of Natural Resources, University of California (UC)–Berkeley, Berkeley, California, USA
| | - Ethan Hessl
- Molecular Environmental Biology, Rausser College of Natural Resources, UC-Berkeley, Berkeley, California, USA
| | - Jasmine Vazin
- Global Environmental Justice Project, UC-Santa Barbara, Santa Barbara, California, USA
| | - Zehui Zhou
- Electrical Engineering and Computer Sciences, College of Engineering, UC-Berkeley, Berkeley, California, USA
| | - Shin Kim
- Electrical Engineering and Computer Sciences, College of Engineering, UC-Berkeley, Berkeley, California, USA
| | - Xuan Zhang
- Electrical Engineering and Computer Sciences, College of Engineering, UC-Berkeley, Berkeley, California, USA
| | - Chiyu Ding
- Electrical Engineering and Computer Sciences, College of Engineering, UC-Berkeley, Berkeley, California, USA
| | - Ziyi He
- Statistics, College of Letters and Science, UC-Berkeley, Berkeley, California, USA
| | - David Pellow
- Environmental Studies Program, UC-Santa Barbara, Santa Barbara, California, USA
| | - Alasdair Cohen
- Department of Population Health Sciences, Virginia Polytechnic Institute and State University (Virginia Tech), Blacksburg, Virginia, USA
- Department of Civil and Environmental Engineering, Virginia Polytechnic Institute and State University (Virginia Tech), Blacksburg, Virginia, USA
|
4
|
Flomin D, Pellow D, Shamir R. Data Set-Adaptive Minimizer Order Reduces Memory Usage in k-Mer Counting. J Comput Biol 2022; 29:825-838. [DOI: 10.1089/cmb.2021.0599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Affiliation(s)
- Dan Flomin
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
| | - David Pellow
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
| | - Ron Shamir
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
|
5
|
Pellow D, Zorea A, Probst M, Furman O, Segal A, Mizrahi I, Shamir R. SCAPP: an algorithm for improved plasmid assembly in metagenomes. Microbiome 2021; 9:144. [PMID: 34172093 PMCID: PMC8228940 DOI: 10.1186/s40168-021-01068-z] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 08/14/2020] [Accepted: 04/01/2021] [Indexed: 05/28/2023]
Abstract
BACKGROUND: Metagenomic sequencing has led to the identification and assembly of many new bacterial genome sequences. These bacteria often contain plasmids: usually small, circular double-stranded DNA molecules that may transfer across bacterial species and confer antibiotic resistance. These plasmids are generally less studied and understood than their bacterial hosts. Part of the reason for this is insufficient computational tools enabling the analysis of plasmids in metagenomic samples. RESULTS: We developed SCAPP (Sequence Contents-Aware Plasmid Peeler), an algorithm and tool to assemble plasmid sequences from metagenomic sequencing. SCAPP builds on some key ideas from the Recycler algorithm while improving plasmid assemblies by integrating biological knowledge about plasmids. We compared the performance of SCAPP to Recycler and metaplasmidSPAdes on simulated metagenomes, real human gut microbiome samples, and a human gut plasmidome dataset that we generated. We also created plasmidome and metagenome data from the same cow rumen sample and used the parallel sequencing data to create a novel assessment procedure. Overall, SCAPP outperformed Recycler and metaplasmidSPAdes across this wide range of datasets. CONCLUSIONS: SCAPP is an easy-to-use Python package that enables the assembly of full plasmid sequences from metagenomic samples. It outperformed existing metagenomic plasmid assemblers in most cases and assembled novel and clinically relevant plasmids in samples we generated such as a human gut plasmidome. SCAPP is open-source software available from: https://github.com/Shamir-Lab/SCAPP.
Affiliation(s)
- David Pellow
- Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, 6997801 Israel
| | - Alvah Zorea
- Department of Life Sciences, Ben-Gurion University of the Negev and the National Institute for Biotechnology in the Negev, Beer-Sheva, 8410501 Israel
| | - Maraike Probst
- Institute of Microbiology, University of Innsbruck, Innsbruck, A-6020 Austria
| | - Ori Furman
- Department of Life Sciences, Ben-Gurion University of the Negev and the National Institute for Biotechnology in the Negev, Beer-Sheva, 8410501 Israel
| | - Arik Segal
- Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, 8410501 Israel
- Soroka University Medical Center, Beer-Sheva, 8410501 Israel
| | - Itzhak Mizrahi
- Department of Life Sciences, Ben-Gurion University of the Negev and the National Institute for Biotechnology in the Negev, Beer-Sheva, 8410501 Israel
| | - Ron Shamir
- Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, 6997801 Israel
|
6
|
Abstract
Motivation: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues. Results: We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence. Availability and Implementation: The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git.
Affiliation(s)
- Guillaume Marçais
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
| | - David Pellow
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
| | - Daniel Bork
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Yaron Orenstein
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, MA, USA
| | - Ron Shamir
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
| | - Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
|
7
|
|
8
|
Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C. Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing. PLoS Comput Biol 2017; 13:e1005777. [PMID: 28968408 PMCID: PMC5645146 DOI: 10.1371/journal.pcbi.1005777] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Received: 02/13/2017] [Revised: 10/17/2017] [Accepted: 09/18/2017] [Indexed: 11/25/2022]
Abstract
With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.
High-throughput sequencing data has been accumulating at an extreme pace. The need to efficiently analyze and process it has become a critical challenge of the field. Many of the data structures and algorithms for this task rely on k-mer sets (DNA words of length k) to represent the sequences in a dataset. The runtime and memory usage of these highly depend on the size of the k-mer sets used. Thus, a minimum-size k-mer hitting set, namely, a set of k-mers that hit (have non-empty overlap with) all sequences, is desirable. In this work, we create universal k-mer hitting sets that hit any L-long sequence. We present several heuristic approaches for constructing such small sets; the approaches vary in the trade-off between the size of the produced set and runtime and memory usage. We show the benefit in practice of using the produced universal k-mer hitting sets compared to minimizers and randomly created hitting sets on the human genome.
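The UHS definition given in this abstract can be checked directly for very small parameters. The Python sketch below is a brute-force test, not the DOCKS heuristic: it enumerates all 4^L sequences, so it is only usable for tiny k and L. The function names `find_unhit_sequence` and `is_universal_hitting_set` are made up for this illustration.

```python
from itertools import product
from typing import Iterable, Optional

ALPHABET = "ACGT"


def find_unhit_sequence(kmers: Iterable[str], k: int, L: int) -> Optional[str]:
    """Return an L-long DNA sequence that contains no k-mer from `kmers`.

    Returns None when every L-long sequence is hit, i.e. the set is a UHS.
    The search is exhaustive and takes time exponential in L.
    """
    kmer_set = set(kmers)
    for letters in product(ALPHABET, repeat=L):
        seq = "".join(letters)
        # A sequence is "hit" if at least one of its k-mers is in the set.
        if not any(seq[i:i + k] in kmer_set for i in range(L - k + 1)):
            return seq
    return None


def is_universal_hitting_set(kmers: Iterable[str], k: int, L: int) -> bool:
    """True if every possible L-long sequence contains a k-mer from `kmers`."""
    return find_unhit_sequence(kmers, k, L) is None


all_2mers = {a + b for a in ALPHABET for b in ALPHABET}

# Dropping "AC" still hits every 3-long sequence, because "AC" cannot
# overlap itself to fill both 2-mer positions of a 3-long sequence.
print(is_universal_hitting_set(all_2mers - {"AC"}, k=2, L=3))  # True

# Dropping "AA" does not: the sequence "AAA" consists only of "AA".
print(find_unhit_sequence(all_2mers - {"AA"}, k=2, L=3))  # AAA
```

The two examples show why some k-mers can be left out of a hitting set while others cannot: a k-mer that can repeat indefinitely on its own, such as "AA", has to be included for any L.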
Affiliation(s)
- Yaron Orenstein
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts, United States of America
| | - David Pellow
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
| | - Guillaume Marçais
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
| | - Ron Shamir
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
- * E-mail: (CK); (RS)
| | - Carl Kingsford
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- * E-mail: (CK); (RS)
|
9
|
Abstract
Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30× with little or no additional memory and with set containment queries that are only 1.3 to 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
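The idea of using k-mer overlap to filter Bloom filter false positives can be sketched as follows. This is a simplified illustration under stated assumptions, not the kBF variants analysed in the paper: the `BloomFilter` class, the SHA-256 based hashing, and the rule "accept a k-mer only if at least one overlapping neighbour is also reported present" are choices made for this example. The rule keeps the filter free of false negatives as long as every stored sequence is longer than k, since each of its k-mers then has a stored neighbour.

```python
import hashlib
from typing import Iterable, Iterator

ALPHABET = "ACGT"


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, num_bits: int, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by prefixing a seed to the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def build_kmer_filter(sequences: Iterable[str], k: int,
                      num_bits: int) -> BloomFilter:
    """Insert every k-mer of every sequence into a fresh Bloom filter."""
    bf = BloomFilter(num_bits)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            bf.add(seq[i:i + k])
    return bf


def contains_with_neighbor(bf: BloomFilter, kmer: str) -> bool:
    """Report `kmer` as present only if some overlapping neighbour is too.

    A true k-mer from a sequence longer than k always has a predecessor or
    successor sharing k-1 characters; a random false positive usually does
    not, so the extra lookups filter out many false positives.
    """
    if kmer not in bf:
        return False
    left = (c + kmer[:-1] for c in ALPHABET)   # possible predecessors
    right = (kmer[1:] + c for c in ALPHABET)   # possible successors
    return any(n in bf for n in left) or any(n in bf for n in right)


sequence = "ACGTACGTTG"
kbf = build_kmer_filter([sequence], k=4, num_bits=4096)
true_kmers = [sequence[i:i + 4] for i in range(len(sequence) - 3)]
print(all(contains_with_neighbor(kbf, km) for km in true_kmers))  # True
```

Each positive query costs up to eight additional lookups in this sketch, which is the kind of query-time overhead the abstract trades for a lower false positive rate. The exact slowdown and FPR reduction reported there apply to the authors' implementations, not to this code.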
Affiliation(s)
- David Pellow
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| | - Carl Kingsford
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
|
10
|
|
11
|
Abstract
The role of working-class Asian Americans/Pacific Islanders (AAPI) in Silicon Valley’s high technology revolution has been obscured by imposed silences, erasures, and a fixation on the relatively few who have become wealthy from the electronics boom. In this article we consider the thousands of Asians/Pacific Islanders who make Silicon Valley possible by producing the hardware that runs the machinery upon which this modern-day empire was built. In particular, we address the health hazards experienced by those involved in home-based piecework. In addition, we consider a range of industry practices that produce and reinforce oppression among these workers. The low profile of working-class AAPI workers in Silicon Valley enables industry to withhold occupational and environmental safety improvements, repress efforts to organize unions, and maintain oppressive workplace cultures. Finally, we examine oppositional strategies among AAPI laborers to make themselves seen and heard on the shopfloor and in the community.
|
12
|
Affiliation(s)
- D Pellow
- Department of Anthropology, Syracuse University, New York 13244-1200, USA
|