1
Serrano E, Chandrasekaran SN, Bunten D, Brewer KI, Tomkinson J, Kern R, Bornholdt M, Fleming S, Pei R, Arevalo J, Tsang H, Rubinetti V, Tromans-Coia C, Becker T, Weisbart E, Bunne C, Kalinin AA, Senft R, Taylor SJ, Jamali N, Adeboye A, Abbasi HS, Goodman A, Caicedo JC, Carpenter AE, Cimini BA, Singh S, Way GP. Reproducible image-based profiling with Pycytominer. ArXiv 2023:arXiv:2311.13417v1. [PMID: 38045474 PMCID: PMC10690292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2023]
Abstract
Technological advances in high-throughput microscopy have facilitated the acquisition of cell images at a rapid pace, and data pipelines can now extract and process thousands of image-based features from microscopy images. These features represent valuable single-cell phenotypes that contain information about cell state and biological processes. The use of these features for biological discovery is known as image-based or morphological profiling. However, these raw features need processing before use and image-based profiling lacks scalable and reproducible open-source software. Inconsistent processing across studies makes it difficult to compare datasets and processing steps, further delaying the development of optimal pipelines, methods, and analyses. To address these issues, we present Pycytominer, an open-source software package with a vibrant community that establishes an image-based profiling standard. Pycytominer has a simple, user-friendly Application Programming Interface (API) that implements image-based profiling functions for processing high-dimensional morphological features extracted from microscopy images of cells. Establishing Pycytominer as a standard image-based profiling toolkit ensures consistent data processing pipelines with data provenance, therefore minimizing potential inconsistencies and enabling researchers to confidently derive accurate conclusions and discover novel insights from their data, thus driving progress in our field.
2
Abdill RJ, Graham SP, Rubinetti V, Albert FW, Greene CS, Davis S, Blekhman R. Integration of 168,000 samples reveals global patterns of the human gut microbiome. bioRxiv 2023:2023.10.11.560955. [PMID: 37873416 PMCID: PMC10592789 DOI: 10.1101/2023.10.11.560955] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 10/25/2023]
Abstract
Understanding the factors that shape variation in the human microbiome is a major goal of research in biology. While other genomics fields have used large, pre-compiled compendia to extract systematic insights requiring otherwise impractical sample sizes, there has been no comparable resource for the 16S rRNA sequencing data commonly used to quantify microbiome composition. To help close this gap, we have assembled a set of 168,484 publicly available human gut microbiome samples, processed with a single pipeline and combined into the largest unified microbiome dataset to date. We use this resource, which is freely available at microbiomap.org, to shed light on global variation in the human gut microbiome. We find that Firmicutes, particularly Bacilli and Clostridia, are almost universally present in the human gut. At the same time, the relative abundances of the 65 most common microbial genera differ between at least two world regions. We also show that gut microbiomes in undersampled world regions, such as Central and Southern Asia, differ significantly from the more thoroughly characterized microbiomes of Europe and Northern America. Moreover, humans in these overlooked regions likely harbor hundreds of taxa that have not yet been discovered due to this undersampling, highlighting the need for diversity in microbiome studies. We anticipate that this new compendium can serve the community and enable advanced applied and methodological research.
Affiliation(s)
- Richard J. Abdill
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, USA
- Samantha P. Graham
  - Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, USA
- Vincent Rubinetti
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
  - Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, CO, USA
- Frank W. Albert
  - Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, USA
- Casey S. Greene
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
  - Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, CO, USA
- Sean Davis
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
  - Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, CO, USA
- Ran Blekhman
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, Illinois, USA
3
Avila R, Rubinetti V, Zhou X, Hu D, Qian Z, Cano MA, Rodolpho E, Tsueng G, Greene C, Wu C. MyGeneset.info: an interactive and programmatic platform for community-curated and user-created collections of genes. Nucleic Acids Res 2023; 51:W350-W356. [PMID: 37070209 PMCID: PMC10481249 DOI: 10.1093/nar/gkad289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2023] [Revised: 03/28/2023] [Accepted: 04/13/2023] [Indexed: 04/19/2023]
Abstract
Gene definitions and identifiers can be painful to manage, more so when trying to include gene function annotations, as this can be highly context-dependent. Creating groups of genes or gene sets can help provide such context, but it compounds the issue as each gene within the gene set can map to multiple identifiers and have annotations derived from multiple sources. We developed MyGeneset.info to provide an API for integrated annotations for gene sets suitable for use in analytical pipelines or web servers. Leveraging our previous work with MyGene.info (a server that provides gene-centric annotations and identifiers), MyGeneset.info addresses the challenge of managing gene sets from multiple resources. With our API, users readily have read-only access to gene sets imported from commonly-used resources such as WikiPathways, CTD, Reactome, SMPDB, MSigDB, GO, and DO. In addition to supporting the access and reuse of approximately 180k gene sets from humans, common model organisms (mice, yeast, etc.), and less-common ones (e.g. black cottonwood tree), MyGeneset.info supports user-created gene sets, providing an important means for making gene sets more FAIR. User-created gene sets can serve as a way to store and manage collections for analysis or easy dissemination through a consistent API.
Affiliation(s)
- Ricardo Avila
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Vincent Rubinetti
  - Department of Biochemistry and Molecular Genetics, Center for Health AI, University of Colorado School of Medicine, Aurora, CO, USA
- Xinghua Zhou
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Dongbo Hu
  - Department of Biochemistry and Molecular Genetics, Center for Health AI, University of Colorado School of Medicine, Aurora, CO, USA
- Zhongchao Qian
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Marco Alvarado Cano
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Everaldo Rodolpho
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Ginger Tsueng
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
- Casey Greene
  - Department of Biochemistry and Molecular Genetics, Center for Health AI, University of Colorado School of Medicine, Aurora, CO, USA
- Chunlei Wu
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
4
Nicholson DN, Alquaddoomi F, Rubinetti V, Greene CS. Changing word meanings in biomedical literature reveal pandemics and new technologies. BioData Min 2023; 16:16. [PMID: 37147665 PMCID: PMC10161184 DOI: 10.1186/s13040-023-00332-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Accepted: 04/24/2023] [Indexed: 05/07/2023]
Abstract
While we often think of words as having a fixed meaning that we use to describe a changing world, words are also dynamic and changing. Scientific research can also be remarkably fast-moving, with new concepts or approaches rapidly gaining mind share. We examined scientific writing, both preprint and pre-publication peer-reviewed text, to identify terms that have changed and examine their use. One particular challenge that we faced was that the shift from closed to open access publishing meant that the size of available corpora changed by over an order of magnitude in the last two decades. We developed an approach to evaluate semantic shift by accounting for both intra- and inter-year variability using multiple integrated models. This analysis revealed thousands of change points in both corpora, including for terms such as 'cas9', 'pandemic', and 'sars'. We found that the consistent change-points between pre-publication peer-reviewed and preprinted text are largely related to the COVID-19 pandemic. We also created a web app for exploration that allows users to investigate individual terms (https://greenelab.github.io/word-lapse/). To our knowledge, our research is the first to examine semantic shift in biomedical preprints and pre-publication peer-reviewed text, and provides a foundation for future work to understand how terms acquire new meanings and how peer review affects this process.
Affiliation(s)
- David N Nicholson
  - Genomics and Computational Biology Program, University of Pennsylvania, Philadelphia, PA, USA
- Faisal Alquaddoomi
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
  - Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, CO, USA
- Vincent Rubinetti
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
  - Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, CO, USA
- Casey S Greene
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
  - Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, CO, USA
5
Himmelstein DS, Zietz M, Rubinetti V, Kloster K, Heil BJ, Alquaddoomi F, Hu D, Nicholson DN, Hao Y, Sullivan BD, Nagle MW, Greene CS. Hetnet connectivity search provides rapid insights into how two biomedical entities are related. bioRxiv 2023:2023.01.05.522941. [PMID: 36711546 PMCID: PMC9882000 DOI: 10.1101/2023.01.05.522941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2023]
Abstract
Hetnets, short for "heterogeneous networks", contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet, connects 11 types of nodes (including genes, diseases, drugs, pathways, and anatomical structures) with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious not only how metformin is related to breast cancer, but also how the GJA1 gene might be involved in insomnia. We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any two nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). We find that predictions are broadly similar to those from previously described supervised approaches for certain node type pairs. Scoring of individual paths is based on the most specific paths of a given type. Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. We implemented the method on Hetionet and provide an online interface at https://het.io/search. We provide an open source implementation of these methods in our new Python package named hetmatpy.
Affiliation(s)
- Daniel S. Himmelstein
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
  - Related Sciences
- Michael Zietz
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
  - Department of Biomedical Informatics, Columbia University, New York, New York, United States of America
- Vincent Rubinetti
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
  - Center for Health AI, University of Colorado School of Medicine, Aurora, Colorado, United States of America
- Kyle Kloster
  - Carbon, Inc.
  - Department of Computer Science, North Carolina State University, Raleigh, North Carolina, United States of America
- Benjamin J. Heil
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania
- Faisal Alquaddoomi
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado, United States of America
  - Center for Health AI, University of Colorado School of Medicine, Aurora, Colorado, United States of America
- Dongbo Hu
  - Department of Pathology, Perelman School of Medicine University of Pennsylvania, Philadelphia PA, USA
- David N. Nicholson
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine University of Pennsylvania, Philadelphia PA, USA
- Yun Hao
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia PA, USA
- Michael W. Nagle
  - Integrative Biology, Internal Medicine Research Unit, Worldwide Research, Development, and Medicine, Pfizer Inc, Cambridge, Massachusetts, United States of America
  - Neurogenomics, Translational Sciences, Neurology Business Group, Eisai Inc, Cambridge, Massachusetts, United States of America
- Casey S. Greene
  - Correspondence possible via GitHub Issues or Casey S. Greene <>
6
Himmelstein DS, Zietz M, Rubinetti V, Kloster K, Heil BJ, Alquaddoomi F, Hu D, Nicholson DN, Hao Y, Sullivan BD, Nagle MW, Greene CS. Hetnet connectivity search provides rapid insights into how biomedical entities are related. Gigascience 2022; 12:giad047. [PMID: 37503959 PMCID: PMC10375517 DOI: 10.1093/gigascience/giad047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2023] [Revised: 04/14/2023] [Accepted: 06/06/2023] [Indexed: 07/29/2023]
Abstract
BACKGROUND: Hetnets, short for "heterogeneous networks," contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet, connects 11 types of nodes (including genes, diseases, drugs, pathways, and anatomical structures) with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious about not only how metformin is related to breast cancer but also how a given gene might be involved in insomnia. FINDINGS: We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any 2 nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. CONCLUSION: We implemented the method on Hetionet and provide an online interface at https://het.io/search. We provide an open-source implementation of these methods in our new Python package named hetmatpy.
Affiliation(s)
- Daniel S Himmelstein
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Related Sciences, Denver, CO 80202, USA
- Michael Zietz
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Vincent Rubinetti
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA
- Kyle Kloster
  - Carbon, Inc., Redwood City, CA 94063, USA
  - Department of Computer Science, North Carolina State University, Raleigh, NC 27606, USA
- Benjamin J Heil
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Faisal Alquaddoomi
  - Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO 80045, USA
- Dongbo Hu
  - Department of Pathology, Perelman School of Medicine University of Pennsylvania, Philadelphia, PA 19104, USA
- David N Nicholson
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Yun Hao
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Blair D Sullivan
  - School of Computing, University of Utah, Salt Lake City, UT 84112, USA
- Michael W Nagle
  - Integrative Biology, Internal Medicine Research Unit, Worldwide Research, Development, and Medicine, Pfizer Inc, Cambridge, MA 02139, USA
  - Human Biology Integration Foundation, Deep Human Biology Learning, Eisai Inc., Cambridge, MA 02140, USA
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Center for Health AI, University of Colorado School of Medicine, Aurora, CO 80045, USA
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO 80045, USA
7
Nicholson DN, Rubinetti V, Hu D, Thielk M, Hunter LE, Greene CS. Examining linguistic shifts between preprints and publications. PLoS Biol 2022; 20:e3001470. [PMID: 35104289 PMCID: PMC8806061 DOI: 10.1371/journal.pbio.3001470] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/12/2021] [Accepted: 11/05/2021] [Indexed: 11/19/2022]
Abstract
Preprints allow researchers to make their findings available to the scientific community before they have undergone peer review. Studies on preprints within bioRxiv have been largely focused on article metadata and how often these preprints are downloaded, cited, published, and discussed online. A missing element that has yet to be examined is the language contained within the bioRxiv preprint repository. We sought to compare and contrast linguistic features within bioRxiv preprints to published biomedical text as a whole, as this is an excellent opportunity to examine how peer review changes these documents. The most prevalent features that changed appear to be associated with typesetting and mentions of supporting information sections or additional files. In addition to text comparison, we created document embeddings derived from a preprint-trained word2vec model. We found that these embeddings are able to parse out different scientific approaches and concepts, link unannotated preprint-peer-reviewed article pairs, and identify journals that publish linguistically similar papers to a given preprint. We also used these embeddings to examine factors associated with the time elapsed between the posting of a first preprint and the appearance of a peer-reviewed publication. We found that preprints with more versions posted and more textual changes took longer to publish. Lastly, we constructed a web application (https://greenelab.github.io/preprint-similarity-search/) that allows users to identify which journals and articles are most linguistically similar to a bioRxiv or medRxiv preprint as well as observe where the preprint would be positioned within a published article landscape.
Affiliation(s)
- David N. Nicholson
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Vincent Rubinetti
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
  - Center for Health AI, University of Colorado School of Medicine, Aurora, Colorado, United States of America
- Dongbo Hu
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- Marvin Thielk
  - Elsevier, Philadelphia, Pennsylvania, United States of America
- Lawrence E. Hunter
  - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, United States of America
- Casey S. Greene
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
  - Center for Health AI, University of Colorado School of Medicine, Aurora, Colorado, United States of America
  - Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, Colorado, United States of America
8
Rando HM, Boca SM, McGowan LD, Himmelstein DS, Robson MP, Rubinetti V, Velazquez R, Greene CS, Gitter A. An Open-Publishing Response to the COVID-19 Infodemic. ArXiv 2021:arXiv:2109.08633v1. [PMID: 34545336 PMCID: PMC8452106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]
Abstract
The COVID-19 pandemic catalyzed the rapid dissemination of papers and preprints investigating the disease and its associated virus, SARS-CoV-2. The multifaceted nature of COVID-19 demands a multidisciplinary approach, but the urgency of the crisis combined with the need for social distancing measures present unique challenges to collaborative science. We applied a massive online open publishing approach to this problem using Manubot. Through GitHub, collaborators summarized and critiqued COVID-19 literature, creating a review manuscript. Manubot automatically compiled citation information for referenced preprints, journal publications, websites, and clinical trials. Continuous integration workflows retrieved up-to-date data from online sources nightly, regenerating some of the manuscript's figures and statistics. Manubot rendered the manuscript into PDF, HTML, LaTeX, and DOCX outputs, immediately updating the version available online upon the integration of new content. Through this effort, we organized over 50 scientists from a range of backgrounds who evaluated over 1,500 sources and developed seven literature reviews. While many efforts from the computational community have focused on mining COVID-19 literature, our project illustrates the power of open publishing to organize both technical and non-technical scientists to aggregate and disseminate information in response to an evolving crisis.
Affiliation(s)
- Halie M Rando
  - University of Colorado School of Medicine, Center for Health AI, Aurora, CO, USA
  - University of Colorado School of Medicine, Department of Biochemistry and Molecular Genetics, Aurora, CO, USA
  - University of Pennsylvania, Perelman School of Medicine, Department of Systems Pharmacology and Translational Therapeutics, Philadelphia, PA, USA
- Simina M Boca
  - Georgetown University Medical Center, Innovation Center for Biomedical Informatics, Washington, DC, USA
- Daniel S Himmelstein
  - University of Pennsylvania, Perelman School of Medicine, Department of Systems Pharmacology and Translational Therapeutics, Philadelphia, PA, USA
  - Related Sciences
- Michael P Robson
  - Villanova University, Department of Computing Sciences, Villanova, PA, USA
- Vincent Rubinetti
  - University of Colorado School of Medicine, Center for Health AI, Aurora, CO, USA
  - University of Pennsylvania, Perelman School of Medicine, Department of Systems Pharmacology and Translational Therapeutics, Philadelphia, PA, USA
- Casey S Greene
  - University of Colorado School of Medicine, Center for Health AI, Aurora, CO, USA
  - University of Colorado School of Medicine, Department of Biochemistry and Molecular Genetics, Aurora, CO, USA
  - University of Pennsylvania, Perelman School of Medicine, Department of Systems Pharmacology and Translational Therapeutics, Philadelphia, PA, USA
  - Alex's Lemonade Stand Foundation, Childhood Cancer Data Lab, Philadelphia, PA, USA
- Anthony Gitter
  - University of Wisconsin-Madison, Department of Biostatistics and Medical Informatics, Madison, WI, USA
  - Morgridge Institute for Research, Madison, WI, USA
9
Rando HM, Boca SM, McGowan LD, Himmelstein DS, Robson MP, Rubinetti V, Velazquez R, Greene CS, Gitter A. An Open-Publishing Response to the COVID-19 Infodemic. CEUR Workshop Proc 2021; 2976:29-38. [PMID: 35558551 PMCID: PMC9093051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
Abstract
The COVID-19 pandemic catalyzed the rapid dissemination of papers and preprints investigating the disease and its associated virus, SARS-CoV-2. The multifaceted nature of COVID-19 demands a multidisciplinary approach, but the urgency of the crisis combined with the need for social distancing measures present unique challenges to collaborative science. We applied a massive online open publishing approach to this problem using Manubot. Through GitHub, collaborators summarized and critiqued COVID-19 literature, creating a review manuscript. Manubot automatically compiled citation information for referenced preprints, journal publications, websites, and clinical trials. Continuous integration workflows retrieved up-to-date data from online sources nightly, regenerating some of the manuscript's figures and statistics. Manubot rendered the manuscript into PDF, HTML, LaTeX, and DOCX outputs, immediately updating the version available online upon the integration of new content. Through this effort, we organized over 50 scientists from a range of backgrounds who evaluated over 1,500 sources and developed seven literature reviews. While many efforts from the computational community have focused on mining COVID-19 literature, our project illustrates the power of open publishing to organize both technical and non-technical scientists to aggregate and disseminate information in response to an evolving crisis.
Affiliation(s)
- Halie M. Rando
  - University of Colorado School of Medicine, Center for Health AI, Aurora, CO, USA
  - University of Colorado School of Medicine, Department of Biochemistry and Molecular Genetics, Aurora, CO, USA
  - University of Pennsylvania, Perelman School of Medicine, Department of Systems Pharmacology and Translational Therapeutics, Philadelphia, PA, USA
- Simina M. Boca
  - Georgetown University Medical Center, Innovation Center for Biomedical Informatics, Washington, DC, USA
- Daniel S. Himmelstein
  - University of Pennsylvania, Perelman School of Medicine, Department of Systems Pharmacology and Translational Therapeutics, Philadelphia, PA, USA
  - Related Sciences
- Michael P. Robson
  - Villanova University, Department of Computing Sciences, Villanova, PA, USA
- Vincent Rubinetti
  - University of Colorado School of Medicine, Center for Health AI, Aurora, CO, USA
  - University of Pennsylvania, Perelman School of Medicine, Department of Systems Pharmacology and Translational Therapeutics, Philadelphia, PA, USA
- Casey S. Greene
  - University of Colorado School of Medicine, Center for Health AI, Aurora, CO, USA
  - University of Colorado School of Medicine, Department of Biochemistry and Molecular Genetics, Aurora, CO, USA
  - University of Pennsylvania, Perelman School of Medicine, Department of Systems Pharmacology and Translational Therapeutics, Philadelphia, PA, USA
  - Alex’s Lemonade Stand Foundation, Childhood Cancer Data Lab, Philadelphia, PA, USA
- Anthony Gitter
  - University of Wisconsin-Madison, Department of Biostatistics and Medical Informatics, Madison, WI, USA
  - Morgridge Institute for Research, Madison, WI, USA
10
Way GP, Zietz M, Rubinetti V, Himmelstein DS, Greene CS. Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations. Genome Biol 2020; 21:109. [PMID: 32393369 PMCID: PMC7212571 DOI: 10.1186/s13059-020-02021-3] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 05/07/2019] [Accepted: 04/16/2020] [Indexed: 12/27/2022]
Abstract
BACKGROUND: Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm and latent dimensionality. We sought to determine the extent to which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses.
RESULTS: We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained using an intermediate number of latent dimensionalities. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals such as pathway activity are best identified in models trained with more latent dimensionalities.
CONCLUSIONS: There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.
Affiliation(s)
- Gregory P Way
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
- Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| | - Michael Zietz
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
| | - Vincent Rubinetti
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
| | - Daniel S Himmelstein
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
- Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, 19102, USA
|
11
|
Himmelstein DS, Rubinetti V, Slochower DR, Hu D, Malladi VS, Greene CS, Gitter A. Open collaborative writing with Manubot. PLoS Comput Biol 2019; 15:e1007128. [PMID: 31233491 PMCID: PMC6611653 DOI: 10.1371/journal.pcbi.1007128] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 11/06/2018] [Revised: 07/05/2019] [Accepted: 05/24/2019] [Indexed: 01/08/2023]
Abstract
Open, collaborative research is a powerful paradigm that can immensely strengthen the scientific process by integrating broad and diverse expertise. However, traditional research and multi-author writing processes break down at scale. We present new software named Manubot, available at https://manubot.org, to address the challenges of open scholarly writing. Manubot adopts the contribution workflow used by many large-scale open source software projects to enable collaborative authoring of scholarly manuscripts. With Manubot, manuscripts are written in Markdown and stored in a Git repository to precisely track changes over time. By hosting manuscript repositories publicly, such as on GitHub, multiple authors can simultaneously propose and review changes. A cloud service automatically evaluates proposed changes to catch errors. Publication with Manubot is continuous: When a manuscript's source changes, the rendered outputs are rebuilt and republished to a web page. Manubot automates bibliographic tasks by implementing citation by identifier, where users cite persistent identifiers (e.g. DOIs, PubMed IDs, ISBNs, URLs), whose metadata is then retrieved and converted to a user-specified style. Manubot modernizes publishing to align with the ideals of open science by making it transparent, reproducible, immediate, versioned, collaborative, and free of charge.
Affiliation(s)
- Daniel S. Himmelstein
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
| | - Vincent Rubinetti
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
| | - David R. Slochower
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, San Diego, California, United States of America
| | - Dongbo Hu
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
| | - Venkat S. Malladi
- Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Bioinformatics Core Facility, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
| | - Anthony Gitter
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, United States of America
- Morgridge Institute for Research, Madison, Wisconsin, United States of America
|