1
Song J, Kurgan L. Availability of web servers significantly boosts citations rates of bioinformatics methods for protein function and disorder prediction. Bioinformatics Advances 2023; 3:vbad184. [PMID: 38146538 PMCID: PMC10749743 DOI: 10.1093/bioadv/vbad184] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Revised: 12/08/2023] [Accepted: 12/15/2023] [Indexed: 12/27/2023]
Abstract
Motivation: Development of bioinformatics methods is a long, complex and resource-hungry process. Hundreds of these tools were released. While some methods are highly cited and used, many suffer relatively low citation rates. We empirically analyze a large collection of recently released methods in three diverse protein function and disorder prediction areas to identify key factors that contribute to increased citations.
Results: We show that provision of a working web server significantly boosts citation rates. On average, methods with working web servers generate three times as many citations compared to tools that are available as only source code, have no code and no server, or are no longer available. This observation holds consistently across different research areas and publication years. We also find that differences in predictive performance are unlikely to impact citation rates. Overall, our empirical results suggest that a relatively low-cost investment into the provision and long-term support of web servers would substantially increase the impact of bioinformatics tools.
Affiliation(s)
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, VIC 3800, Australia
  - Monash Data Futures Institute, Monash University, Clayton, VIC 3800, Australia
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, United States
2
Insana G, Ignatchenko A, Martin M, Bateman A. MBDBMetrics: an online metrics tool to measure the impact of biological data resources. Bioinformatics Advances 2023; 3:vbad180. [PMID: 38130879 PMCID: PMC10733715 DOI: 10.1093/bioadv/vbad180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Revised: 11/13/2023] [Indexed: 12/23/2023]
Abstract
Motivation: There now exist thousands of molecular biology databases covering every aspect of biological data. This database infrastructure takes significant effort and funding to develop and maintain. The creators of these databases need to make strong justifications to funders to prove their impact or importance. There are many publication metrics and tools available, such as Google Scholar to measure citation impact or AltMetrics covering multiple measures including social media coverage.
Results: In this article, we describe a series of novel impact metrics that have been applied initially to the UniProt database, and are now made available via a Google Colab to enable any molecular biology resource to gain several additional metrics. These metrics, powered by freely available APIs from Europe PubMedCentral and SureCHEMBL, cover mentions of the resource in full text articles, including which section of the paper the mention occurs in, grant acknowledgements and mentions in patent applications. This tool, which we call MBDBMetrics, is a useful adjunct to existing tools.
Availability and implementation: The MBDBMetrics tool is available at the following locations: https://colab.research.google.com/drive/1aEmSQR9DGQIZmHAIuQV9mLv7Mw9Ppkin and https://github.com/g-insana/MBDBMetrics.
Affiliation(s)
- Giuseppe Insana
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Alex Ignatchenko
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Maria Martin
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
3
Song Z, Wang Y, Lin P, Yang K, Jiang X, Dong J, Xie S, Rao R, Cui L, Liu F, Huang X. Identification of key modules and driving genes in nonalcoholic fatty liver disease by weighted gene co-expression network analysis. BMC Genomics 2023; 24:414. [PMID: 37488473 PMCID: PMC10364401 DOI: 10.1186/s12864-023-09458-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/07/2022] [Accepted: 06/16/2023] [Indexed: 07/26/2023] Open
Abstract
BACKGROUND: Nonalcoholic fatty liver disease (NAFLD) is characterized by excessive liver fat deposition, and progresses to liver cirrhosis, and even hepatocellular carcinoma. However, the invasive diagnosis of NAFLD with histopathological evaluation remains risky. This study investigated potential genes correlated with NAFLD, which may serve as diagnostic biomarkers and even potential treatment targets.
METHODS: Weighted gene co-expression network analysis (WGCNA) was performed on dataset E-MEXP-3291. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses were performed to evaluate the function of genes.
RESULTS: The blue module was positively correlated, and the turquoise module negatively correlated, with the severity of NAFLD. Furthermore, 8 driving genes (ANXA9, FBXO2, ORAI3, NAGS, C/EBPα, CRYAA, GOLM1, TRIM14) were identified from the overlap of genes in the blue module and GSE89632, and another 8 driving genes were identified from the overlap of the turquoise module and GSE89632. Among these driving genes, C/EBPα (CCAAT/enhancer binding protein α) was the most notable. By validating the expression of C/EBPα in the liver of NAFLD mice using immunohistochemistry, we discovered a significant upregulation of C/EBPα protein in NAFLD.
CONCLUSION: We identified two modules and 16 driving genes associated with the progression of NAFLD, and confirmed the protein expression of C/EBPα, which had received little attention in the context of NAFLD. Our study will advance the understanding of NAFLD. Moreover, these driving genes may serve as biomarkers and therapeutic targets of NAFLD.
Affiliation(s)
- Zhengmao Song
  - The Fifth Hospital of Xiamen & Xiamen University, Xiamen, China
- Yun Wang
  - The Fifth Hospital of Xiamen & Xiamen University, Xiamen, China
- Pingli Lin
  - The Fifth Hospital of Xiamen & Xiamen University, Xiamen, China
- Kaichun Yang
  - The Fifth Hospital of Xiamen & Xiamen University, Xiamen, China
- Xilin Jiang
  - Zhongshan Hospital, Xiamen University, Xiamen, China
  - School of Medicine, Xiamen University, Xiamen, China
- Junchen Dong
  - School of Medicine, Xiamen University, Xiamen, China
- Shangjin Xie
  - Xiang'an Hospital, Xiamen University, Xiamen, China
- Rong Rao
  - The Fifth Hospital of Xiamen & Xiamen University, Xiamen, China
- Lishan Cui
  - The Fifth Hospital of Xiamen & Xiamen University, Xiamen, China
- Feng Liu
  - The Fifth Hospital of Xiamen & Xiamen University, Xiamen, China
  - Xiang'an Hospital, Xiamen University, Xiamen, China
- Xuefeng Huang
  - Zhongshan Hospital, Xiamen University, Xiamen, China
4
Savonen C, Wright C, Hoffman AM, Muschelli J, Cox K, Tan FJ, Leek JT. Open-source Tools for Training Resources - OTTR. Journal of Statistics and Data Science Education: An Official Journal of the American Statistical Association 2023; 31:57-65. [PMID: 37207236 PMCID: PMC10193921 DOI: 10.1080/26939169.2022.2118646] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2023]
Abstract
Data science and informatics tools are developing at a blistering rate, but their users often lack the educational background or resources to efficiently apply the methods to their research. Training resources and vignettes that accompany these tools often deprecate because their maintenance is not prioritized by funding, giving teams little time to devote to such endeavors. Our group has developed Open-source Tools for Training Resources (OTTR) to offer greater efficiency and flexibility for creating and maintaining these training resources. OTTR empowers creators to customize their work and allows for a simple workflow to publish using multiple platforms. OTTR allows content creators to publish training material to multiple massive online learner communities using familiar rendering mechanics. OTTR allows the incorporation of pedagogical practices like formative and summative assessments in the form of multiple-choice questions and fill-in-the-blank problems that are automatically graded. No local installation of any software is required to begin creating content with OTTR. Thus far, 15 training courses have been created with the OTTR repository template. By using the OTTR system, the maintenance workload for updating these courses across platforms has been drastically reduced. For more information about OTTR and how to get started, go to ottrproject.org.
Affiliation(s)
- Candace Savonen
  - Johns Hopkins Bloomberg School of Public Health, Baltimore, MD
  - Fred Hutchinson Cancer Center, Seattle, WA
  - Corresponding author
- Carrie Wright
  - Johns Hopkins Bloomberg School of Public Health, Baltimore, MD
  - Fred Hutchinson Cancer Center, Seattle, WA
- Ava M. Hoffman
  - Johns Hopkins Bloomberg School of Public Health, Baltimore, MD
  - Fred Hutchinson Cancer Center, Seattle, WA
- John Muschelli
  - Johns Hopkins Bloomberg School of Public Health, Baltimore, MD
- Katherine Cox
  - Johns Hopkins Bloomberg School of Public Health, Baltimore, MD
- Jeffrey T. Leek
  - Johns Hopkins Bloomberg School of Public Health, Baltimore, MD
  - Fred Hutchinson Cancer Center, Seattle, WA
5
Steenwyk JL, Buida TJ III, Gonçalves C, Goltz DC, Morales G, Mead ME, LaBella AL, Chavez CM, Schmitz JE, Hadjifrangiskou M, Li Y, Rokas A. BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data. Genetics 2022; 221:6583183. [PMID: 35536198 PMCID: PMC9252278 DOI: 10.1093/genetics/iyac079] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 02/21/2022] [Accepted: 05/03/2022] [Indexed: 11/14/2022] Open
Abstract
Bioinformatic analysis (such as genome assembly quality assessment, alignment summary statistics, relative synonymous codon usage, file format conversion, and processing and analysis) is integrated into diverse disciplines in the biological sciences. Several command-line pieces of software have been developed to conduct some of these individual analyses, but unified toolkits that conduct all these analyses are lacking. To address this gap, we introduce BioKIT, a versatile command-line toolkit that has, upon publication, 42 functions, several of which were community-sourced, that conduct routine and novel processing and analysis of genome assemblies, multiple sequence alignments, coding sequences, sequencing data, and more. To demonstrate the utility of BioKIT, we conducted a comprehensive examination of relative synonymous codon usage across 171 fungal genomes that use alternative genetic codes, showed that the novel metric of gene-wise relative synonymous codon usage can accurately estimate gene-wise codon optimization, evaluated the quality and characteristics of 901 eukaryotic genome assemblies, and calculated alignment summary statistics for 10 phylogenomic data matrices. BioKIT will be helpful in facilitating and streamlining sequence analysis workflows. BioKIT is freely available under the MIT license from GitHub (https://github.com/JLSteenwyk/BioKIT), PyPi (https://pypi.org/project/jlsteenwyk-biokit/), and the Anaconda Cloud (https://anaconda.org/jlsteenwyk/jlsteenwyk-biokit). Documentation, user tutorials, and instructions for requesting new features are available online (https://jlsteenwyk.com/BioKIT).
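Relative synonymous codon usage (RSCU), one of the analyses named in this abstract, is conventionally defined as a codon's observed count divided by the count expected if all synonymous codons for the same amino acid were used equally. The following is a minimal Python sketch of that textbook definition. It is not BioKIT's implementation: the function name `rscu` is hypothetical, and the codon table is a deliberately tiny subset of the standard genetic code (phenylalanine and lysine only) chosen for illustration.

```python
from collections import Counter

# Assumption: an illustrative two-amino-acid subset of the standard genetic
# code, used only to keep the sketch short.
SYNONYMOUS_CODONS = {
    "Phe": ["TTT", "TTC"],
    "Lys": ["AAA", "AAG"],
}


def rscu(coding_sequence):
    """Return {codon: RSCU} for the codons listed in SYNONYMOUS_CODONS."""
    seq = coding_sequence.upper()
    # Split into in-frame codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)

    result = {}
    for family in SYNONYMOUS_CODONS.values():
        total = sum(counts[codon] for codon in family)
        if total == 0:
            continue  # amino acid absent from the sequence
        expected = total / len(family)  # count under equal codon usage
        for codon in family:
            result[codon] = counts[codon] / expected
    return result


# Toy input: codons TTT, TTT, TTC, AAA, AAG.
example = rscu("TTTTTTTTCAAAAAG")
```

A value above 1 indicates a codon used more often than expected under equal usage, and a value below 1 indicates one used less often. A real analysis would use a complete codon table for the relevant genetic code.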
Affiliation(s)
- Jacob L Steenwyk
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
- Carla Gonçalves
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
  - Associate Laboratory i4HB-Institute for Health and Bioeconomy, NOVA School of Science and Technology, NOVA University Lisbon, 2819-516 Caparica, Portugal
  - UCIBIO-Applied Molecular Biosciences Unit, Department of Life Sciences, NOVA School of Science and Technology, NOVA University Lisbon, 2819-516 Caparica, Portugal
- Grace Morales
  - Department of Pathology, Microbiology & Immunology, Center for Personalized Microbiology, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Matthew E Mead
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
- Abigail L LaBella
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
- Christina M Chavez
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
- Jonathan E Schmitz
  - Department of Pathology, Microbiology & Immunology, Center for Personalized Microbiology, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Maria Hadjifrangiskou
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
  - Department of Pathology, Microbiology & Immunology, Center for Personalized Microbiology, Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Yuanning Li
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
- Antonis Rokas
  - Department of Biological Sciences, Vanderbilt University, VU Station B #35-1634, Nashville, TN 37235, USA
  - Evolutionary Studies Initiative, Vanderbilt University, Nashville, TN 37235, USA
6
Gardner PP, Paterson JM, McGimpsey S, Ashari-Ghomi F, Umu SU, Pawlik A, Gavryushkin A, Black MA. Sustained software development, not number of citations or journal choice, is indicative of accurate bioinformatic software. Genome Biol 2022; 23:56. [PMID: 35172880 PMCID: PMC8851831 DOI: 10.1186/s13059-022-02625-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 05/05/2021] [Accepted: 02/06/2022] [Indexed: 11/29/2022] Open
Abstract
Background: Computational biology provides software tools for testing and making inferences about biological data. In the face of increasing volumes of data, heuristic methods that trade software speed for accuracy may be employed. We have studied these trade-offs using the results of a large number of independent software benchmarks, and evaluated whether external factors, including speed, author reputation, journal impact, recency and developer efforts, are indicative of accurate software.
Results: We find that software speed, author reputation, journal impact, number of citations and age are unreliable predictors of software accuracy. This is unfortunate because these are frequently cited reasons for selecting software tools. However, GitHub-derived statistics and high version numbers show that accurate bioinformatic software tools are generally the product of many improvements over time. We also find an excess of slow and inaccurate bioinformatic software tools, and this is consistent across many sub-disciplines. There are few tools that are middle-of-road in terms of accuracy and speed trade-offs.
Conclusions: Our findings indicate that accurate bioinformatic software is primarily the product of long-term commitments to software development. In addition, we hypothesise that bioinformatics software suffers from publication bias. Software that is intermediate in terms of both speed and accuracy may be difficult to publish, possibly due to author, editor and reviewer practises. This leaves an unfortunate hole in the literature, as ideal tools may fall into this gap. High accuracy tools are not always useful if they are slow, while high speed is not useful if the results are also inaccurate.
Supplementary Information: The online version contains supplementary material available at (10.1186/s13059-022-02625-x).
Affiliation(s)
- Paul P Gardner
  - Department of Biochemistry, University of Otago, Dunedin, New Zealand
  - Biomolecular Interaction Centre, University of Canterbury, Christchurch, New Zealand
- James M Paterson
  - Department of Civil and Natural Resources Engineering, University of Canterbury, Christchurch, New Zealand
- Fatemeh Ashari-Ghomi
  - Research Group for Genomic Epidemiology, Technical University of Denmark, Kongens Lyngby, Denmark
- Sinan U Umu
  - Department of Research, Cancer Registry of Norway, Oslo, Norway
- Alex Gavryushkin
  - Department of Computer Science, University of Otago, Dunedin, New Zealand
  - School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
- Michael A Black
  - Department of Biochemistry, University of Otago, Dunedin, New Zealand
7
Zhu W, Zhai X, Jia Z, Wang Y, Mo Y. Bioinformatics analysis of sequential gene expression profiling after skin and skeletal muscle wound in mice. Leg Med (Tokyo) 2021; 54:101982. [PMID: 34687982 DOI: 10.1016/j.legalmed.2021.101982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2021] [Revised: 09/26/2021] [Accepted: 10/14/2021] [Indexed: 10/20/2022]
Abstract
It is of great value to use bioinformatics methods to screen the core differentially expressed genes (DEGs) at different times after mouse skin and skeletal muscle wound, and to explore the relationship between them and the wound age. To this end, we downloaded the gene expression profiles GSE140517 and GSE23006 from the NCBI-GEO database and used the GEO2R online tool and Venn diagrams to screen out DEGs at different times and common DEGs. Gene Ontology (GO) analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were carried out through the DAVID website. The STRING tool was used to build a protein-protein interaction (PPI) network, and Cytoscape software was used to screen out core DEGs. The results showed that 13, 53, 43 and 13 core DEGs were screened out in the 6 h, 12 h, 24 h and common-DEGs groups after wound, respectively. There were 7 core DEGs (Cxcl2, Cxcl3, Il1b, Ptgs2, Cxcl1, Timp1, Ccl3) in both the different time point groups and the common-DEGs group. Meanwhile, 1 core DEG (Ccl4) was specifically expressed at 6 h, 29 core DEGs (Isg20, Rtp4, Fcgr1, Ifi44, Trim30a, etc.) were specifically expressed at 12 h, 18 core DEGs (Ccr7, Myd88, Igsf6, Ccr2, Gpsm3, etc.) were specifically expressed at 24 h, and 6 core DEGs (Ccl4, Ccl7, Saa3, Cxcl5, Ccl2, Lcn2) were specifically expressed in the common-DEGs group. The results of GO and KEGG analysis showed that the deterioration and exudation of the inflammatory response were the main cause at 6 h after wound. In addition to inflammation at 12 h and 24 h, the systemic immune response against viral and bacterial infections also gradually increased. In summary, the core DEGs selected in this study have combined characteristics, consistent with the healing function at the corresponding time point, and they also show specificity and correlation with wound age. Therefore, by detecting the changes in the expression of co-expressed core DEGs at different times after wound, as well as detecting specifically expressed DEGs at a specific time point or a specific period of time, it is very promising to provide help for wound age estimation. However, limited by the GSE140517 gene expression profile in the database, only the difference in gene expression at different times within 24 h after wound was explored, and research on later wound ages still needs to go further.
Affiliation(s)
- Weihao Zhu
  - School of Forensic Medicine, Henan University of Science and Technology, Luoyang 471003, China
- Xiandun Zhai
  - School of Forensic Medicine, Henan University of Science and Technology, Luoyang 471003, China
- Zelei Jia
  - School of Forensic Medicine, Henan University of Science and Technology, Luoyang 471003, China
- Yingyi Wang
  - School of Forensic Medicine, Henan University of Science and Technology, Luoyang 471003, China
  - First Affiliated Hospital of Zhengzhou University, Zhengzhou 450046, China
- Yaonan Mo
  - School of Forensic Medicine, Henan University of Science and Technology, Luoyang 471003, China
8
Chasapi A, Promponas VJ, Ouzounis CA. The bioinformatics wealth of nations. Bioinformatics 2020; 36:2963-2965. [PMID: 32129821 PMCID: PMC7203752 DOI: 10.1093/bioinformatics/btaa132] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 12/02/2019] [Revised: 02/16/2020] [Accepted: 02/24/2020] [Indexed: 11/12/2022] Open
Affiliation(s)
- Anastasia Chasapi
  - Biological Computation & Process Lab (BCPL), Chemical Process & Energy Resources Institute (CPERI), Centre for Research & Technology Hellas (CERTH), Thessalonica, GR-57001, Greece
- Vasilis J Promponas
  - Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, Nicosia, CY-2109, Cyprus
- Christos A Ouzounis
  - Biological Computation & Process Lab (BCPL), Chemical Process & Energy Resources Institute (CPERI), Centre for Research & Technology Hellas (CERTH), Thessalonica, GR-57001, Greece
9
Estravis-Barcala M, Mattera MG, Soliani C, Bellora N, Opgenoorth L, Heer K, Arana MV. Molecular bases of responses to abiotic stress in trees. Journal of Experimental Botany 2020; 71:3765-3779. [PMID: 31768543 PMCID: PMC7316969 DOI: 10.1093/jxb/erz532] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Received: 09/09/2019] [Accepted: 11/25/2019] [Indexed: 05/05/2023]
Abstract
Trees are constantly exposed to climate fluctuations, which vary with both time and geographic location. Environmental changes that are outside of the physiologically favorable range usually negatively affect plant performance and trigger responses to abiotic stress. Long-living trees in particular have evolved a wide spectrum of molecular mechanisms to coordinate growth and development under stressful conditions, thus minimizing fitness costs. The ongoing development of techniques directed at quantifying abiotic stress has significantly increased our knowledge of physiological responses in woody plants. However, it is only within recent years that advances in next-generation sequencing and biochemical approaches have enabled us to begin to understand the complexity of the molecular systems that underlie these responses. Here, we review recent progress in our understanding of the molecular bases of drought and temperature stresses in trees, with a focus on functional, transcriptomic, epigenetic, and population genomic studies. In addition, we highlight topics that will contribute to progress in our understanding of the plastic and adaptive responses of woody plants to drought and temperature in a context of global climate change.
Affiliation(s)
- Maximiliano Estravis-Barcala
  - Instituto Andino Patagónico de Tecnologías Biológicas y Geoambientales (Consejo Nacional de Investigaciones Científicas y Técnicas - Universidad Nacional del Comahue), San Carlos de Bariloche, Rio Negro, Argentina
- María Gabriela Mattera
  - Instituto de Investigaciones Forestales y Agropecuarias Bariloche (Instituto Nacional de Tecnología Agropecuaria - Consejo Nacional de Investigaciones Científicas y Técnicas), San Carlos de Bariloche, Rio Negro, Argentina
- Carolina Soliani
  - Instituto de Investigaciones Forestales y Agropecuarias Bariloche (Instituto Nacional de Tecnología Agropecuaria - Consejo Nacional de Investigaciones Científicas y Técnicas), San Carlos de Bariloche, Rio Negro, Argentina
- Nicolás Bellora
  - Instituto Andino Patagónico de Tecnologías Biológicas y Geoambientales (Consejo Nacional de Investigaciones Científicas y Técnicas - Universidad Nacional del Comahue), San Carlos de Bariloche, Rio Negro, Argentina
- Lars Opgenoorth
  - Department of Ecology, Philipps University Marburg, Marburg, Germany
  - Swiss Federal Research Institute WSL, Birmensdorf, Switzerland
- Katrin Heer
  - Department of Conservation Biology, Philipps University Marburg, Marburg, Germany
- María Verónica Arana
  - Instituto de Investigaciones Forestales y Agropecuarias Bariloche (Instituto Nacional de Tecnología Agropecuaria - Consejo Nacional de Investigaciones Científicas y Técnicas), San Carlos de Bariloche, Rio Negro, Argentina
  - Correspondence
10
Brito JJ, Li J, Moore JH, Greene CS, Nogoy NA, Garmire LX, Mangul S. Recommendations to enhance rigor and reproducibility in biomedical research. Gigascience 2020; 9:giaa056. [PMID: 32479592 PMCID: PMC7263079 DOI: 10.1093/gigascience/giaa056] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 02/24/2020] [Revised: 04/08/2020] [Accepted: 05/06/2020] [Indexed: 12/25/2022] Open
Abstract
Biomedical research depends increasingly on computational tools, but mechanisms ensuring open data, open software, and reproducibility are variably enforced by academic institutions, funders, and publishers. Publications may present software for which source code or documentation are or become unavailable; this compromises the role of peer review in evaluating technical strength and scientific contribution. Incomplete ancillary information for an academic software package may bias or limit subsequent work. We provide 8 recommendations to improve reproducibility, transparency, and rigor in computational biology, precisely the values that should be emphasized in life science curricula. Our recommendations for improving software availability, usability, and archival stability aim to foster a sustainable data science ecosystem in life science research.
Affiliation(s)
- Jaqueline J Brito
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089, USA
- Jun Li
  - Department of Computational Medicine & Bioinformatics, Medical School, University of Michigan, 1301 Catherine Street, Ann Arbor, MI 48109, USA
- Jason H Moore
  - Department of Biostatistics, Epidemiology, and Informatics, Institute for Biomedical Informatics, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA 19104, USA
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, 3400 Civic Center Boulevard, Philadelphia, PA 19104, USA
  - Childhood Cancer Data Lab, Alex's Lemonade Stand, 1429 Walnut St, Floor 10, Philadelphia, PA 19102, USA
- Nicole A Nogoy
  - GigaScience, 26/F, Kings Wing Plaza 2, 1 On Kwan Street, Shek Mun, N.T., Hong Kong
- Lana X Garmire
  - Department of Computational Medicine & Bioinformatics, Medical School, University of Michigan, 1301 Catherine Street, Ann Arbor, MI 48109, USA
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089, USA
11
12
Brito JJ, Mosqueiro T, Rotman J, Xue V, Chapski DJ, De la Hoz J, Matias P, Martin LS, Zelikovsky A, Pellegrini M, Mangul S. Telescope: an interactive tool for managing large-scale analysis from mobile devices. Gigascience 2020; 9:giz163. [PMID: 31972019 PMCID: PMC6977584 DOI: 10.1093/gigascience/giz163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2019] [Revised: 11/26/2019] [Accepted: 12/19/2019] [Indexed: 11/27/2022] Open
Abstract
BACKGROUND: In today's world of big data, computational analysis has become a key driver of biomedical research. High-performance computational facilities are capable of processing considerable volumes of data, yet often lack an easy-to-use interface to guide the user in supervising and adjusting bioinformatics analysis via a tablet or smartphone.
RESULTS: To address this gap we proposed Telescope, a novel tool that interfaces with high-performance computational clusters to deliver an intuitive user interface for controlling and monitoring bioinformatics analyses in real-time. By leveraging last generation technology now ubiquitous to most researchers (such as smartphones), Telescope delivers a friendly user experience and manages connectivity and encryption under the hood.
CONCLUSIONS: Telescope helps to mitigate the digital divide between wet and computational laboratories in contemporary biology. By delivering convenience and ease of use through a user experience not relying on expertise with computational clusters, Telescope can help researchers close the feedback loop between bioinformatics and experimental work with minimal impact on the performance of computational tools. Telescope is freely available at https://github.com/Mangul-Lab-USC/telescope.
Collapse
Affiliation(s)
- Jaqueline J Brito
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089-9121, USA
| | - Thiago Mosqueiro
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E. Young Drive East, Los Angeles, CA 90095, USA
| | - Jeremy Rotman
- Department of Computer Science, University of California, Los Angeles, 404 Westwood Plaza, Los Angeles, CA 90095, USA
| | - Victor Xue
- Department of Computer Science, University of California, Los Angeles, 404 Westwood Plaza, Los Angeles, CA 90095, USA
| | - Douglas J Chapski
- Department of Anesthesiology, David Geffen School of Medicine at UCLA, 650 Charles E. Young Drive, Los Angeles, CA 90095, USA
| | - Juan De la Hoz
- Center for Neurobehavioral Genetics, University of California Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA 90095, USA
| | - Paulo Matias
- Department of Computer Science, Federal University of São Carlos, km 325 Rod. Washington Luis, São Carlos, SP 13565–905, Brazil
| | - Lana S Martin
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089-9121, USA
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, 1 Park Place, Atlanta, GA 30303, USA
- The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow 119991, Russia
| | - Matteo Pellegrini
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E. Young Drive East, Los Angeles, CA 90095, USA
| | - Serghei Mangul
- Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, 1985 Zonal Avenue, Los Angeles, CA 90089-9121, USA
| |
|
13
|
Wright Muelas M, Mughal F, O'Hagan S, Day PJ, Kell DB. The role and robustness of the Gini coefficient as an unbiased tool for the selection of Gini genes for normalising expression profiling data. Sci Rep 2019; 9:17960. [PMID: 31784565 PMCID: PMC6884504 DOI: 10.1038/s41598-019-54288-7] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 08/08/2019] [Accepted: 11/08/2019] [Indexed: 12/13/2022] Open
Abstract
We recently introduced the Gini coefficient (GC) for assessing the expression variation of a particular gene in a dataset, as a means of selecting improved reference genes over the cohort ('housekeeping genes') typically used for normalisation in expression profiling studies. Those genes (transcripts) that we determined to be useable as reference genes differed greatly from previous suggestions based on hypothesis-driven approaches. A limitation of this initial study is that a single (albeit large) dataset was employed for both tissues and cell lines. We here extend this analysis to encompass seven other large datasets. Although their absolute values differ a little, the Gini values and median expression levels of the various genes are well correlated with each other between the various cell line datasets, implying that our original choice of the more ubiquitously expressed low-Gini-coefficient genes was indeed sound. In tissues, the Gini values and median expression levels of genes showed a greater variation, with the GC of genes changing with the number and types of tissues in the data sets. In all data sets, regardless of whether they were derived from tissues or cell lines, we also show that the GC is a robust measure of gene expression stability. Using the GC as a measure of expression stability we illustrate its utility to find tissue- and cell line-optimised housekeeping genes without any prior bias, that again include only a small number of previously reported housekeeping genes. We also independently confirmed this experimentally using RT-qPCR with 40 candidate GC genes in a panel of 10 cell lines. These were termed the Gini Genes. In many cases, the variation in the expression levels of classical reference genes is really quite huge (e.g. 44 fold for GAPDH in one data set), suggesting that the cure (of using them as normalising genes) may in some cases be worse than the disease (of not doing so).
We recommend the present data-driven approach for the selection of reference genes by using the easy-to-calculate and robust GC.
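The abstract above calls the GC easy to calculate but does not give a formula. As a minimal sketch, assuming the standard rank-based definition of the Gini coefficient over non-negative expression values (the function name `gini` and the sample inputs are hypothetical, and this is not taken from the authors' pipeline), the calculation could look like this:

```python
def gini(values):
    """Gini coefficient of a collection of non-negative numbers.

    Uses the standard rank-based form on the sorted values x_1 <= ... <= x_n:
        G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n
    Returns 0.0 for an empty or all-zero input (assumed convention).
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Weight each value by its 1-based rank in the sorted order.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical expression levels of one gene across four samples.
stable_gene = [5.0, 5.0, 5.0, 5.0]     # identical everywhere: no inequality
specific_gene = [0.0, 0.0, 0.0, 10.0]  # expressed in a single sample only
print(gini(stable_gene), gini(specific_gene))
```

Under this definition a gene expressed at the same level in every sample scores 0, and a gene expressed in only one of n samples scores (n - 1)/n, so lower values indicate more stable expression across samples.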
Affiliation(s)
- Marina Wright Muelas
- Department of Biochemistry, Institute of Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK.
| | - Farah Mughal
- Department of Biochemistry, Institute of Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK
| | - Steve O'Hagan
- School of Chemistry, Department of Chemistry, The Manchester Institute of Biotechnology 131, Princess Street, Manchester, M1 7DN, UK
- The Manchester Institute of Biotechnology, 131, Princess Street, Manchester, M1 7DN, UK
| | - Philip J Day
- The Manchester Institute of Biotechnology, 131, Princess Street, Manchester, M1 7DN, UK.
- Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, M13 9PL, UK.
| | - Douglas B Kell
- Department of Biochemistry, Institute of Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown Street, Liverpool, L69 7ZB, UK.
- Novo Nordisk Foundation Centre for Biosustainability, Technical University of Denmark, 10 Building 220, Kemitorvet, 2800, Kgs. Lyngby, Denmark.
| |
|
14
|
Abstract
The computer software used for genomic analysis has become a crucial component of the infrastructure for life sciences. However, genomic software is still typically developed in an ad hoc manner, with inadequate funding, and by academic researchers not trained in software development, at substantial costs to the research community. I examine the roots of the incongruity between the importance of and the degree of investment in genomic software, and I suggest several potential remedies for current problems. As genomics continues to grow, new strategies for funding and developing the software that powers the field will become increasingly essential.
Affiliation(s)
- Adam Siepel
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA.
| |
|
15
|
Lee BD, Timony MA, Ruiz P. DNAvisualization.org: a serverless web tool for DNA sequence visualization. Nucleic Acids Res 2019; 47:W20-W25. [PMID: 31170285 PMCID: PMC6602497 DOI: 10.1093/nar/gkz404] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 02/18/2019] [Revised: 04/08/2019] [Accepted: 05/06/2019] [Indexed: 11/23/2022] Open
Abstract
Raw DNA sequences contain an immense amount of meaningful biological information. However, these sequences are hard for humans to intuitively interpret. To solve this problem, a number of methods have been proposed to transform DNA sequences into two-dimensional visualizations. DNAvisualization.org implements several of these methods in a cost effective and performant manner via a novel, entirely serverless architecture. By taking advantage of recent developments in serverless parallel computing and selective data retrieval, the website is able to offer users the ability to visualize up to thirty 4.5 Mb DNA sequences simultaneously using one of five supported methods and to export these visualizations in a variety of publication-ready formats.
Affiliation(s)
- Benjamin D Lee
- In-Q-Tel Lab41, 800 El Camino Real, Suite 300, Menlo Park, CA 94025, USA
- Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115, USA
- School of Engineering and Applied Sciences, Harvard University, 29 Oxford Street, Cambridge, MA 02138, USA
| | - Michael A Timony
- Harvard Medical School, 77 Avenue Louis Pasteur, Boston, MA 02115, USA
- SBGrid Consortium, Harvard Medical School, 250 Longwood Avenue, SGM114, Boston, MA 02115, USA
| | - Pablo Ruiz
- School of Engineering and Applied Sciences, Harvard University, 29 Oxford Street, Cambridge, MA 02138, USA
| |
|
16
|
Interpreting and integrating big data in the life sciences. Emerg Top Life Sci 2019; 3:335-341. [DOI: 10.1042/etls20180175] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/22/2019] [Revised: 05/27/2019] [Accepted: 06/04/2019] [Indexed: 01/22/2023]
Abstract
Recent advances in omics technologies have led to the broad applicability of computational techniques across various domains of life science and medical research. These technologies provide an unprecedented opportunity to collect the omics data from hundreds of thousands of individuals and to study the gene–disease association without the aid of prior assumptions about the trait biology. Despite the many advantages of modern omics technologies, interpretations of big data produced by such technologies require advanced computational algorithms. I outline key challenges that biomedical researchers are facing when interpreting and integrating big omics data. I discuss the reproducibility aspect of big data analysis in the life sciences and review current practices in reproducible research. Finally, I explain the skills that biomedical researchers need to acquire to independently analyze big omics data.
|
17
|
Mangul S, Mosqueiro T, Abdill RJ, Duong D, Mitchell K, Sarwal V, Hill B, Brito J, Littman RJ, Statz B, Lam AKM, Dayama G, Grieneisen L, Martin LS, Flint J, Eskin E, Blekhman R. Challenges and recommendations to improve the installability and archival stability of omics computational tools. PLoS Biol 2019; 17:e3000333. [PMID: 31220077 PMCID: PMC6605654 DOI: 10.1371/journal.pbio.3000333] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Revised: 07/02/2019] [Indexed: 01/07/2023] Open
Abstract
Developing new software tools for analysis of large-scale biological data is a key component of advancing modern biomedical research. Scientific reproduction of published findings requires running computational tools on data generated by such studies, yet little attention is presently allocated to the installability and archival stability of computational software tools. Scientific journals require data and code sharing, but none currently require authors to guarantee the continuing functionality of newly published tools. We have estimated the archival stability of computational biology software tools by performing an empirical analysis of the internet presence for 36,702 omics software resources published from 2005 to 2017. We found that almost 28% of all resources are currently not accessible through uniform resource locators (URLs) published in the paper they first appeared in. Among the 98 software tools selected for our installability test, 51% were deemed "easy to install," and 28% of the tools failed to be installed at all because of problems in the implementation. Moreover, for papers introducing new software, we found that the number of citations significantly increased when authors provided an easy installation process. We propose for incorporation into journal policy several practical solutions for increasing the widespread installability and archival stability of published bioinformatics software.
Affiliation(s)
- Serghei Mangul
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, California, United States of America
| | - Thiago Mosqueiro
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, California, United States of America
| | - Richard J. Abdill
- Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Dat Duong
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Keith Mitchell
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Varuni Sarwal
- Indian Institute of Technology Delhi, Hauz Khas, New Delhi, India
| | - Brian Hill
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Jaqueline Brito
- Institute of Mathematics and Computer Science, University of São Paulo, São Paulo, Brazil
| | - Russell Jared Littman
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Benjamin Statz
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Angela Ka-Mei Lam
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
| | - Gargi Dayama
- Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Laura Grieneisen
- Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Lana S. Martin
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, California, United States of America
| | - Jonathan Flint
- Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles, Los Angeles, California, United States of America
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
| | - Ran Blekhman
- Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota, United States of America
- Department of Ecology, Evolution, and Behavior, University of Minnesota, Minnesota, United States of America
| |
|
18
|
Mangul S, Martin LS, Hill BL, Lam AKM, Distler MG, Zelikovsky A, Eskin E, Flint J. Systematic benchmarking of omics computational tools. Nat Commun 2019; 10:1393. [PMID: 30918265 PMCID: PMC6437167 DOI: 10.1038/s41467-019-09406-4] [Citation(s) in RCA: 86] [Impact Index Per Article: 14.3] [Received: 07/23/2018] [Accepted: 03/06/2019] [Indexed: 01/11/2023] Open
Abstract
Computational omics methods packaged as software have become essential to modern biological research. The increasing dependence of scientists on these powerful software tools creates a need for systematic assessment of these methods, known as benchmarking. Adopting a standardized benchmarking practice could help researchers who use omics data to better leverage recent technological innovations. Our review summarizes benchmarking practices from 25 recent studies and discusses the challenges, advantages, and limitations of benchmarking across various domains of biology. We also propose principles that can make computational biology benchmarking studies more sustainable and reproducible, ultimately increasing the transparency of biomedical data and results. Benchmarking studies are important for comprehensively understanding and evaluating different computational omics methods. Here, the authors review practices from 25 recent studies and propose principles to improve the quality of benchmarking studies.
Affiliation(s)
- Serghei Mangul
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E Young Drive East, Los Angeles, CA, 90095, USA
| | - Lana S Martin
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E Young Drive East, Los Angeles, CA, 90095, USA
| | - Brian L Hill
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
| | - Angela Ka-Mei Lam
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
| | - Margaret G Distler
- Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Alex Zelikovsky
- Department of Computer Science, Georgia State University, Atlanta, GA, 30303, USA
- The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
- Department of Human Genetics, University of California Los Angeles, 695 Charles E. Young, Los Angeles, CA, USA
| | - Jonathan Flint
- Department of Psychiatry and Biobehavioral Sciences, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
| |
|
19
|
Mangul S, Martin LS, Eskin E, Blekhman R. Improving the usability and archival stability of bioinformatics software. Genome Biol 2019; 20:47. [PMID: 30813962 PMCID: PMC6391762 DOI: 10.1186/s13059-019-1649-8] [Citation(s) in RCA: 45] [Impact Index Per Article: 7.5] [Indexed: 11/10/2022] Open
Abstract
Implementation of bioinformatics software involves numerous unique challenges; a rigorous standardized approach is needed to examine software tools prior to their publication.
Affiliation(s)
- Serghei Mangul
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E. Young Drive East, Los Angeles, CA, 90095, USA
| | - Lana S Martin
- Institute for Quantitative and Computational Biosciences, University of California Los Angeles, 611 Charles E. Young Drive East, Los Angeles, CA, 90095, USA
| | - Eleazar Eskin
- Department of Computer Science, University of California Los Angeles, 580 Portola Plaza, Los Angeles, CA, 90095, USA
- Department of Human Genetics, University of California Los Angeles, 695 Charles E. Young Drive South, Los Angeles, CA, 90095, USA
| | - Ran Blekhman
- Department of Genetics, Cell Biology and Development, University of Minnesota, 321 Church St SE, Minneapolis, MN, 55455, USA
- Department of Ecology, Evolution, and Behavior, University of Minnesota, 100 Ecology Building, 1987 Upper Buford Cir, Falcon Heights, MN, 55108, USA
| |
|
20
|
Dozmorov MG. GitHub Statistics as a Measure of the Impact of Open-Source Bioinformatics Software. Front Bioeng Biotechnol 2018; 6:198. [PMID: 30619845 PMCID: PMC6306043 DOI: 10.3389/fbioe.2018.00198] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 11/12/2018] [Accepted: 12/04/2018] [Indexed: 11/13/2022] Open
Abstract
Modern research is increasingly data-driven and reliant on bioinformatics software. Publication is a common way of introducing new software, but not all bioinformatics tools get published. Given that there are competing tools, it is important not merely to find the appropriate software, but to have a metric for judging its usefulness. A journal's impact factor has been shown to be a poor predictor of software popularity; consequently, focusing on publications in high-impact journals limits users' choices in finding useful bioinformatics tools. Free and open source software repositories on popular code sharing platforms such as GitHub provide another venue to follow the latest bioinformatics trends. The open source component of GitHub allows users to bookmark and copy repositories that are most useful to them. This Perspective aims to demonstrate the utility of GitHub "stars," "watchers," and "forks" (GitHub statistics) as a measure of software impact. We compiled lists of impactful bioinformatics software and analyzed commonly used impact metrics and GitHub statistics of 50 genomics-oriented bioinformatics tools. We present examples of community-selected best bioinformatics resources and show that GitHub statistics are distinct from the journal's impact factor (JIF), citation counts, and alternative metrics (Altmetrics, CiteScore) in capturing the level of community attention. We suggest the use of GitHub statistics as an unbiased measure of the usability of bioinformatics software complementing the traditional impact metrics.
Affiliation(s)
- Mikhail G. Dozmorov
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, United States
| |
|
21
|
Imker HJ. 25 Years of Molecular Biology Databases: A Study of Proliferation, Impact, and Maintenance. Front Res Metr Anal 2018. [DOI: 10.3389/frma.2018.00018] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 11/13/2022] Open
|
22
|
Wren JD, Georgescu C, Giles CB, Hennessey J. Use it or lose it: citations predict the continued online availability of published bioinformatics resources. Nucleic Acids Res 2017; 45:3627-3633. [PMID: 28334982 PMCID: PMC5397159 DOI: 10.1093/nar/gkx182] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 12/07/2016] [Accepted: 03/08/2017] [Indexed: 11/16/2022] Open
Abstract
Scientific Data Analysis Resources (SDARs) such as bioinformatics programs, web servers and databases are integral to modern science, but previous studies have shown that the Uniform Resource Locators (URLs) linking to them decay in a time-dependent manner, with ∼27% decayed to date. Because SDARs are overrepresented among science's most cited papers over the past 20 years, loss of widely used SDARs could be particularly disruptive to scientific research. We identified URLs in MEDLINE abstracts and used crowdsourcing to identify which reported the creation of SDARs. We used the Internet Archive's Wayback Machine to approximate ‘death dates’ and calculate citations/year over each SDAR's lifespan. At first glance, decayed SDARs did not significantly differ from available SDARs in their average citations per year over their lifespan or journal impact factor (JIF). But the most cited SDARs were 94% likely to be relocated to another URL versus only 34% of uncited ones. Taking relocation into account, we find that citations are the strongest predictors of current online availability after time since publication, and JIF is modestly predictive. This suggests that URL decay is a general, persistent phenomenon affecting all URLs, but the most useful/recognized SDARs are more likely to persist.
Affiliation(s)
- Jonathan D Wren
- Oklahoma Medical Research Foundation, Oklahoma City, Arthritis and Clinical Immunology Research Program, 825 N.E. 13th Street, Oklahoma City, OK 73104-5005, USA
- University of Oklahoma Health Sciences Center, Department of Biochemistry and Molecular Biology, 940 Stanton L. Young Blvd, OK 73104-5005, USA
| | - Constantin Georgescu
- Oklahoma Medical Research Foundation, Oklahoma City, Arthritis and Clinical Immunology Research Program, 825 N.E. 13th Street, Oklahoma City, OK 73104-5005, USA
| | - Cory B Giles
- Oklahoma Medical Research Foundation, Oklahoma City, Arthritis and Clinical Immunology Research Program, 825 N.E. 13th Street, Oklahoma City, OK 73104-5005, USA
| | - Jason Hennessey
- Computer Science Department, Boston University, 111 Cummington Mall, Boston, MA 02215, USA
| |
|