1
Marušić A, Malički M, von Elm E. Editorial research and the publication process in biomedicine and health: Report from the Esteve Foundation Discussion Group, December 2012. Biochem Med (Zagreb) 2014; 24:211-6. [PMID: 24969914 PMCID: PMC4083572 DOI: 10.11613/bm.2014.023] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/26/2014] [Accepted: 05/19/2014] [Indexed: 11/24/2022] Open
Abstract
Despite the fact that there are more than twenty thousand biomedical journals in the world, research into the work of editors and the publication process in biomedical and health care journals is rare. In December 2012, the Esteve Foundation, a non-profit scientific institution that fosters progress in pharmacotherapy by means of scientific communication and discussion, organized a discussion group of 7 editors and/or experts in peer review and biomedical publishing. They presented findings of past editorial research, discussed the lack of competitive funding schemes and specialized journals for dissemination of editorial research, and reported on the great diversity of misconduct and conflict of interest policies, as well as adherence to reporting guidelines. Furthermore, they reported on the reluctance of editors to investigate allegations of misconduct or increase the level of data sharing in health research. In the end, they concluded that if editors are to remain gatekeepers of scientific knowledge, they should reaffirm their focus on the integrity of the scientific record and completeness of the data they publish. Additionally, more research should be undertaken to understand why many journals are not adhering to editorial standards, and what obstacles editors face when engaging in editorial research.
Affiliation(s)
- Ana Marušić
- Department of Research in Biomedicine and Health, University of Split School of Medicine, Split, Croatia
2
He Y. Ontology-supported research on vaccine efficacy, safety and integrative biological networks. Expert Rev Vaccines 2014; 13:825-41. [PMID: 24909153 DOI: 10.1586/14760584.2014.923762] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 12/23/2022]
Abstract
While vaccine efficacy and safety research has dramatically progressed with the methods of in silico prediction and data mining, many challenges still exist. A formal ontology is a human- and computer-interpretable set of terms and relations that represent entities in a specific domain and how these terms relate to each other. Several community-based ontologies (including Vaccine Ontology, Ontology of Adverse Events and Ontology of Vaccine Adverse Events) have been developed to support vaccine and adverse event representation, classification, data integration, literature mining of host-vaccine interaction networks, and analysis of vaccine adverse events. The author further proposes minimal vaccine information standards and their ontology representations, ontology-based linked open vaccine data and meta-analysis, an integrative One Network ('OneNet') Theory of Life, and ontology-based approaches to study and apply the OneNet theory. In the Big Data era, these proposed strategies provide a novel framework for advanced data integration and analysis of fundamental biological networks including vaccine immune mechanisms.
Affiliation(s)
- Yongqun He
- Unit for Laboratory Animal Medicine, University of Michigan Medical School, Ann Arbor, MI 48109, USA
3
Cheung WA, Ouellette BFF, Wasserman WW. Quantitative biomedical annotation using medical subject heading over-representation profiles (MeSHOPs). BMC Bioinformatics 2012; 13:249. [PMID: 23017167 PMCID: PMC3564935 DOI: 10.1186/1471-2105-13-249] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 02/23/2012] [Accepted: 09/24/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND MEDLINE®/PubMed® indexes over 20 million biomedical articles, providing curated annotation of its contents using a controlled vocabulary known as Medical Subject Headings (MeSH). The MeSH vocabulary, developed over 50+ years, provides broad coverage of topics across biomedical research. Distilling the essential biomedical themes for a topic of interest from the relevant literature is important both to understand the importance of related concepts and to discover new relationships. RESULTS We introduce a novel method for determining enriched curator-assigned MeSH annotations in a set of papers associated with a topic, such as a gene, an author or a disease. We generate MeSH Over-representation Profiles (MeSHOPs) to quantitatively summarize the annotations in a form convenient for further computational analysis and visualization. Based on a hypergeometric distribution of assigned terms, MeSHOPs statistically account for the prevalence of the associated biomedical annotation while highlighting unusually prevalent terms based on a specified background. MeSHOPs can be visualized using word clouds, providing a succinct quantitative graphical representation of the relative importance of terms. Using the publication dates of articles, MeSHOPs track changing patterns of annotation over time. Since MeSHOPs are quantitative vectors, they can be compared using standard techniques such as hierarchical clustering. The reliability of MeSHOP annotations is assessed based on the capacity to re-derive the subset of the Gene Ontology annotations with equivalent MeSH terms. CONCLUSIONS MeSHOPs allow quantitative measurement of the degree of association between any entity and the annotated medical concepts, based directly on relevant primary literature. Comparison of MeSHOPs allows entities to be related based on shared medical themes in their literature. A web interface is provided for generating and visualizing MeSHOPs.
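The abstract above says that over-representation is scored with a hypergeometric distribution against a specified background. The following is a minimal sketch of a one-sided hypergeometric tail probability for a single MeSH term, not the authors' implementation. The function name, parameter names, and the example counts are all assumptions made for illustration.

```python
from math import comb


def overrepresentation_p_value(k, n, K, N):
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    Hypothetical parameter meanings (not taken from the paper):
      N: number of articles in the background set
      K: background articles annotated with the term
      n: articles associated with the topic of interest
      k: topic articles annotated with the term
    """
    upper = min(n, K)  # the term cannot appear more often than this
    total = comb(N, n)  # number of ways to draw n articles from N
    # Sum the probability mass for k, k + 1, ..., upper annotated articles.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Made-up counts: 10 background articles, 3 carry the term,
# 4 articles belong to the topic, and 3 of those carry the term.
p = overrepresentation_p_value(k=3, n=4, K=3, N=10)
print(f"p = {p:.4f}")
```

A small p-value suggests the term is more prevalent in the topic's articles than the background alone would explain. A full profile would repeat this for every term and would typically apply a multiple-testing correction, which the sketch omits.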
Affiliation(s)
- Warren A Cheung
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
- Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada
- BF Francis Ouellette
- Ontario Institute for Cancer Research, Toronto, ON, Canada
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, Canada
- Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
4
Huang M, Névéol A, Lu Z. Recommending MeSH terms for annotating biomedical articles. J Am Med Inform Assoc 2011; 18:660-7. [PMID: 21613640 PMCID: PMC3168302 DOI: 10.1136/amiajnl-2010-000055] [Citation(s) in RCA: 73] [Impact Index Per Article: 5.2] [Received: 12/21/2010] [Accepted: 04/08/2011] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND Due to the high cost of manual curation of key aspects from the scientific literature, automated methods for assisting this process are greatly desired. Here, we report a novel approach to facilitate MeSH indexing, a challenging task of assigning MeSH terms to MEDLINE citations for their archiving and retrieval. METHODS Unlike previous methods for automatic MeSH term assignment, we reformulate the indexing task as a ranking problem such that relevant MeSH headings are ranked higher than irrelevant ones. Specifically, for each document we retrieve 20 neighbor documents, obtain a list of MeSH main headings from neighbors, and rank the MeSH main headings using ListNet, a learning-to-rank algorithm. We trained our algorithm on 200 documents and tested on a previously used benchmark set of 200 documents and a larger dataset of 1000 documents. RESULTS Tested on the benchmark dataset, our method achieved a precision of 0.390, recall of 0.712, and mean average precision (MAP) of 0.626. In comparison to the state of the art, we observe statistically significant improvements as large as 39% in MAP (p-value <0.001). Similar significant improvements were also obtained on the larger document set. CONCLUSION Experimental results show that our approach makes the most accurate MeSH predictions to date, which suggests its great potential in making a practical impact on MeSH indexing. Furthermore, as discussed, the proposed learning framework is robust and can be adapted to many other similar tasks beyond MeSH indexing in the biomedical domain. All data sets are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/indexing.
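The abstract above reports mean average precision (MAP) for ranked MeSH headings. The sketch below shows one common way to compute that metric; the abstract does not spell out the exact variant used, so this definition is an assumption. The function names and the example headings are hypothetical placeholders, not data from the study.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Made-up rankings for two documents (headings are placeholders).
runs = [
    (["Humans", "Neoplasms", "Mice", "Algorithms"], {"Humans", "Mice"}),
    (["Algorithms", "Humans"], {"Humans"}),
]
print(mean_average_precision(runs))
```

Under this definition, a relevant heading that never appears in the ranked list contributes zero, so a system is penalized both for ranking relevant headings low and for missing them entirely.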
Affiliation(s)
- Minlie Huang
- State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, PR China
- National Center for Biotechnology Information (NCBI), US National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Aurélie Névéol
- National Center for Biotechnology Information (NCBI), US National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), US National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
5
He X, Sarma MS, Ling X, Chee B, Zhai C, Schatz B. Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model. BMC Bioinformatics 2010; 11:272. [PMID: 20487560 PMCID: PMC2885378 DOI: 10.1186/1471-2105-11-272] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 12/08/2009] [Accepted: 05/20/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Large-scale genomic studies often identify large gene lists, for example, the genes sharing the same expression patterns. The interpretation of these gene lists is generally achieved by extracting concepts overrepresented in the gene lists. This analysis often depends on manual annotation of genes based on controlled vocabularies, in particular, Gene Ontology (GO). However, the annotation of genes is a labor-intensive process, and the vocabularies are generally incomplete, leaving some important biological domains inadequately covered. RESULTS We propose a statistical method that uses the primary literature, i.e. free text, as the source to perform overrepresentation analysis. The method is based on a mixture-model statistical framework and addresses the methodological flaws in several existing programs. We implemented this method within a literature mining system, BeeSpace, taking advantage of its analysis environment and added features that facilitate the interactive analysis of gene sets. Through experimentation with several datasets, we showed that our program can effectively summarize the important conceptual themes of large gene sets, even when traditional GO-based analysis does not yield informative results. CONCLUSIONS We conclude that the current work will provide biologists with a tool that effectively complements the existing ones for overrepresentation analysis from genomic experiments. Our program, Genelist Analyzer, is freely available at: http://workerbee.igb.uiuc.edu:8080/BeeSpace/Search.jsp.
Affiliation(s)
- Xin He
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
6
Zheng HT, Borchert C, Jiang Y. A knowledge-driven approach to biomedical document conceptualization. Artif Intell Med 2010; 49:67-78. [PMID: 20371168 DOI: 10.1016/j.artmed.2010.02.005] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 03/03/2009] [Revised: 01/19/2010] [Accepted: 02/22/2010] [Indexed: 10/19/2022]
Abstract
OBJECTIVE Biomedical document conceptualization is the process of clustering biomedical documents based on ontology-represented domain knowledge. The result of this process is the representation of the biomedical documents by a set of key concepts and their relationships. Most clustering methods cluster documents based on invariant domain knowledge. The objective of this work is to develop an effective method to cluster biomedical documents based on various user-specified ontologies, so that users can exploit the concept structures of documents more effectively. METHODS We develop a flexible framework to allow users to specify the knowledge bases, in the form of ontologies. Based on the user-specified ontologies, we develop a key concept induction algorithm, which uses latent semantic analysis to identify key concepts and cluster documents. A corpus-related ontology generation algorithm is developed to generate the concept structures of documents. RESULTS Based on two biomedical datasets, we evaluate the proposed method and five other clustering algorithms. The clustering results of the proposed method outperform the five other algorithms in terms of key concept identification. On the first biomedical dataset, our method achieves F-measure values of 0.7294 and 0.5294 based on the MeSH ontology and gene ontology (GO), respectively. On the second biomedical dataset, our method achieves F-measure values of 0.6751 and 0.6746 based on the MeSH ontology and GO, respectively. Both results outperform the five other algorithms in terms of F-measure. Based on the MeSH ontology and GO, the generated corpus-related ontologies show informative conceptual structures. CONCLUSIONS The proposed method enables users to specify the domain knowledge to exploit the conceptual structures of biomedical document collections. In addition, the proposed method is able to extract the key concepts and cluster the documents with a relatively high precision.
Affiliation(s)
- Hai-Tao Zheng
- Tsinghua-Southampton Web Science Laboratory at Shenzhen, Graduate School at Shenzhen, Tsinghua University, Shenzhen, China.
7
Jani SD, Argraves GL, Barth JL, Argraves WS. GeneMesh: a web-based microarray analysis tool for relating differentially expressed genes to MeSH terms. BMC Bioinformatics 2010; 11:166. [PMID: 20359363 PMCID: PMC3212930 DOI: 10.1186/1471-2105-11-166] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 12/23/2009] [Accepted: 04/01/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND An important objective of DNA microarray-based gene expression experimentation is determining inter-relationships that exist between differentially expressed genes and biological processes, molecular functions, cellular components, signaling pathways, physiologic processes and diseases. RESULTS Here we describe GeneMesh, a web-based program that facilitates analysis of DNA microarray gene expression data. GeneMesh relates genes in a query set to categories available in the Medical Subject Headings (MeSH) hierarchical index. The interface enables hypothesis-driven relational analysis to a specific MeSH subcategory (e.g., Cardiovascular System, Genetic Processes, Immune System Diseases) or unbiased relational analysis to broader MeSH categories (e.g., Anatomy, Biological Sciences, Disease). Genes found associated with a given MeSH category are dynamically linked to facilitate tabular and graphical depiction of Entrez Gene information, Gene Ontology information, KEGG metabolic pathway diagrams and intermolecular interaction information. Expression intensity values of groups of genes that cluster in relation to a given MeSH category, gene ontology or pathway can be displayed as heat maps of Z score-normalized values. GeneMesh operates on gene expression data derived from a number of commercial microarray platforms including Affymetrix, Agilent and Illumina. CONCLUSIONS GeneMesh is a versatile web-based tool for testing and developing new hypotheses through relating genes in a query set (e.g., differentially expressed genes from a DNA microarray experiment) to descriptors making up the hierarchical structure of the National Library of Medicine controlled vocabulary thesaurus, MeSH. The system further enhances the discovery process by providing links between sets of genes associated with a given MeSH category and a rich set of HTML-linked tabular and graphic information including Entrez Gene summaries, gene ontologies, intermolecular interactions, overlays of genes onto KEGG pathway diagrams and heat maps of expression intensity values. GeneMesh is freely available online at http://proteogenomics.musc.edu/genemesh/.
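The abstract above mentions heat maps of Z score-normalized expression values. As a rough illustration of what such normalization involves, the sketch below standardizes one gene's intensities across samples. Whether GeneMesh normalizes per gene, and whether it uses the population or sample standard deviation, is not stated in the abstract; both choices here are assumptions, as are the function name and the example values.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A flat profile has no variation to scale; return all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up expression intensities for one gene across three samples.
print(z_score_normalize([1.0, 2.0, 3.0]))
```

Normalizing each gene separately puts genes with very different absolute intensities on a common scale, so a heat map shows relative up- and down-regulation rather than raw signal strength.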
Affiliation(s)
- Saurin D Jani
- Department of Regenerative Medicine and Cell Biology, Medical University of South Carolina, Charleston, SC 29425, USA
8
Piwowar HA, Chapman WW. Recall and bias of retrieving gene expression microarray datasets through PubMed identifiers. Journal of Biomedical Discovery and Collaboration 2010; 5:7-20. [PMID: 20349403 PMCID: PMC2990274] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2009] [Revised: 03/08/2010] [Accepted: 12/04/2009] [Indexed: 11/12/2022]
Abstract
BACKGROUND The ability to locate publicly available gene expression microarray datasets effectively and efficiently facilitates the reuse of these potentially valuable resources. Centralized biomedical databases allow users to query dataset metadata descriptions, but these annotations are often too sparse and diverse to allow complex and accurate queries. In this study we examined the ability of PubMed article identifiers to locate publicly available gene expression microarray datasets, and investigated whether the retrieved datasets were representative of publicly available datasets found through statements of data sharing in the associated research articles. RESULTS In a recent article, Ochsner and colleagues identified 397 studies that had generated gene expression microarray data. Their search of the full text of each publication for statements of data sharing revealed 203 publicly available datasets, including 179 in the Gene Expression Omnibus (GEO) or ArrayExpress databases. Our scripted search of GEO and ArrayExpress for PubMed identifiers of the same 397 studies returned 160 datasets, including six not found by the original search for data sharing statements. As a proportion of datasets found by either method, the search for data sharing statements identified 91.4% of the 209 publicly available datasets, compared to only 76.6% found by our search carried out using PubMed identifiers. Searching GEO or ArrayExpress alone retrieved 63.2% and 46.9% of all available datasets, respectively. There was no difference in the type of datasets found by PubMed identifier searches in terms of research theme or the technology used. However, the studies identified were more likely to have larger sample sizes, were more frequently cited, and were published in higher-impact journals. CONCLUSIONS Searching database entries using PubMed identifiers can identify the majority of publicly available datasets, but caution is required when this method is used to collect data for policy evaluation since studies in low-impact journals are disproportionately excluded. We urge authors of all datasets to complete the citation fields for their dataset submissions once publication details are known, thereby ensuring their work has maximum visibility and can contribute to subsequent studies.
Affiliation(s)
- Heather A Piwowar
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA USA
- Wendy W Chapman
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA USA
9
Zheng HT, Borchert C, Kim HG. GOClonto: an ontological clustering approach for conceptualizing PubMed abstracts. J Biomed Inform 2009; 43:31-40. [PMID: 19635585 DOI: 10.1016/j.jbi.2009.07.006] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 10/31/2008] [Revised: 05/21/2009] [Accepted: 07/20/2009] [Indexed: 10/20/2022]
Abstract
Concurrent with progress in biomedical sciences, an overwhelming amount of textual knowledge is accumulating in the biomedical literature. PubMed is the most comprehensive database collecting and managing biomedical literature. To help researchers easily understand collections of PubMed abstracts, numerous clustering methods have been proposed to group similar abstracts based on their shared features. However, most of these methods do not explore the semantic relationships among groupings of documents, which could help better illuminate the groupings of PubMed abstracts. To address this issue, we propose an ontological clustering method called GOClonto for conceptualizing PubMed abstracts. GOClonto uses latent semantic analysis (LSA) and gene ontology (GO) to identify key gene-related concepts and their relationships, and to allocate PubMed abstracts based on these key gene-related concepts. Based on two PubMed abstract collections, the experimental results show that GOClonto is able to identify key gene-related concepts and outperforms the STC (suffix tree clustering) algorithm, the Lingo algorithm, the Fuzzy Ants algorithm, and the clustering-based TRS (tolerance rough set) algorithm. Moreover, the two ontologies generated by GOClonto show significant informative conceptual structures.
Affiliation(s)
- Hai-Tao Zheng
- Biomedical Knowledge Engineering Laboratory, BK21 College of Dentistry, Seoul National University, 28 Yeongeon-dong, Jongro-gu, Seoul 110-749, Republic of Korea
10
Abstract
The promise of the genome project was that a complete sequence would provide us with information that would transform biology and medicine. But the 'parts list' that has emerged from the genome project is far from the 'wiring diagram' and 'circuit logic' we need to understand the link between genotype, environment and phenotype. While genomic technologies such as DNA microarrays, proteomics and metabolomics have given us new tools and new sources of data to address these problems, a number of crucial elements remain to be addressed before we can begin to close the loop and develop the predictive quantitative biology that is the stated goal of so much of current biological research, including systems biology. Our approach to this problem has largely been one of integration, bringing together a vast wealth of information to better interpret the experimental data we are generating in genomic assays and creating publicly available databases and software tools to facilitate the work of others. Recently, we have used a similar approach to try to understand the biological networks that underlie the phenotypic responses we observe, starting us on the road to developing a predictive biology.
Affiliation(s)
- John Quackenbush
- Department of Biostatistics and Computational Biology and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, USA.
11
Bresell A, Servenius B, Persson B. Ontology annotation treebrowser: an interactive tool where the complementarity of medical subject headings and gene ontology improves the interpretation of gene lists. ACTA ACUST UNITED AC 2007; 5:225-36. [PMID: 17140269 DOI: 10.2165/00822942-200605040-00005] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 11/02/2022]
Abstract
Gene expression and proteomics analysis allow the investigation of thousands of biomolecules in parallel. This results in a long list of interesting genes or proteins and a list of annotation terms in the order of thousands. It is not a trivial task to understand such a gene list and it would require extensive efforts to bring together the overwhelming amounts of associated information from the literature and databases. Thus, it is evident that we need ways of condensing and filtering this information. An excellent way to represent knowledge is to use ontologies, where it is possible to group genes or terms with overlapping context, rather than studying one-dimensional lists of keywords. Therefore, we have built the ontology annotation treebrowser (OAT) to represent, condense, filter and summarise the knowledge associated with a list of genes or proteins. The OAT system consists of two separate parts: a MySQL database named OATdb, and a treebrowser engine that is implemented as a web interface. The OAT system is implemented using Perl scripts on an Apache web server and the gene, ontology and annotation data is stored in a relational MySQL database. In OAT, we have harmonized the two ontologies of medical subject headings (MeSH) and gene ontology (GO), to enable us to use knowledge both from the literature and the annotation projects in the same tool. OAT includes multiple gene identifier sets, which are merged internally in the OAT database. We have also generated novel MeSH annotations by mapping accession numbers to MEDLINE entries. The ontology browser OAT was created to facilitate the analysis of gene lists. It can be browsed dynamically, so that a scientist can interact with the data and govern the outcome. Test statistics show which branches are enriched. We also show that the two ontologies complement each other, with surprisingly low overlap, by mapping annotations to the Unified Medical Language System. We have developed a novel interactive annotation browser that is the first to incorporate both MeSH and GO for improved interpretation of gene lists. With OAT, we illustrate the benefits of combining MeSH and GO for understanding gene lists. OAT is available as a public web service at: http://www.ifm.liu.se/bioinfo/oat.
Affiliation(s)
- Anders Bresell
- IFM Bioinformatics, Linköping University, Linköping, Sweden.
12
Abstract
Technologies that have emerged from the genome project have dramatically increased our ability to generate data on the way in which organisms respond to their environments, how they execute their programmes of development and growth, and how these are altered in the development of disease states. However, our ability to analyse these large datasets has not kept pace with our ability to generate them and consequently new strategies must be developed to address the issues associated with their analysis. One approach that we have employed quite successfully is to look at data from microarrays (or proteomics or metabolomics experiments) not as independent datasets, but rather as elements of a much larger body of biological information across various scales that must be integrated with, and interpreted within, the context of such ancillary data. Here we outline the general approach and provide three examples from published studies of the way in which we have applied this strategy.
Affiliation(s)
- J Quackenbush
- Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, MA 02115, USA.