1
Pérez-Pérez M, Pérez-Rodríguez G, Blanco-Míguez A, Fdez-Riverola F, Valencia A, Krallinger M, Lourenço A. Next generation community assessment of biomedical entity recognition web servers: metrics, performance, interoperability aspects of BeCalm. J Cheminform 2019; 11:42. [PMID: 31236786 PMCID: PMC6591930 DOI: 10.1186/s13321-019-0363-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 01/09/2019] [Accepted: 06/09/2019] [Indexed: 11/23/2022] Open
Abstract
Background: Shared tasks and community challenges represent key instruments to promote research, collaboration and determine the state of the art of biomedical and chemical text mining technologies. Traditionally, such tasks relied on the comparison of automatically generated results against a so-called Gold Standard dataset of manually labelled textual data, regardless of efficiency and robustness of the underlying implementations. Due to the rapid growth of unstructured data collections, including patent databases and particularly the scientific literature, there is a pressing need to generate, assess and expose robust big data text mining solutions to semantically enrich documents in real time. To address this pressing need, a novel track called “Technical interoperability and performance of annotation servers” was launched under the umbrella of the BioCreative text mining evaluation effort. The aim of this track was to enable the continuous assessment of technical aspects of text annotation web servers, specifically of online biomedical named entity recognition systems of interest for medicinal chemistry applications. Results: A total of 15 out of 26 registered teams successfully implemented online annotation servers. They returned predictions during a two-month period in predefined formats and were evaluated through the BeCalm evaluation platform, specifically developed for this track. The track encompassed three levels of evaluation, i.e. data format considerations, technical metrics and functional specifications. Participating annotation servers were implemented in seven different programming languages and covered 12 general entity types. The continuous evaluation of server responses accounted for testing periods of low activity and moderate to high activity, encompassing overall 4,092,502 requests from three different document provider settings. The median response time was below 3.74 s, with a median of 10 annotations/document. Most of the servers showed great reliability and stability, being able to process over 100,000 requests in a 5-day period. Conclusions: The presented track was a novel experimental task that systematically evaluated the technical performance aspects of online entity recognition systems. It raised the interest of a significant number of participants. Future editions of the competition will address the ability to process documents in bulk as well as to annotate full-text documents. Electronic supplementary material: The online version of this article (10.1186/s13321-019-0363-6) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Martin Pérez-Pérez
  - Department of Computer Science, ESEI, University of Vigo, Campus As Lagoas, 32004, Ourense, Spain
  - The Biomedical Research Centre (CINBIO), Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
  - SING Research Group, Galicia Sur Health Research Institute (ISS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
- Gael Pérez-Rodríguez
  - Department of Computer Science, ESEI, University of Vigo, Campus As Lagoas, 32004, Ourense, Spain
  - The Biomedical Research Centre (CINBIO), Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
  - SING Research Group, Galicia Sur Health Research Institute (ISS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
- Aitor Blanco-Míguez
  - Department of Computer Science, ESEI, University of Vigo, Campus As Lagoas, 32004, Ourense, Spain
  - The Biomedical Research Centre (CINBIO), Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
  - SING Research Group, Galicia Sur Health Research Institute (ISS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
  - Department of Microbiology and Biochemistry of Dairy Products, Instituto de Productos Lácteos de Asturias (IPLA), Consejo Superior de Investigaciones Científicas (CSIC), Paseo Río Linares S/N 33300, Villaviciosa, Asturias, Spain
- Florentino Fdez-Riverola
  - Department of Computer Science, ESEI, University of Vigo, Campus As Lagoas, 32004, Ourense, Spain
  - The Biomedical Research Centre (CINBIO), Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
  - SING Research Group, Galicia Sur Health Research Institute (ISS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
- Alfonso Valencia
  - Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona 29-31, 08034, Barcelona, Spain
  - Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/Baldiri Reixac 10, 08028, Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, 08010, Barcelona, Spain
  - Spanish Bioinformatics Institute INB-ISCIII ES-ELIXIR, 28029, Madrid, Spain
- Martin Krallinger
  - Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona 29-31, 08034, Barcelona, Spain
  - Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/Baldiri Reixac 10, 08028, Barcelona, Spain
  - Biological Text Mining Unit, Structural Biology and Biocomputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, 28029, Madrid, Spain
- Anália Lourenço
  - Department of Computer Science, ESEI, University of Vigo, Campus As Lagoas, 32004, Ourense, Spain
  - The Biomedical Research Centre (CINBIO), Campus Universitario Lagoas-Marcosende, 36310, Vigo, Spain
  - SING Research Group, Galicia Sur Health Research Institute (ISS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
  - Centre of Biological Engineering (CEB), University of Minho, Campus de Gualtar, 4710-057, Braga, Portugal
3
Talikka M, Bukharov N, Hayes WS, Hofmann-Apitius M, Alexopoulos L, Peitsch MC, Hoeng J. Novel approaches to develop community-built biological network models for potential drug discovery. Expert Opin Drug Discov 2017; 12:849-857. [PMID: 28585481 DOI: 10.1080/17460441.2017.1335302] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 12/30/2022]
Abstract
Introduction: Hundreds of thousands of data points are now routinely generated in clinical trials by molecular profiling and NGS technologies. A true translation of this data into knowledge is not possible without analysis and interpretation in a well-defined biology context. Currently, there are many public and commercial pathway tools and network models that can facilitate such analysis. At the same time, insights and knowledge that can be gained is highly dependent on the underlying biological content of these resources. Crowdsourcing can be employed to guarantee the accuracy and transparency of the biological content underlining the tools used to interpret rich molecular data. Areas covered: In this review, the authors describe crowdsourcing in drug discovery. The focal point is the efforts that have successfully used the crowdsourcing approach to verify and augment pathway tools and biological network models. Technologies that enable the building of biological networks with the community are also described. Expert opinion: A crowd of experts can be leveraged for the entire development process of biological network models, from ontologies to the evaluation of their mechanistic completeness. The ultimate goal is to facilitate biomarker discovery and personalized medicine by mechanistically explaining patients' differences with respect to disease prevention, diagnosis, and therapy outcome.
Affiliation(s)
- Marja Talikka
  - Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Natalia Bukharov
  - Translational Data Management Services, Clarivate Analytics (Formerly the IP & Science Business of Thomson Reuters), Boston, MA, USA
- William S Hayes
  - Data Sciences, Applied Dynamic Solutions, LLC, Rahway, NJ, USA
- Martin Hofmann-Apitius
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany
- Leonidas Alexopoulos
  - Systems Bioengineering Lab, National Technical University of Athens, Zografou, Greece
  - Protavio Ltd, Stevenage, UK
- Manuel C Peitsch
  - Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
- Julia Hoeng
  - Philip Morris International R&D, Philip Morris Products S.A., Neuchâtel, Switzerland
4
Krallinger M, Rabal O, Lourenço A, Oyarzabal J, Valencia A. Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev 2017; 117:7673-7761. [PMID: 28475312 DOI: 10.1021/acs.chemrev.6b00851] [Citation(s) in RCA: 129] [Impact Index Per Article: 16.1] [Indexed: 01/03/2023]
Abstract
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Affiliation(s)
- Martin Krallinger
  - Structural Computational Biology Group, Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre, C/Melchor Fernández Almagro 3, Madrid E-28029, Spain
- Obdulia Rabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Anália Lourenço
  - ESEI - Department of Computer Science, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, Ourense E-32004, Spain
  - Centro de Investigaciones Biomédicas (Centro Singular de Investigación de Galicia), Campus Universitario Lagoas-Marcosende, Vigo E-36310, Spain
  - CEB-Centre of Biological Engineering, University of Minho, Campus de Gualtar, Braga 4710-057, Portugal
- Julen Oyarzabal
  - Small Molecule Discovery Platform, Molecular Therapeutics Program, Center for Applied Medical Research (CIMA), University of Navarra, Avenida Pio XII 55, Pamplona E-31008, Spain
- Alfonso Valencia
  - Life Science Department, Barcelona Supercomputing Centre (BSC-CNS), C/Jordi Girona, 29-31, Barcelona E-08034, Spain
  - Joint BSC-IRB-CRG Program in Computational Biology, Parc Científic de Barcelona, C/ Baldiri Reixac 10, Barcelona E-08028, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig de Lluís Companys 23, Barcelona E-08010, Spain
5
Mukherjee S, Stamatis D, Bertsch J, Ovchinnikova G, Verezemska O, Isbandi M, Thomas AD, Ali R, Sharma K, Kyrpides NC, Reddy TBK. Genomes OnLine Database (GOLD) v.6: data updates and feature enhancements. Nucleic Acids Res 2017; 45:D446-D456. [PMID: 27794040 PMCID: PMC5210664 DOI: 10.1093/nar/gkw992] [Citation(s) in RCA: 111] [Impact Index Per Article: 13.9] [Received: 09/20/2016] [Revised: 10/11/2016] [Accepted: 10/19/2016] [Indexed: 01/28/2023] Open
Abstract
The Genomes Online Database (GOLD) (https://gold.jgi.doe.gov) is a manually curated data management system that catalogs sequencing projects with associated metadata from around the world. In the current version of GOLD (v.6), all projects are organized based on a four level classification system in the form of a Study, Organism (for isolates) or Biosample (for environmental samples), Sequencing Project and Analysis Project. Currently, GOLD provides information for 26 117 Studies, 239 100 Organisms, 15 887 Biosamples, 97 212 Sequencing Projects and 78 579 Analysis Projects. These are integrated with over 312 metadata fields from which 58 are controlled vocabularies with 2067 terms. The web interface facilitates submission of a diverse range of Sequencing Projects (such as isolate genome, single-cell genome, metagenome, metatranscriptome) and complex Analysis Projects (such as genome from metagenome, or combined assembly from multiple Sequencing Projects). GOLD provides a seamless interface with the Integrated Microbial Genomes (IMG) system and supports and promotes the Genomic Standards Consortium (GSC) Minimum Information standards. This paper describes the data updates and additional features added during the last two years.
Affiliation(s)
- Supratim Mukherjee
  - Prokaryotic Super Program, DOE Joint Genome Institute, Walnut Creek, 94598 CA, USA
- Dimitri Stamatis
  - Prokaryotic Super Program, DOE Joint Genome Institute, Walnut Creek, 94598 CA, USA
- Jon Bertsch
  - Prokaryotic Super Program, DOE Joint Genome Institute, Walnut Creek, 94598 CA, USA
- Galina Ovchinnikova
  - Prokaryotic Super Program, DOE Joint Genome Institute, Walnut Creek, 94598 CA, USA
- Olena Verezemska
  - Prokaryotic Super Program, DOE Joint Genome Institute, Walnut Creek, 94598 CA, USA
- Michelle Isbandi
  - Prokaryotic Super Program, DOE Joint Genome Institute, Walnut Creek, 94598 CA, USA
- Alex D Thomas
  - Prokaryotic Super Program, DOE Joint Genome Institute, Walnut Creek, 94598 CA, USA
- Rida Ali
  - Prokaryotic Super Program, DOE Joint Genome Institute, Walnut Creek, 94598 CA, USA
- Kaushal Sharma
  - Prokaryotic Super Program, DOE Joint Genome Institute, Walnut Creek, 94598 CA, USA
- Nikos C Kyrpides
  - Prokaryotic Super Program, DOE Joint Genome Institute, Walnut Creek, 94598 CA, USA
  - Department of Biological Sciences, Faculty of Science, King Abdulaziz University, Jeddah, Saudi Arabia
- T B K Reddy
  - Prokaryotic Super Program, DOE Joint Genome Institute, Walnut Creek, 94598 CA, USA