4
Groza T, Köhler S, Doelken S, Collier N, Oellrich A, Smedley D, Couto FM, Baynam G, Zankl A, Robinson PN. Automatic concept recognition using the human phenotype ontology reference and test suite corpora. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2015; 2015:bav005. [PMID: 25725061 PMCID: PMC4343077 DOI: 10.1093/database/bav005] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Indexed: 01/06/2023]
Abstract
Concept recognition tools rely on the availability of textual corpora to assess their performance and enable the identification of areas for improvement. Typically, corpora are developed for specific purposes, such as gene name recognition. Gene and protein name identification are longstanding goals of biomedical text mining, and therefore a number of different corpora exist. However, phenotypes only recently became an entity of interest for specialized concept recognition systems, and hardly any annotated text is available for performance testing and training. Here, we present a unique corpus, capturing text spans from 228 abstracts manually annotated with Human Phenotype Ontology (HPO) concepts and harmonized by three curators, which can be used as a reference standard for free text annotation of human phenotypes. Furthermore, we developed a test suite for standardized concept recognition error analysis, incorporating 32 different types of test cases corresponding to 2164 HPO concepts. Finally, three established phenotype concept recognizers (NCBO Annotator, OBO Annotator and Bio-LarK CR) were comprehensively evaluated, and results are reported against both the text corpus and the test suites. The gold standard and test suite corpora are available from http://bio-lark.org/hpo_res.html. Database URL: http://bio-lark.org/hpo_res.html
Affiliation(s)
- Tudor Groza
- Sebastian Köhler
- Sandra Doelken
- Nigel Collier
- Anika Oellrich
- Damian Smedley
- Francisco M Couto
- Gareth Baynam
- Andreas Zankl
- Peter N Robinson

All authors are listed with the same combined affiliations:
- School of ITEE, The University of Queensland, St. Lucia, QLD 4072, Australia
- Garvan Institute of Medical Research, Darlinghurst, Sydney, NSW 2010, Australia
- Institute for Medical Genetics and Human Genetics, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany
- European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
- National Institute of Informatics, Hitotsubashi, Tokyo, Japan
- Mouse Informatics Group, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK
- LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal
- Genetic Services of Western Australia, King Edward Memorial Hospital, WA 6008, Australia
- School of Paediatrics and Child Health, University of Western Australia, WA 6008, Australia
- Institute for Immunology and Infectious Diseases, Murdoch University, WA 6150, Australia
- Office of Population Health, Public Health and Clinical Services Division, Western Australian Department of Health, WA 6004, Australia
- Academic Department of Medical Genetics, Sydney Children's Hospitals Network (Westmead), NSW 2145, Australia
- Discipline of Genetic Medicine, Sydney Medical School, The University of Sydney, NSW 2006, Australia
- Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
- Institute for Bioinformatics, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany
- Berlin Brandenburg Center for Regenerative Therapies, 13353 Berlin, Germany