1
Farrell MJ, Le Guillarme N, Brierley L, Hunter B, Scheepens D, Willoughby A, Yates A, Mideo N. The changing landscape of text mining: a review of approaches for ecology and evolution. Proc Biol Sci 2024; 291:20240423. [PMID: 39082244 PMCID: PMC11289731 DOI: 10.1098/rspb.2024.0423] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Revised: 06/20/2024] [Accepted: 06/20/2024] [Indexed: 08/02/2024] Open
Abstract
In ecology and evolutionary biology, the synthesis and modelling of data from published literature are commonly used to generate insights and test theories across systems. However, the tasks of searching, screening, and extracting data from literature are often arduous. Researchers may manually process hundreds to thousands of articles for systematic reviews, meta-analyses, and compiling synthetic datasets. As relevant articles expand to tens or hundreds of thousands, computer-based approaches can increase the efficiency, transparency and reproducibility of literature-based research. Methods available for text mining are rapidly changing owing to developments in machine learning-based language models. We review the growing landscape of approaches, mapping them onto three broad paradigms (frequency-based approaches, traditional Natural Language Processing and deep learning-based language models). This serves as an entry point to learn foundational and cutting-edge concepts, vocabularies, and methods to foster integration of these tools into ecological and evolutionary research. We cover approaches for modelling ecological texts, generating training data, developing custom models and interacting with large language models and discuss challenges and possible solutions to implementing these methods in ecology and evolution.
Affiliation(s)
- Maxwell J. Farrell
  - Department of Ecology & Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
  - School of Biodiversity, One Health & Veterinary Medicine, University of Glasgow, Glasgow, UK
  - MRC-University of Glasgow Centre for Virus Research, Glasgow, UK
- Nicolas Le Guillarme
  - Université Grenoble Alpes, CNRS, LECA, Laboratoire d'Ecologie Alpine, Grenoble, France
- Liam Brierley
  - MRC-University of Glasgow Centre for Virus Research, Glasgow, UK
  - Department of Health Data Science, University of Liverpool, Liverpool, UK
- Bronwen Hunter
  - School of Life Sciences, University of Sussex, Brighton, UK
- Daan Scheepens
  - Division of Biosciences, University College London, London, UK
- Andrew Yates
  - Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
- Nicole Mideo
  - Department of Ecology & Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
2
Gabud R, Lapitan P, Mariano V, Mendoza E, Pampolina N, Clariño MAA, Batista-Navarro R. Unsupervised literature mining approaches for extracting relationships pertaining to habitats and reproductive conditions of plant species. Front Artif Intell 2024; 7:1371411. [PMID: 38845683 PMCID: PMC11153722 DOI: 10.3389/frai.2024.1371411] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Accepted: 05/10/2024] [Indexed: 06/09/2024] Open
Abstract
Introduction: Fine-grained, descriptive information on habitats and reproductive conditions of plant species is crucial in forest restoration and rehabilitation efforts. Precise timing of fruit collection and knowledge of species' habitat preferences and reproductive status are necessary, especially for tropical plant species that have short-lived recalcitrant seeds, and those that exhibit complex reproductive patterns, e.g., species with supra-annual mass flowering events that may occur at irregular intervals. Understanding plant regeneration when planning for effective reforestation can be aided by providing access to structured information, e.g., in knowledge bases, that spans years if not decades and covers a wide range of geographic locations. The content of such a resource can be enriched with literature-derived information on species' time-sensitive reproductive conditions and location-specific habitats.
Methods: We sought to develop unsupervised approaches to extract relationships pertaining to habitats and their locations, and reproductive conditions of plant species and corresponding temporal information. Firstly, we handcrafted rules for a traditional rule-based pattern matching approach. We then developed a relation extraction approach building upon transformer models, i.e., the Text-to-Text Transfer Transformer (T5), casting the relation extraction problem as a question answering and natural language inference task. We then propose a novel unsupervised hybrid approach that combines our rule-based and transformer-based approaches.
Results: Evaluation of our hybrid approach on an annotated corpus of biodiversity-focused documents demonstrated an improvement of up to 15 percentage points in recall and best performance over solely rule-based and transformer-based methods, with F1-scores ranging from 89.61% to 96.75% for reproductive condition - temporal expression relations, and from 85.39% to 89.90% for habitat - geographic location relations. Our work shows that even without training models on any domain-specific labeled dataset, we are able to extract relationships between biodiversity concepts from literature with satisfactory performance.
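As a rough illustration of what a handcrafted, rule-based pattern for reproductive condition - temporal expression relations might look like, the following minimal Python sketch pairs a condition keyword with a month expression in the same sentence. It is not the authors' rule set: the vocabularies, the single rule, and the function name `extract_relations` are assumptions made for this example only.

```python
import re

# Hypothetical vocabularies, chosen for illustration only.
CONDITION = r"(flowering|fruiting|seeding)"
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
# A temporal expression is one month, optionally extended to a range or pair.
TEMPORAL = rf"{MONTH}(?:\s*(?:to|and|-)\s*{MONTH})?"

# Rule: a condition keyword, then (within the same sentence) a preposition
# followed by a temporal expression.
RULE = re.compile(
    rf"\b{CONDITION}\b[^.]*?\b(?:in|during|from)\s+({TEMPORAL})",
    re.IGNORECASE,
)

def extract_relations(text):
    """Return (reproductive condition, temporal expression) pairs found in text."""
    return [(m.group(1).lower(), m.group(2)) for m in RULE.finditer(text)]

if __name__ == "__main__":
    sample = ("Shorea contorta was observed flowering from March to May. "
              "Fruiting occurs in June.")
    for condition, when in extract_relations(sample):
        print(condition, "->", when)
```

A single rule like this favors precision over recall, which is one plausible reason a rule-based component would be combined with a model-based one, as the abstract describes.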
Affiliation(s)
- Roselyn Gabud
  - Department of Computer Science, College of Engineering, University of the Philippines Diliman, Quezon City, Philippines
  - Institute of Computer Science, College of Arts and Sciences, University of the Philippines Los Baños, Laguna, Philippines
- Portia Lapitan
  - Department of Forest Biological Sciences, College of Forestry and Natural Resources, University of the Philippines Los Baños, Laguna, Philippines
- Vladimir Mariano
  - Young Southeast Asian Leaders Initiative (YSEALI) Academy, Fulbright University Vietnam, Ho Chi Minh City, Vietnam
- Eduardo Mendoza
  - Institute of Computer Science, College of Arts and Sciences, University of the Philippines Los Baños, Laguna, Philippines
  - Mathematics and Statistics Department, De la Salle University, Manila, Philippines
  - Center for Natural Science and Environmental Research, De la Salle University, Manila, Philippines
  - Max Planck Institute of Biochemistry, Munich, Germany
- Nelson Pampolina
  - Department of Forest Biological Sciences, College of Forestry and Natural Resources, University of the Philippines Los Baños, Laguna, Philippines
- Maria Art Antonette Clariño
  - Institute of Computer Science, College of Arts and Sciences, University of the Philippines Los Baños, Laguna, Philippines
- Riza Batista-Navarro
  - Institute of Computer Science, College of Arts and Sciences, University of the Philippines Los Baños, Laguna, Philippines
  - Department of Computer Science, University of Manchester, Manchester, United Kingdom
3
On the Use of Knowledge Transfer Techniques for Biomedical Named Entity Recognition. Future Internet 2023. [DOI: 10.3390/fi15020079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/19/2023] Open
Abstract
Biomedical named entity recognition (BioNER) is a preliminary task for many other tasks, e.g., relation extraction and semantic search. Extracting the text of interest from biomedical documents becomes more demanding as the availability of online data is increasing. Deep learning models have been adopted for biomedical named entity recognition (BioNER) as deep learning has been found very successful in many other tasks. Nevertheless, the complex structure of biomedical text data is still a challenging aspect for deep learning models. Limited annotated biomedical text data make it more difficult to train deep learning models with millions of trainable parameters. The single-task model, which focuses on learning a specific task, has issues in learning complex feature representations from a limited quantity of annotated data. Moreover, manually constructing annotated data is a time-consuming job. It is, therefore, vital to exploit other efficient ways to train deep learning models on the available annotated data. This work enhances the performance of the BioNER task by taking advantage of various knowledge transfer techniques: multitask learning and transfer learning. This work presents two multitask models (MTMs), which learn shared features and task-specific features by implementing the shared and task-specific layers. In addition, the presented trained MTM is also fine-tuned for each specific dataset to tailor it from a general features representation to a specialized features representation. The presented empirical results and statistical analysis from this work illustrate that the proposed techniques enhance significantly the performance of the corresponding single-task model (STM).
4
Abdelmageed N, Löffler F, Feddoul L, Algergawy A, Samuel S, Gaikwad J, Kazem A, König-Ries B. BiodivNERE: Gold standard corpora for named entity recognition and relation extraction in the biodiversity domain. Biodivers Data J 2022; 10:e89481. [PMID: 36761617 PMCID: PMC9836593 DOI: 10.3897/bdj.10.e89481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2022] [Accepted: 09/07/2022] [Indexed: 11/12/2022] Open
Abstract
Background: Biodiversity is the assortment of life on earth covering evolutionary, ecological, biological, and social forms. To preserve life in all its variety and richness, it is imperative to monitor the current state of biodiversity and its change over time and to understand the forces driving it. This need has resulted in numerous works being published in this field. With this, a large amount of textual data (publications) and metadata (e.g. dataset description) has been generated. To support the management and analysis of these data, two techniques from computer science are of interest, namely Named Entity Recognition (NER) and Relation Extraction (RE). While the former enables better content discovery and understanding, the latter fosters the analysis by detecting connections between entities and, thus, allows us to draw conclusions and answer relevant domain-specific questions. To automatically predict entities and their relations, machine/deep learning techniques could be used. The training and evaluation of those techniques require labelled corpora.
New information: In this paper, we present two gold-standard corpora for Named Entity Recognition (NER) and Relation Extraction (RE) generated from biodiversity datasets metadata and abstracts that can be used as evaluation benchmarks for the development of new computer-supported tools that require machine learning or deep learning techniques. These corpora are manually labelled and verified by biodiversity experts. In addition, we explain the detailed steps of constructing these datasets. Moreover, we demonstrate the underlying ontology for the classes and relations used to annotate such corpora.
Affiliation(s)
- Nora Abdelmageed
  - Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
  - Michael-Stifel-Center for Data-Driven and Simulation Science, Jena, Germany
- Felicitas Löffler
  - Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
- Leila Feddoul
  - Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
- Alsayed Algergawy
  - Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
- Sheeba Samuel
  - Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
  - Michael-Stifel-Center for Data-Driven and Simulation Science, Jena, Germany
- Jitendra Gaikwad
  - Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
- Anahita Kazem
  - Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
  - German Center for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, Germany
- Birgitta König-Ries
  - Heinz Nixdorf Chair for Distributed Information Systems, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
  - Michael-Stifel-Center for Data-Driven and Simulation Science, Jena, Germany
  - German Center for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, Germany
5
Le Guillarme N, Thuiller W. TaxoNERD: Deep neural models for the recognition of taxonomic entities in the ecological and evolutionary literature. Methods Ecol Evol 2021. [DOI: 10.1111/2041-210x.13778] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Affiliation(s)
- Nicolas Le Guillarme
  - CNRS, LECA, Laboratoire d'Ecologie Alpine, Université Grenoble Alpes, University Savoie Mont Blanc, Grenoble, France
- Wilfried Thuiller
  - CNRS, LECA, Laboratoire d'Ecologie Alpine, Université Grenoble Alpes, University Savoie Mont Blanc, Grenoble, France
6
Lücking A, Driller C, Stoeckel M, Abrami G, Pachzelt A, Mehler A. Multiple annotation for biodiversity: developing an annotation framework among biology, linguistics and text technology. Lang Resour Eval 2021. [DOI: 10.1007/s10579-021-09553-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Biodiversity information is contained in countless digitized and unprocessed scholarly texts. Although automated extraction of these data has been gaining momentum for years, there are still innumerable text sources that are poorly accessible and require a more advanced range of methods to extract relevant information. To improve the access to semantic biodiversity information, we have launched the BIOfid project (www.biofid.de) and have developed a portal to access the semantics of German language biodiversity texts, mainly from the 19th and 20th century. However, to make such a portal work, a couple of methods had to be developed or adapted first. In particular, text-technological information extraction methods were needed, which extract the required information from the texts. Such methods draw on machine learning techniques, which in turn are trained by learning data. To this end, among others, we gathered the bio text corpus, which is a cooperatively built resource, developed by biologists, text technologists, and linguists. A special feature of bio is its multiple annotation approach, which takes into account both general and biology-specific classifications, and by this means goes beyond previous, typically taxon- or ontology-driven proper name detection. We describe the design decisions and the genuine Annotation Hub Framework underlying the bio annotations and present agreement results. The tools used to create the annotations are introduced, and the use of the data in the semantic portal is described. Finally, some general lessons, in particular with multiple annotation projects, are drawn.
7
Little DP. Recognition of Latin scientific names using artificial neural networks. Applications in Plant Sciences 2020; 8:e11378. [PMID: 32765977 PMCID: PMC7394707 DOI: 10.1002/aps3.11378] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/01/2019] [Accepted: 04/28/2020] [Indexed: 05/28/2023]
Abstract
Premise: The automated recognition of Latin scientific names within vernacular text has many applications, including text mining, search indexing, and automated specimen-label processing. Most published solutions are computationally inefficient, incapable of running within a web browser, and focus on texts in English, thus omitting a substantial portion of biodiversity literature.
Methods and Results: An open-source browser-executable solution, Quaesitor, is presented here. It uses pattern matching (regular expressions) in combination with an ensembled classifier composed of an inclusion dictionary search (Bloom filter), a trio of complementary neural networks that differ in their approach to encoding text, and word length to automatically identify Latin scientific names in the 16 most common languages for biodiversity articles.
Conclusions: In combination, the classifiers can recognize Latin scientific names in isolation or embedded within the languages used for >96% of biodiversity literature titles. For three different data sets, they resulted in a 0.80-0.97 recall and a 0.69-0.84 precision at a rate of 8.6 ms/word.
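The combination of regular-expression candidate detection with a Bloom-filter dictionary lookup can be sketched in Python as follows. This is an illustrative sketch only, not Quaesitor's implementation: the neural networks and word-length features are omitted, and the class `BloomFilter`, the pattern `CANDIDATE`, the function `find_names`, the filter size, and the two-genus dictionary are all assumptions made for the example.

```python
import hashlib
import re

class BloomFilter:
    """Probabilistic set: no false negatives, rare false positives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Candidate binomial: capitalized word followed by a lowercase word (3+ letters).
CANDIDATE = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")

def find_names(text, genera):
    """Keep regex candidates whose first word is in the genus dictionary."""
    return [m.group(0) for m in CANDIDATE.finditer(text) if m.group(1) in genera]

if __name__ == "__main__":
    genera = BloomFilter()
    for genus in ("Quercus", "Shorea"):  # hypothetical dictionary entries
        genera.add(genus)
    print(find_names("Seedlings of Quercus robur grew near Shorea contorta.", genera))
```

A Bloom filter trades a small false-positive rate for a compact, fixed-size dictionary, which fits a setting such as a web browser where shipping a full name list would be costly.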
Affiliation(s)
- Damon P. Little
  - Lewis B. and Dorothy Cullman Program for Molecular Systematics, New York Botanical Garden, Bronx, New York 10458-5126, USA
  - PhD Program in Plant Biology, Graduate Center, City University of New York, New York, New York 10016-4309, USA
8
Kim JD, Wang Y, Fujiwara T, Okuda S, Callahan TJ, Cohen KB. Open Agile text mining for bioinformatics: the PubAnnotation ecosystem. Bioinformatics 2020; 35:4372-4380. [PMID: 30937439 PMCID: PMC6821251 DOI: 10.1093/bioinformatics/btz227] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/25/2018] [Revised: 03/16/2019] [Accepted: 03/29/2019] [Indexed: 11/12/2022] Open
Abstract
Motivation: Most currently available text mining tools share two characteristics that make them less than optimal for use by biomedical researchers: they require extensive specialist skills in natural language processing and they were built on the assumption that they should optimize global performance metrics on representative datasets. This is a problem because most end-users are not natural language processing specialists and because biomedical researchers often care less about global metrics like F-measure or representative datasets than they do about more granular metrics such as precision and recall on their own specialized datasets. Thus, there are fundamental mismatches between the assumptions of much text mining work and the preferences of potential end-users.
Results: This article introduces the concept of Agile text mining, and presents the PubAnnotation ecosystem as an example implementation. The system approaches the problems from two perspectives: it allows the reformulation of text mining by biomedical researchers from the task of assembling a complete system to the task of retrieving warehoused annotations, and it makes it possible to do very targeted customization of the pre-existing system to address specific end-user requirements. Two use cases are presented: assisted curation of the GlycoEpitope database, and assessing coverage in the literature of pre-eclampsia-associated genes.
Availability and implementation: The three tools that make up the ecosystem, PubAnnotation, PubDictionaries and TextAE, are publicly available as web services, and also as open source projects. The dictionaries and the annotation datasets associated with the use cases are all publicly available through PubDictionaries and PubAnnotation, respectively.
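For readers unfamiliar with the metrics contrasted above, the following minimal Python sketch computes precision, recall, and the balanced F-measure (F1) for a set of predicted annotations against a gold-standard set. The function name `precision_recall_f1` and the example annotations are hypothetical; the formulas are the standard definitions and are not specific to PubAnnotation.

```python
def precision_recall_f1(predicted, gold):
    """Compare two collections of annotations by exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of predictions that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold annotations that were found.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

if __name__ == "__main__":
    # Hypothetical gene mentions returned by a tool vs. a curator's list.
    predicted = {"FLT1", "ENG", "VEGFA", "TP53"}
    gold = {"FLT1", "ENG", "PGF"}
    print(precision_recall_f1(predicted, gold))
```

Reporting precision and recall separately on a user's own dataset, rather than a single global F-measure, matches the preference the abstract attributes to biomedical end-users.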
Affiliation(s)
- Jin-Dong Kim
  - Database Center for Life Science, Research Organization of Information and Systems, Kashiwa, Chiba, Japan
- Yue Wang
  - Database Center for Life Science, Research Organization of Information and Systems, Kashiwa, Chiba, Japan
- Toyofumi Fujiwara
  - Database Center for Life Science, Research Organization of Information and Systems, Kashiwa, Chiba, Japan
- Shujiro Okuda
  - Graduate School of Medical and Dental Sciences, Niigata University, Niigata, Japan
- Tiffany J Callahan
  - Computational Bioscience Program, University of Colorado Denver, Anschutz Medical Campus, Aurora, CO, USA
- K Bretonnel Cohen
  - Computational Bioscience Program, University of Colorado Denver, Anschutz Medical Campus, Aurora, CO, USA
  - Université Paris-Saclay, LIMSI-ILES, France