1
O'Connor K, Golder S, Weissenbacher D, Klein AZ, Magge A, Gonzalez-Hernandez G. Methods and Annotated Data Sets Used to Predict the Gender and Age of Twitter Users: Scoping Review. J Med Internet Res 2024; 26:e47923. [PMID: 38488839 PMCID: PMC10980991 DOI: 10.2196/47923] [Received: 04/05/2023] [Revised: 07/28/2023] [Accepted: 08/01/2023] [Indexed: 03/19/2024] Open
Abstract
BACKGROUND Patient health data collected from a variety of nontraditional resources, commonly referred to as real-world data, can be a key information source for health and social science research. Social media platforms, such as Twitter (Twitter, Inc), offer vast amounts of real-world data. An important aspect of incorporating social media data in scientific research is identifying the demographic characteristics of the users who posted those data. Age and gender are considered key demographics for assessing the representativeness of the sample and enable researchers to study subgroups and disparities effectively. However, deciphering the age and gender of social media users poses challenges. OBJECTIVE This scoping review aims to summarize the existing literature on the prediction of the age and gender of Twitter users and provide an overview of the methods used. METHODS We searched 15 electronic databases and carried out reference checking to identify relevant studies that met our inclusion criteria: studies that predicted the age or gender of Twitter users using computational methods. The screening process was performed independently by 2 researchers to ensure the accuracy and reliability of the included studies. RESULTS Of the initial 684 studies retrieved, 74 (10.8%) studies met our inclusion criteria. Among these 74 studies, 42 (57%) focused on predicting gender, 8 (11%) focused on predicting age, and 24 (32%) predicted a combination of both age and gender. Gender prediction was predominantly approached as a binary classification task, with the reported performance of the methods ranging from 0.58 to 0.96 F1-score or 0.51 to 0.97 accuracy. Age prediction approaches varied in terms of classification groups, with a higher range of reported performance, ranging from 0.31 to 0.94 F1-score or 0.43 to 0.86 accuracy. 
The heterogeneous nature of the studies and the reporting of dissimilar performance metrics made it challenging to quantitatively synthesize results and draw definitive conclusions. CONCLUSIONS Our review found that although automated methods for predicting the age and gender of Twitter users have evolved to incorporate techniques such as deep neural networks, a significant proportion of the attempts rely on traditional machine learning methods, suggesting that there is potential to improve the performance of these tasks by using more advanced methods. Gender prediction has generally achieved a higher reported performance than age prediction. However, the lack of standardized reporting of performance metrics or standard annotated corpora to evaluate the methods used hinders any meaningful comparison of the approaches. Potential biases stemming from the collection and labeling of data used in the studies were identified as a problem, emphasizing the need for careful consideration and mitigation of biases in future studies. This scoping review provides valuable insights into the methods used for predicting the age and gender of Twitter users, along with the challenges and considerations associated with these methods.
Affiliation(s)
- Karen O'Connor
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Su Golder
- Department of Health Sciences, University of York, York, United Kingdom
- Davy Weissenbacher
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, United States
- Ari Z Klein
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Arjun Magge
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
2
O’Connor K, Weissenbacher D, Elyaderani A, Lautenbach E, Scotch M, Gonzalez-Hernandez G. Patient-Related Metadata Reported in Sequencing Studies of SARS-CoV-2: Protocol for a Scoping Review and Bibliometric Analysis. medRxiv 2024:2023.07.14.23292681. [PMID: 37503241 PMCID: PMC10371180 DOI: 10.1101/2023.07.14.23292681] [Indexed: 07/29/2023]
Abstract
Background There has been an unprecedented effort to sequence the SARS-CoV-2 virus and examine its molecular evolution. This has been facilitated by the availability of publicly accessible databases, the Global Initiative on Sharing All Influenza Data (GISAID) and GenBank, which collectively hold millions of SARS-CoV-2 sequence records. Genomic epidemiology, however, seeks to go beyond phylogenetic analysis by linking genetic information to patient characteristics and disease outcomes, enabling a comprehensive understanding of transmission dynamics and disease impact. While these repositories include fields reflecting patient-related metadata for a given sequence, inclusion of these demographic and clinical details is scarce. The extent to which patient-related metadata is reported in published sequencing studies, and the quality of that reporting, remain largely unexplored. Methods The NIH's LitCovid collection will be used for automated classification of articles reporting having deposited SARS-CoV-2 sequences in public repositories, while an independent search will be conducted in PubMed for validation. Data extraction will be conducted using Covidence. The extracted data will be synthesized and summarized to quantify the availability of patient metadata in the published literature of SARS-CoV-2 sequencing studies. For the bibliometric analysis, relevant data points, such as author affiliations and citation metrics, will be extracted. Discussion This scoping review will report on the extent and types of patient-related metadata reported in genomic viral sequencing studies of SARS-CoV-2, identify gaps in this reporting, and make recommendations for improving the quality and consistency of reporting in this area. The bibliometric analysis will uncover trends and patterns in the reporting of patient-related metadata, including differences in reporting based on study types or geographic regions. Co-occurrence networks of author keywords will also be presented. 
The insights gained from this study may help improve the quality and consistency of reporting patient metadata, enhancing the utility of sequence metadata and facilitating future research on infectious diseases.
Affiliation(s)
- Karen O’Connor
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Davy Weissenbacher
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, USA
- Amir Elyaderani
- Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ, USA
- Ebbing Lautenbach
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Division of Infectious Diseases, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Center for Clinical Epidemiology and Biostatistics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Matthew Scotch
- Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ, USA
- College of Health Solutions, Arizona State University, Tempe, AZ, USA
3
Weissenbacher D, Courtright K, Rawal S, Crane-Droesch A, O'Connor K, Kuhl N, Merlino C, Foxwell A, Haines L, Puhl J, Gonzalez-Hernandez G. Detecting goals of care conversations in clinical notes with active learning. J Biomed Inform 2024; 151:104618. [PMID: 38431151 DOI: 10.1016/j.jbi.2024.104618] [Received: 07/03/2023] [Revised: 01/22/2024] [Accepted: 02/26/2024] [Indexed: 03/05/2024]
Abstract
OBJECTIVE Goals of care (GOC) discussions are an increasingly used quality metric in serious illness care and research. Wide variation in documentation practices within the Electronic Health Record (EHR) presents challenges for reliable measurement of GOC discussions. Novel natural language processing approaches are needed to capture GOC discussions documented in real-world samples of seriously ill hospitalized patients' EHR notes, a corpus with a very low event prevalence. METHODS To automatically detect sentences documenting GOC discussions outside of dedicated GOC note types, we proposed an ensemble of classifiers aggregating the predictions of rule-based, feature-based, and three transformer-based classifiers. We trained our classifier on 600 manually annotated EHR notes among patients with serious illnesses. Our corpus exhibited an extremely imbalanced ratio between sentences discussing GOC and sentences that do not. This ratio challenges standard supervision methods to train a classifier. Therefore, we trained our classifier with active learning. RESULTS Using active learning, we reduced the annotation cost to fine-tune our ensemble by 70% while improving its performance in our test set of 176 EHR notes, with 0.557 F1-score for sentence classification and 0.629 for note classification. CONCLUSION When classifying notes, with a true positive rate of 72% (13/18) and false positive rate of 8% (13/158), our performance may be sufficient for deploying our classifier in the EHR to facilitate bedside clinicians' access to GOC conversations documented outside of dedicated note types, without overburdening clinicians with false positives. Improvements are needed before using it to enrich trial populations or as an outcome measure.
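The note-level rates quoted in the conclusion can be reproduced from the counts given in parentheses. The sketch below is only an arithmetic check, not the authors' evaluation code: the function name `rates` is hypothetical, and the false negative and true negative counts are derived here by subtraction (18 - 13 and 158 - 13) rather than stated in the abstract.

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
    """Return (true positive rate, false positive rate) from confusion counts."""
    tpr = tp / (tp + fn)  # share of GOC notes that were detected
    fpr = fp / (fp + tn)  # share of non-GOC notes that were flagged
    return tpr, fpr


# Counts from the abstract: 13 of 18 positive notes detected,
# 13 of 158 negative notes flagged. fn and tn are derived, not reported.
tpr, fpr = rates(tp=13, fn=18 - 13, fp=13, tn=158 - 13)
print(f"TPR = {tpr:.1%}, FPR = {fpr:.1%}")  # TPR = 72.2%, FPR = 8.2%
```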
Affiliation(s)
- Davy Weissenbacher
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, USA.
- Katherine Courtright
- Palliative and Advanced Illness Research Center, Department of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Siddharth Rawal
- DBEI, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Andrew Crane-Droesch
- Palliative and Advanced Illness Research Center, Department of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Karen O'Connor
- DBEI, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Nicholas Kuhl
- The Department of Medicine, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Corinne Merlino
- Palliative and Advanced Illness Research Center, Department of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Anessa Foxwell
- NewCourtland Center for Transitions and Health, School of Nursing, University of Pennsylvania, Philadelphia, PA, USA
- Lindsay Haines
- Hospice & Palliative Care, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Joseph Puhl
- Palliative and Advanced Illness Research Center, Department of Medicine, University of Pennsylvania, Philadelphia, PA, USA
4
Boonstra MJ, Weissenbacher D, Moore JH, Gonzalez-Hernandez G, Asselbergs FW. Artificial intelligence: revolutionizing cardiology with large language models. Eur Heart J 2024; 45:332-345. [PMID: 38170821 PMCID: PMC10834163 DOI: 10.1093/eurheartj/ehad838] [Received: 06/20/2023] [Revised: 12/01/2023] [Accepted: 12/05/2023] [Indexed: 01/05/2024] Open
Abstract
Natural language processing techniques are having an increasing impact on clinical care from patient, clinician, administrator, and research perspectives. Applications include automated generation of clinical notes and discharge letters, medical term coding for billing, medical chatbots for both patients and clinicians, data enrichment in the identification of disease symptoms or diagnoses, cohort selection for clinical trials, and auditing. This review presents an overview of the history of natural language processing techniques with a brief technical background. It then discusses implementation strategies for natural language processing tools, focusing specifically on large language models, and concludes with future opportunities for applying such techniques in the field of cardiology.
Affiliation(s)
- Machteld J Boonstra
- Department of Cardiology, Amsterdam Cardiovascular Sciences, Amsterdam University Medical Centre, University of Amsterdam, Amsterdam, Netherlands
- Davy Weissenbacher
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Jason H Moore
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Folkert W Asselbergs
- Department of Cardiology, Amsterdam Cardiovascular Sciences, Amsterdam University Medical Centre, University of Amsterdam, Amsterdam, Netherlands
- Institute of Health Informatics, University College London, London, UK
- The National Institute for Health Research University College London Hospitals Biomedical Research Centre, University College London, London, UK
5
Micale C, Golder S, O'Connor K, Weissenbacher D, Gross R, Hennessy S, Gonzalez-Hernandez G. Correction to: Patient-Reported Reasons for Antihypertensive Medication Change: A Quantitative Study Using Social Media. Drug Saf 2024; 47:193. [PMID: 38231378 DOI: 10.1007/s40264-023-01394-1] [Indexed: 01/18/2024]
Affiliation(s)
- Cristina Micale
- Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
- Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA, USA.
- Su Golder
- Department of Health Sciences, University of York, York, UK
- Karen O'Connor
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Davy Weissenbacher
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, USA
- Robert Gross
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Sean Hennessy
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
6
Weissenbacher D, Rawal S, Zhao X, Priestley JRC, Szigety KM, Schmidt SF, Higgins MJ, Magge A, O'Connor K, Gonzalez-Hernandez G, Campbell IM. PhenoID, a language model normalizer of physical examinations from genetics clinical notes. medRxiv 2024:2023.10.16.23296894. [PMID: 37904943 PMCID: PMC10614999 DOI: 10.1101/2023.10.16.23296894] [Indexed: 11/02/2023]
Abstract
Background Phenotypes identified during dysmorphology physical examinations are critical to genetic diagnosis and nearly universally documented as free text in the electronic health record (EHR). Variation in how phenotypes are recorded in free text makes large-scale computational analysis extremely challenging. Existing natural language processing (NLP) approaches to address phenotype extraction are trained largely on the biomedical literature or on case vignettes rather than actual EHR data. Methods We implemented a tailored system at the Children's Hospital of Philadelphia that allows clinicians to document dysmorphology physical exam findings. From the underlying data, we manually annotated a corpus of 3136 organ system observations using the Human Phenotype Ontology (HPO). We provide this corpus publicly. We trained a transformer-based NLP system to identify HPO terms from exam observations. The pipeline includes an extractor, which identifies tokens in the sentence expected to contain an HPO term, and a normalizer, which uses those tokens together with the original observation to determine the specific term mentioned. Findings We find that our labeler and normalizer NLP pipeline, which we call PhenoID, achieves state-of-the-art performance for the dysmorphology physical exam phenotype extraction task. PhenoID's performance on the test set was 0.717, compared to the nearest baseline system (Pheno-Tagger) performance of 0.633. An analysis of our system's normalization errors shows possible imperfections in the HPO terminology itself but also reveals a lack of semantic understanding by our transformer models. Interpretation Transformer-based NLP models are a promising approach to genetic phenotype extraction and, with recent development of larger pre-trained causal language models, may improve semantic understanding in the future. We believe our results also have direct applicability to more general extraction of medical signs and symptoms. 
Funding US National Institutes of Health.
7
Micale C, Golder S, O'Connor K, Weissenbacher D, Gross R, Hennessy S, Gonzalez-Hernandez G. Patient-Reported Reasons for Antihypertensive Medication Change: A Quantitative Study Using Social Media. Drug Saf 2024; 47:81-91. [PMID: 37995049 DOI: 10.1007/s40264-023-01366-5] [Accepted: 10/10/2023] [Indexed: 11/24/2023]
Abstract
INTRODUCTION Hypertension is the leading cause of heart disease in the world, and discontinuation of or nonadherence to antihypertensive medication constitutes a significant global health concern. Patients with hypertension have high rates of medication nonadherence. Studies of reasons for nonadherence using traditional surveys are limited, can be expensive, and suffer from response, white-coat, and recall biases. Mining relevant posts by patients on social media is inexpensive and less impacted by the pressures and biases of formal surveys, and may provide direct insights into factors that lead to non-compliance with antihypertensive medication. METHODS This study examined medication ratings posted to WebMD, an online health forum that allows patients to post medication reviews. We used a previously developed natural language processing classifier to extract indications and reasons for changes in angiotensin II receptor blocker (ARB) and angiotensin-converting enzyme inhibitor (ACEI) treatments. After extraction, ratings were manually annotated and compared with data from the US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) public database. RESULTS From a collection of 343,459 WebMD reviews, we automatically extracted 1867 posts mentioning changes in ACEIs or ARBs, and manually reviewed the 300 most recent posts regarding ACEI treatments and the 300 most recent posts regarding ARB treatments. After excluding posts that only mentioned a dose change or were a false-positive mention, 142 posts in the ARBs dataset and 187 posts in the ACEIs dataset remained. The majority of posts (97% ARBs, 91% ACEIs) indicated experiencing an adverse event as the reason for medication change. 
The most common adverse events reported mapped to the Medical Dictionary for Regulatory Activities were "musculoskeletal and connective tissue disorders" like muscle and joint pain for ARBs, and "respiratory, thoracic, and mediastinal disorders" like cough and shortness of breath for ACEIs. These categories also had the largest differences in percentage points, appearing more frequently on WebMD data than FDA data (p < 0.001). CONCLUSION Musculoskeletal and respiratory symptoms were the most commonly reported adverse effects in social media postings associated with drug discontinuation. Managing such symptoms is a potential target of interventions seeking to improve medication persistence.
Affiliation(s)
- Cristina Micale
- Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
- Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA, USA.
- Su Golder
- Department of Health Sciences, University of York, York, UK
- Karen O'Connor
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Davy Weissenbacher
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, USA
- Robert Gross
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Sean Hennessy
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
8
Lanera C, Lorenzoni G, Barbieri E, Piras G, Magge A, Weissenbacher D, Donà D, Cantarutti L, Gonzalez-Hernandez G, Giaquinto C, Gregori D. Monitoring the Epidemiology of Otitis Using Free-Text Pediatric Medical Notes: A Deep Learning Approach. J Pers Med 2023; 14:28. [PMID: 38248729 PMCID: PMC10817419 DOI: 10.3390/jpm14010028] [Received: 11/30/2023] [Revised: 12/20/2023] [Accepted: 12/21/2023] [Indexed: 01/23/2024] Open
Abstract
Free-text information represents a valuable resource for epidemiological surveillance. Its unstructured nature, however, presents significant challenges in the extraction of meaningful information. This study presents a deep learning model for classifying otitis using pediatric medical records. We analyzed the Pedianet database, which includes data from January 2004 to August 2017. The model categorizes narratives from clinical record diagnoses into six types: no otitis, non-media otitis, non-acute otitis media (OM), acute OM (AOM), AOM with perforation, and recurrent AOM. Utilizing deep learning architectures, including an ensemble model, this study addressed the challenges associated with the manual classification of extensive narrative data. The performance of the model was evaluated according to a gold standard classification made by three expert clinicians. The ensemble model achieved values of 97.03, 93.97, 96.59, and 95.48 for balanced precision, balanced recall, accuracy, and balanced F1 measure, respectively. These results underscore the efficacy of using automated systems for medical diagnoses, especially in pediatric care. Our findings demonstrate the potential of deep learning in interpreting complex medical records, enhancing epidemiological surveillance and research. This approach offers significant improvements in handling large-scale medical data, ensuring accuracy and minimizing human error. The methodology is adaptable to other medical contexts, promising a new horizon in healthcare analytics.
Affiliation(s)
- Corrado Lanera
- Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
- Giulia Lorenzoni
- Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
- Elisa Barbieri
- Division of Pediatric Infectious Diseases, Department for Woman and Child Health, University of Padova, 35128 Padova, Italy
- Gianluca Piras
- Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
- Arjun Magge
- Health Language Processing Center, Institute for Biomedical Informatics at the Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Davy Weissenbacher
- Health Language Processing Center, Institute for Biomedical Informatics at the Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Daniele Donà
- Division of Pediatric Infectious Diseases, Department for Woman and Child Health, University of Padova, 35128 Padova, Italy
- Graciela Gonzalez-Hernandez
- Health Language Processing Center, Institute for Biomedical Informatics at the Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Carlo Giaquinto
- Division of Pediatric Infectious Diseases, Department for Woman and Child Health, University of Padova, 35128 Padova, Italy
- Società Servizi Telematici—Pedianet, 35100 Padova, Italy
- Dario Gregori
- Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac, Thoracic, Vascular Sciences and Public Health, University of Padova, 35131 Padova, Italy
9
Weissenbacher D, O'Connor K, Klein A, Golder S, Flores I, Elyaderani A, Scotch M, Gonzalez-Hernandez G. Text mining biomedical literature to identify extremely unbalanced data for digital epidemiology and systematic reviews: dataset and methods for a SARS-CoV-2 genomic epidemiology study. medRxiv 2023:2023.07.29.23293370. [PMID: 37577535 PMCID: PMC10418574 DOI: 10.1101/2023.07.29.23293370] [Indexed: 08/15/2023]
Abstract
Many studies require researchers to extract specific information from the published literature, such as details about sequence records or about a randomized controlled trial. While manual extraction is cost efficient for small studies, larger studies such as systematic reviews are much more costly and time-consuming. To avoid exhaustive manual searches and extraction, and their related cost and effort, natural language processing (NLP) methods can be tailored for the more subtle extraction and decision tasks that typically only humans have performed. The need for such studies that use the published literature as a data source became even more evident as the COVID-19 pandemic raged through the world and millions of sequenced samples were deposited in public repositories such as GISAID and GenBank. These deposits promised large genomic epidemiology studies, but more often than not the records lacked important details, which prevented such studies. Thus, granular geographic location or the most basic patient-relevant data, such as demographic information or clinical outcomes, were not noted in the sequence record. However, some of these data were indeed published, but in the text, tables, or supplementary material of a corresponding published article. We present here methods to identify relevant journal articles that report having produced and made available in GenBank or GISAID new SARS-CoV-2 sequences, as the articles that initially produced and made available the sequences are the most likely to include the high-level details about the patients from whom the sequences were obtained. Human annotators validated the approach, creating a gold standard set for training and validation of a machine learning classifier. 
Identifying these articles is a crucial step to enable future automated informatics pipelines that will apply Machine Learning and Natural Language Processing to identify patient characteristics such as co-morbidities, outcomes, age, gender, and race, enriching SARS-CoV-2 sequence databases with actionable information for defining large genomic epidemiology studies. Thus, enriched patient metadata can enable secondary data analysis, at scale, to uncover associations between the viral genome (including variants of concern and their sublineages), transmission risk, and health outcomes. However, for such enrichment to happen, the right papers need to be found and very detailed data needs to be extracted from them. Further, finding the very specific articles needed for inclusion is a task that also facilitates scoping and systematic reviews, greatly reducing the time needed for full-text analysis and extraction.
Affiliation(s)
- Ari Klein
- University of Pennsylvania, Philadelphia, PA, USA
- Su Golder
- University of York, York, United Kingdom
- Ivan Flores
- Cedars-Sinai Medical Center, Los Angeles, CA, USA
10
Weissenbacher D, O’Connor K, Rawal S, Zhang Y, Tsai RTH, Miller T, Xu D, Anderson C, Liu B, Han Q, Zhang J, Kulev I, Köprü B, Rodriguez-Esteban R, Ozkirimli E, Ayach A, Roller R, Piccolo S, Han P, Vydiswaran VGV, Tekumalla R, Banda JM, Bagherzadeh P, Bergler S, Silva JF, Almeida T, Martinez P, Rivera-Zavala R, Wang CK, Dai HJ, Alberto Robles Hernandez L, Gonzalez-Hernandez G. Automatic Extraction of Medication Mentions from Tweets-Overview of the BioCreative VII Shared Task 3 Competition. Database (Oxford) 2023; 2023:baac108. [PMID: 36734300 PMCID: PMC9896308 DOI: 10.1093/database/baac108] [Received: 04/14/2022] [Revised: 10/28/2022] [Accepted: 12/13/2022] [Indexed: 02/04/2023]
Abstract
This study presents the outcomes of the shared task competition BioCreative VII (Task 3) focusing on the extraction of medication names from a Twitter user's publicly available tweets (the user's 'timeline'). In general, detecting health-related tweets is notoriously challenging for natural language processing tools. The main challenge, aside from the informality of the language used, is that people tweet about any and all topics, and most of their tweets are not related to health. Thus, finding those tweets in a user's timeline that mention specific health-related concepts such as medications requires addressing extreme imbalance. Task 3 called for detecting tweets in a user's timeline that mention a medication name and, for each detected mention, extracting its span. The organizers made available a corpus consisting of 182 049 tweets publicly posted by 212 Twitter users with all medication mentions manually annotated. The corpus exhibits the natural distribution of positive tweets, with only 442 tweets (0.2%) mentioning a medication. This task was an opportunity for participants to evaluate methods that are robust to class imbalance beyond the simple lexical match. A total of 65 teams registered, and 16 teams submitted a system run. This study summarizes the corpus created by the organizers and the approaches taken by the participating teams for this challenge. The corpus is freely available at https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-3/. The methods and the results of the competing systems are analyzed with a focus on the approaches taken for learning from class-imbalanced data.
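For readers unfamiliar with the "simple lexical match" that the task asks systems to go beyond, the following is a minimal sketch of such a baseline together with the prevalence arithmetic from the abstract. It is not the organizers' or any participant's system: the three-drug `LEXICON`, the example tweets, and the function name `extract_mentions` are all hypothetical, chosen only to illustrate tweet-level detection plus span extraction.

```python
import re

# Hypothetical toy lexicon; a real baseline would load a full drug-name list.
LEXICON = ["ibuprofen", "metformin", "lisinopril"]
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in LEXICON) + r")\b",
    re.IGNORECASE,
)


def extract_mentions(tweet: str) -> list[tuple[int, int, str]]:
    """Return (start, end, text) character spans of lexicon matches in a tweet."""
    return [(m.start(), m.end(), m.group(0)) for m in PATTERN.finditer(tweet)]


# Hypothetical two-tweet timeline: one positive tweet, one negative.
timeline = ["Took ibuprofen for my headache", "Great game last night!"]
spans = [extract_mentions(tweet) for tweet in timeline]
print(spans)  # [[(5, 14, 'ibuprofen')], []]

# Class prevalence reported in the abstract: 442 positives in 182 049 tweets.
prevalence = 442 / 182049
print(f"{prevalence:.1%}")  # 0.2%
```

A matcher of this kind misses misspellings, brand variants, and ambiguous names, which is one reason the task emphasizes methods that are robust beyond dictionary lookup.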
Affiliation(s)
- Davy Weissenbacher
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Karen O’Connor
  - DBEI, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Siddharth Rawal
  - DBEI, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Yu Zhang
  - Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Rd, Zhongli District, Taoyuan 320, Taiwan
- Richard Tzong-Han Tsai
  - Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Rd, Zhongli District, Taoyuan 320, Taiwan
  - IoX Center, National Taiwan University, Da’an District, Section 4, Roosevelt Rd, No. 1, Barry Lam Hall, Taipei 106, Taiwan
  - Research Center for Humanities and Social Sciences, Academia Sinica, No. 128, Section 2, Academia Rd, Nangang District, Taipei 115, Taiwan
- Timothy Miller
  - Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA, USA
  - Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Dongfang Xu
  - Computational Health Informatics Program, Boston Children’s Hospital, Boston, MA, USA
  - Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Bo Liu
  - NVIDIA, Santa Clara, CA, USA
- Qing Han
  - Department of Statistics, Florida State University, Tallahassee, FL, USA
- Jinfeng Zhang
  - Department of Statistics, Florida State University, Tallahassee, FL, USA
- Igor Kulev
  - Data and Analytics Chapter, F. Hoffmann-La Roche Ltd, Switzerland
- Berkay Köprü
  - Data and Analytics Chapter, F. Hoffmann-La Roche Ltd, Switzerland
- Raul Rodriguez-Esteban
  - Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Switzerland
- Elif Ozkirimli
  - Data and Analytics Chapter, F. Hoffmann-La Roche Ltd, Switzerland
- Ammer Ayach
  - Speech and Language Technology Lab, DFKI, Berlin, Germany
- Roland Roller
  - Speech and Language Technology Lab, DFKI, Berlin, Germany
- Stephen Piccolo
  - Department of Biology, Brigham Young University, Provo, UT, USA
- Peijin Han
  - Department of Computational Medicine and Bioinformatics, Medical School, University of Michigan, Ann Arbor, MI, USA
- V G Vinod Vydiswaran
  - Department of Learning Health Sciences, Medical School, University of Michigan, Ann Arbor, MI, USA
  - School of Information, University of Michigan, Ann Arbor, MI, USA
- Ramya Tekumalla
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
- Juan M Banda
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
- João F Silva
  - DETI, Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Portugal
- Tiago Almeida
  - DETI, Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Portugal
  - Department of Computation, University of A Coruña, Spain
- Paloma Martinez
  - Computer Science and Engineering Department, Universidad Carlos III de Madrid, Madrid, Spain
- Renzo Rivera-Zavala
  - Computer Science and Engineering Department, Universidad Carlos III de Madrid, Madrid, Spain
- Chen-Kai Wang
  - Big Data Laboratory, Chunghwa Telecom Laboratories, Taoyuan, Taiwan
  - Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Hong-Jie Dai
  - Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan

11
Golder S, Weissenbacher D, O’Connor K, Hennessy S, Gross R, Hernandez GG. Patient-Reported Reasons for Switching or Discontinuing Statin Therapy: A Mixed Methods Study Using Social Media. Drug Saf 2022; 45:971-981. [PMID: 35933649 PMCID: PMC9402720 DOI: 10.1007/s40264-022-01212-0] [Accepted: 07/10/2022] [Indexed: 11/16/2022]
Abstract
Introduction Statin discontinuation can have major negative health consequences. Studying the reasons for discontinuation can be challenging as traditional data collection methods have limitations. We propose an alternative approach using social media. Methods We used natural language processing and machine learning to extract mentions of discontinuation of statin therapy from an online health forum, WebMD (http://www.webmd.com). We then extracted data according to themes and identified key attributes of the people posting for themselves. Results We identified 2121 statin reviews that contained information on discontinuing at least one named statin. Sixty percent of people posting declared themselves as female and the most common age category was 55–64 years. Over half the people taking statins did so for < 6 months. By far the most common reason given (90%) was patient experience of adverse events, the most common of which were musculoskeletal and connective tissue disorders. The rank order of adverse events reported in WebMD was largely consistent with those reported to regulatory agencies in the US and UK. Data were available on age, sex, duration of statin use, and, in some instances, adverse event resolution and rechallenge. Conclusion Social media may provide data on the reasons for switching or discontinuation of a medication, as well as unique patient perspectives that may influence continuation of a medication. This information source may provide unique data for novel interventions to reduce medication discontinuation. Supplementary Information The online version contains supplementary material available at 10.1007/s40264-022-01212-0.

12
Weissenbacher D, Flores JI, Wang Y, O’Connor K, Rawal S, Stevens R, Gonzalez-Hernandez G. Automatic Cohort Determination from Twitter for HIV Prevention amongst Black and Hispanic Men. AMIA Annu Symp Proc 2022; 2022:504-513. [PMID: 35854738 PMCID: PMC9285152] [Indexed: 12/15/2022]
Abstract
Recruiting people from diverse backgrounds to participate in health research requires intentional and culture-driven strategic efforts. In this study, we utilize publicly available Twitter posts to identify targeted populations to recruit for our HIV prevention study. Natural language processing and machine learning classification methods were used to find self-declarations of ethnicity, gender, age group, and sexually-explicit language. Using the official Twitter API we collected 47.4 million tweets posted over 8 months from two areas geo-centered around Los Angeles. Using available tools (Demographer and M3), we identified the age and race of 5,392 users as likely young Black or Hispanic men living in Los Angeles. We then collected and analyzed their timelines to automatically find sex-related tweets, yielding 2,166 users. Despite a limited precision, our results suggest that it is possible to automatically identify users based on their demographic attributes and Twitter language characteristics for enrollment into epidemiological studies.
Affiliation(s)
- J. Ivan Flores
  - University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Yunwen Wang
  - University of Southern California, Los Angeles, California, USA
- Karen O’Connor
  - University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Robin Stevens
  - University of Southern California, Los Angeles, California, USA

13
Magge A, Weissenbacher D, O'Connor K, Scotch M, Gonzalez-Hernandez G. SEED: Symptom Extraction from English Social Media Posts using Deep Learning and Transfer Learning. medRxiv 2022:2021.02.09.21251454. [PMID: 33594374 PMCID: PMC7885933 DOI: 10.1101/2021.02.09.21251454] [Indexed: 12/02/2022]
Abstract
The increase of social media usage across the globe has fueled efforts in digital epidemiology for mining valuable information such as medication use, adverse drug effects and reports of viral infections that directly and indirectly affect population health. Such specific information can, however, be scarce, hard to find, and mostly expressed in very colloquial language. In this work, we focus on a fundamental problem that enables social media mining for disease monitoring. We present and make available SEED, a natural language processing approach to detect symptom and disease mentions from social media data obtained from platforms such as Twitter and DailyStrength and to normalize them into UMLS terminology. Using multi-corpus training and deep learning models, the tool achieves overall F1 scores of 0.86 and 0.72 on the DailyStrength and balanced Twitter datasets, respectively, significantly improving over previous approaches on the same datasets. We apply the tool to Twitter posts that report COVID-19 symptoms, particularly to quantify whether the SEED system can extract symptoms absent in the training data. The study results also draw attention to the potential of multi-corpus training for performance improvements and the need for continuous training on newly obtained data for consistent performance amidst the ever-changing nature of the social media vocabulary.
Affiliation(s)
- Arjun Magge
  - Perelman School of Medicine, University of Pennsylvania

14
Golder S, Klein AZ, Magge A, O’Connor K, Cai H, Weissenbacher D, Gonzalez-Hernandez G. A chronological and geographical analysis of personal reports of COVID-19 on Twitter from the UK. Digit Health 2022; 8:20552076221097508. [PMID: 35574580 PMCID: PMC9096830 DOI: 10.1177/20552076221097508] [Received: 06/12/2020] [Accepted: 04/12/2022] [Indexed: 11/30/2022]
Abstract
Objective Given the uncertainty about the trends and extent of the rapidly evolving COVID-19 outbreak, and the lack of extensive testing in the United Kingdom, our understanding of COVID-19 transmission is limited. We proposed to use Twitter to identify personal reports of COVID-19 and to assess whether these data can help us understand and model the transmission and trajectory of COVID-19. Methods We used a natural language processing and machine learning framework. We collected tweets (excluding retweets) from the Twitter Streaming API that indicate that the user or a member of the user's household had been exposed to COVID-19. The tweets were required to be geo-tagged or have profile location metadata in the UK. Results We identified a high level of agreement between personal reports from Twitter and lab-confirmed cases by geographical region in the UK. Temporal analysis indicated that personal reports from Twitter appear up to 2 weeks before UK government lab-confirmed cases are recorded. Conclusions Analysis of tweets may indicate trends in COVID-19 in the UK and provide signals of geographical locations where resources may need to be targeted or where regional policies may need to be put in place to further limit the spread of COVID-19. It may also help inform policy makers of the restrictions in lockdown that are most effective or ineffective.
Affiliation(s)
- Su Golder
  - Department of Health Sciences, University of York, York, UK
- Ari Z Klein
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Arjun Magge
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Karen O’Connor
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Haitao Cai
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Davy Weissenbacher
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Graciela Gonzalez-Hernandez
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

15
Weissenbacher D, Ge S, Klein A, O'Connor K, Gross R, Hennessy S, Gonzalez-Hernandez G. Active neural networks to detect mentions of changes to medication treatment in social media. J Am Med Inform Assoc 2021; 28:2551-2561. [PMID: 34613417 PMCID: PMC8633624 DOI: 10.1093/jamia/ocab158] [Received: 12/09/2020] [Revised: 04/13/2021] [Accepted: 07/23/2021] [Indexed: 12/30/2022]
Abstract
Objective We address a first step toward using social media data to supplement current efforts in monitoring population-level medication nonadherence: detecting changes to medication treatment. Medication treatment changes that are not overseen by physicians, such as changes to dosage or frequency of intake, are by definition medication nonadherence. Despite the consequences, including worsening health conditions or death, 50% of patients are estimated to not take medications as indicated. Current methods to identify nonadherence have major limitations. Direct observation may be intrusive or expensive, and indirect observation through patient surveys relies heavily on patients’ memory and candor. Using social media data in these studies may address these limitations. Methods We annotated 9830 tweets mentioning medications and trained a convolutional neural network (CNN) to find mentions of medication treatment changes, regardless of whether the change was recommended by a physician. We used active and transfer learning from 12 972 reviews we annotated from WebMD to address the class imbalance of our Twitter corpus. To validate our CNN and explore future directions, we annotated 1956 positive tweets as to whether they reflect nonadherence and categorized the reasons given. Results Our CNN achieved 0.50 F1-score on this new corpus. The manual analysis of positive tweets revealed that nonadherence is evident in a subset with 9 categories of reasons for nonadherence. Conclusion We showed that social media users publicly discuss medication treatment changes and may explain their reasons, including when the change constitutes nonadherence. This approach may be useful to supplement current efforts in adherence monitoring.
Affiliation(s)
- Davy Weissenbacher
  - Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Suyu Ge
  - Department of Electronic Engineering, Tsinghua University, Beijing, China
- Ari Klein
  - Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Karen O'Connor
  - Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Robert Gross
  - Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Sean Hennessy
  - Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA

16
Magge A, Tutubalina E, Miftahutdinov Z, Alimova I, Dirkson A, Verberne S, Weissenbacher D, Gonzalez-Hernandez G. DeepADEMiner: a deep learning pharmacovigilance pipeline for extraction and normalization of adverse drug event mentions on Twitter. J Am Med Inform Assoc 2021; 28:2184-2192. [PMID: 34270701 PMCID: PMC8449608 DOI: 10.1093/jamia/ocab114] [Received: 01/20/2021] [Revised: 05/20/2021] [Accepted: 06/08/2021] [Indexed: 11/17/2022]
Abstract
Objective Research on pharmacovigilance from social media data has focused on mining adverse drug events (ADEs) using annotated datasets, with publications generally focusing on 1 of 3 tasks: ADE classification, named entity recognition for identifying the span of ADE mentions, and ADE mention normalization to standardized terminologies. While the common goal of such systems is to detect ADE signals that can be used to inform public policy, it has been impeded largely by limited end-to-end solutions for large-scale analysis of social media reports for different drugs. Materials and Methods We present a dataset for training and evaluation of ADE pipelines where the ADE distribution is closer to the average ‘natural balance’ with ADEs present in about 7% of the tweets. The deep learning architecture involves an ADE extraction pipeline with individual components for all 3 tasks. Results The system presented achieved state-of-the-art performance on comparable datasets and scored a classification performance of F1 = 0.63, span extraction performance of F1 = 0.44 and an end-to-end entity resolution performance of F1 = 0.34 on the presented dataset. Discussion The performance of the models continues to highlight multiple challenges when deploying pharmacovigilance systems that use social media data. We discuss the implications of such models in the downstream tasks of signal detection and suggest future enhancements. Conclusion Mining ADEs from Twitter posts using a pipeline architecture requires the different components to be trained and tuned based on input data imbalance in order to ensure optimal performance on the end-to-end resolution task.
Affiliation(s)
- Arjun Magge
  - DBEI, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Davy Weissenbacher
  - DBEI, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA

17
Magge A, Weissenbacher D, O'Connor K, Tahsin T, Gonzalez-Hernandez G, Scotch M. GeoBoost2: a natural language processing pipeline for GenBank metadata enrichment for virus phylogeography. Bioinformatics 2021; 36:5120-5121. [PMID: 32683454 PMCID: PMC7755405 DOI: 10.1093/bioinformatics/btaa647] [Received: 03/27/2020] [Revised: 07/03/2020] [Accepted: 07/13/2020] [Indexed: 12/27/2022]
Abstract
Summary We present GeoBoost2, a natural language processing pipeline for extracting the location of infected hosts to enrich metadata in nucleotide sequence repositories, such as the National Center for Biotechnology Information’s GenBank, for downstream analysis including phylogeography and genomic epidemiology. The increasing number of pathogen sequences requires complementary information extraction methods for focused research, including surveillance within countries and between borders. In this article, we describe the enhancements from our earlier release including improvement in end-to-end extraction performance and speed, availability of a fully functional web-interface and state-of-the-art methods for location extraction using deep learning. Availability and implementation Application is freely available on the web at https://zodo.asu.edu/geoboost2. Source code, usage examples and annotated data for GeoBoost2 are freely available at https://github.com/ZooPhy/geoboost2. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Arjun Magge
  - College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA
  - Biodesign Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Davy Weissenbacher
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Karen O'Connor
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Tasnia Tahsin
  - College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA
- Graciela Gonzalez-Hernandez
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Matthew Scotch
  - College of Health Solutions, Arizona State University, Phoenix, AZ 85004, USA
  - Biodesign Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA

18
Weissenbacher D, Sarker A, Klein A, O'Connor K, Magge A, Gonzalez-Hernandez G. Deep neural networks ensemble for detecting medication mentions in tweets. J Am Med Inform Assoc 2021; 26:1618-1626. [PMID: 31562510 PMCID: PMC6857507 DOI: 10.1093/jamia/ocz156] [Received: 03/28/2019] [Revised: 07/26/2019] [Accepted: 08/13/2019] [Indexed: 11/12/2022]
Abstract
OBJECTIVE Twitter posts are now recognized as an important source of patient-generated data, providing unique insights into population health. A fundamental step toward incorporating Twitter data in pharmacoepidemiologic research is to automatically recognize medication mentions in tweets. Given that lexical searches for medication names suffer from low recall due to misspellings or ambiguity with common words, we propose a more advanced method to recognize them. MATERIALS AND METHODS We present Kusuri, an Ensemble Learning classifier able to identify tweets mentioning drug products and dietary supplements. Kusuri (薬, "medication" in Japanese) is composed of 2 modules: first, 4 different classifiers (lexicon based, spelling variant based, pattern based, and a weakly trained neural network) are applied in parallel to discover tweets potentially containing medication names; second, an ensemble of deep neural networks encoding morphological, semantic, and long-range dependencies of important words in the tweets makes the final decision. RESULTS On a class-balanced (50-50) corpus of 15 005 tweets, Kusuri demonstrated performances close to human annotators with an F1 score of 93.7%, the best score achieved thus far on this corpus. On a corpus made of all tweets posted by 112 Twitter users (98 959 tweets, with only 0.26% mentioning medications), Kusuri obtained an F1 score of 78.8%. To the best of our knowledge, Kusuri is the first system to achieve this score on such an extremely imbalanced dataset. CONCLUSIONS The system identifies tweets mentioning drug names with performance high enough to ensure its usefulness, and is ready to be integrated in pharmacovigilance, toxicovigilance, or more generally, public health pipelines that depend on medication name mentions.
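The recall problem with exact lexical search that motivates this work can be illustrated with a small sketch. The code below is not Kusuri's implementation; it is a hypothetical candidate filter that pairs a tiny made-up lexicon with fuzzy string matching from Python's standard `difflib` module, so that a misspelled drug name still surfaces as a candidate. The lexicon, the function names, and the 0.85 similarity cutoff are all assumptions chosen for illustration, and the sketch does nothing about the second problem the abstract names, ambiguity with common words.

```python
import difflib
import re

# Hypothetical mini-lexicon; a real system would load a full drug vocabulary.
LEXICON = ["ibuprofen", "metformin", "atorvastatin", "sertraline"]


def tokenize(tweet: str) -> list[str]:
    """Lowercase a tweet and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", tweet.lower())


def candidate_mentions(tweet: str, cutoff: float = 0.85) -> list[tuple[str, str]]:
    """Return (token, lexicon entry) pairs for exact or near-exact matches."""
    hits = []
    for token in tokenize(tweet):
        # get_close_matches returns the best lexicon entries whose
        # similarity ratio to the token is at least `cutoff`.
        match = difflib.get_close_matches(token, LEXICON, n=1, cutoff=cutoff)
        if match:
            hits.append((token, match[0]))
    return hits


print(candidate_mentions("took some ibuprofin for my headache"))
print(candidate_mentions("what a lovely day"))
```

A filter of this kind only proposes candidates; in a two-stage design such as the one the abstract describes, a separate classifier would still have to decide whether each candidate tweet really mentions a medication.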
Affiliation(s)
- Davy Weissenbacher
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Abeed Sarker
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Ari Klein
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Karen O'Connor
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Arjun Magge
  - Biodesign Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, Arizona, USA
- Graciela Gonzalez-Hernandez
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA

19
Klein AZ, Magge A, O'Connor K, Flores Amaro JI, Weissenbacher D, Gonzalez Hernandez G. Toward Using Twitter for Tracking COVID-19: A Natural Language Processing Pipeline and Exploratory Data Set. J Med Internet Res 2021; 23:e25314. [PMID: 33449904 PMCID: PMC7834613 DOI: 10.2196/25314] [Received: 10/27/2020] [Revised: 12/14/2020] [Accepted: 12/14/2020] [Indexed: 12/29/2022]
Abstract
BACKGROUND In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results present challenges for actively monitoring its spread based on testing alone. OBJECTIVE The objective of this study was to develop, evaluate, and deploy an automatic natural language processing pipeline to collect user-generated Twitter data as a complementary resource for identifying potential cases of COVID-19 in the United States that are not based on testing and, thus, may not have been reported to the Centers for Disease Control and Prevention. METHODS Beginning January 23, 2020, we collected English tweets from the Twitter Streaming application programming interface that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out "reported speech" (eg, quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8976 tweets that are geo-tagged or have profile location metadata, distinguishing tweets that self-report potential cases of COVID-19 from those that do not. We used the annotated tweets to train and evaluate deep neural network classifiers based on bidirectional encoder representations from transformers (BERT). Finally, we deployed the automatic pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020. RESULTS Interannotator agreement, based on dual annotations for 3644 (41%) of the 8976 tweets, was 0.77 (Cohen κ). A deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases of COVID-19. 
Upon deploying our automatic pipeline, we identified 13,714 tweets that self-report potential cases of COVID-19 and have US state-level geolocations. CONCLUSIONS We have made the 13,714 tweets identified in this study, along with each tweet's time stamp and US state-level geolocation, publicly available to download. This data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
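The two evaluation figures reported above, Cohen κ for interannotator agreement and F1 from precision and recall, can be computed as in the following sketch. The function names and the ten toy annotations are hypothetical and only illustrate the formulas; they are not the study's data or code.

```python
from collections import Counter


def cohen_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy annotations (hypothetical): 1 = self-report, 0 = not a self-report.
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 0, 1, 1, 0]

kappa = cohen_kappa(annotator_1, annotator_2)
f1 = f1_from_pr(0.76, 0.76)  # precision and recall as reported in the abstract

print(f"kappa={kappa:.2f} F1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, equal precision and recall of 0.76 necessarily give an F1 of 0.76, which matches the figure in the abstract.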
Affiliation(s)
- Ari Z Klein
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Arjun Magge
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Karen O'Connor
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Jesus Ivan Flores Amaro
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Davy Weissenbacher
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Graciela Gonzalez Hernandez
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States

20
Weissenbacher D, O'Connor K, Hiraki AT, Kim JD, Gonzalez-Hernandez G. An empirical evaluation of electronic annotation tools for Twitter data. Genomics Inform 2020; 18:e24. [PMID: 32634878 PMCID: PMC7362942 DOI: 10.5808/gi.2020.18.2.e24] [Received: 03/29/2020] [Accepted: 06/16/2020] [Indexed: 11/30/2022]
Abstract
Despite a growing number of natural language processing shared tasks dedicated to the use of Twitter data, there is currently no annotation tool designed specifically for the purpose. During the 6th edition of the Biomedical Linked Annotation Hackathon (BLAH), after a short review of 19 generic annotation tools, we adapted GATE and TextAE for annotating Twitter timelines. Although none of the tools reviewed allows the annotation of all information inherent in Twitter timelines, a few may be suitable provided annotators are willing to compromise on some functionality.
Affiliation(s)
- Davy Weissenbacher
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Karen O'Connor
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Aiko T Hiraki
  - Database Center for Life Science, Research Organization of Information and Systems, Kashiwa, Chiba 277-0871, Japan
- Jin-Dong Kim
  - Database Center for Life Science, Research Organization of Information and Systems, Kashiwa, Chiba 277-0871, Japan
- Graciela Gonzalez-Hernandez
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA

21
Golder S, Klein AZ, Magge A, O'Connor K, Cai H, Weissenbacher D, Gonzalez-Hernandez G. Extending A Chronological and Geographical Analysis of Personal Reports of COVID-19 on Twitter to England, UK. medRxiv 2020:2020.05.05.20083436. [PMID: 32511492 PMCID: PMC7273260 DOI: 10.1101/2020.05.05.20083436] [Indexed: 11/30/2022]
Abstract
The rapidly evolving COVID-19 pandemic presents challenges for actively monitoring its transmission. In this study, we extend a social media mining approach used in the US to automatically identify personal reports of COVID-19 on Twitter in England, UK. The findings indicate that a natural language processing and machine learning framework could help provide an early indication of the chronological and geographical distribution of COVID-19 in England.
Affiliation(s)
- S Golder
- Department of Health Sciences, University of York, York YO10 5DD, UK
- Ari Z Klein
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Arjun Magge
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Karen O'Connor
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Haitao Cai
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Davy Weissenbacher
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Graciela Gonzalez-Hernandez
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
22
Klein AZ, Magge A, O'Connor K, Cai H, Weissenbacher D, Gonzalez-Hernandez G. A Chronological and Geographical Analysis of Personal Reports of COVID-19 on Twitter. [PMID: 32511608 PMCID: PMC7276035 DOI: 10.1101/2020.04.19.20069948] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 01/24/2023]
Abstract
The rapidly evolving outbreak of COVID-19 presents challenges for actively monitoring its spread. In this study, we assessed a social media mining approach for automatically analyzing the chronological and geographical distribution of users in the United States reporting personal information related to COVID-19 on Twitter. The results suggest that our natural language processing and machine learning framework could help provide an early indication of the spread of COVID-19.
23
Klein AZ, Cai H, Weissenbacher D, Levine LD, Gonzalez-Hernandez G. A natural language processing pipeline to advance the use of Twitter data for digital epidemiology of adverse pregnancy outcomes. J Biomed Inform 2020; 112S:100076. [PMID: 34417007 DOI: 10.1016/j.yjbinx.2020.100076] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/22/2020] [Revised: 06/30/2020] [Accepted: 07/27/2020] [Indexed: 10/23/2022]
Abstract
BACKGROUND In the United States, 17% of pregnancies end in fetal loss: miscarriage or stillbirth. Preterm birth affects 10% of live births in the United States and is the leading cause of neonatal death globally. Preterm births with low birthweight are the second leading cause of infant mortality in the United States. Despite their prevalence, the causes of miscarriage, stillbirth, and preterm birth are largely unknown. OBJECTIVE The primary objectives of this study are to (1) assess whether women report miscarriage, stillbirth, and preterm birth, among others, on Twitter, and (2) develop natural language processing (NLP) methods to automatically identify users from which to select cases for large-scale observational studies. METHODS We handcrafted regular expressions to retrieve tweets that mention an adverse pregnancy outcome, from a database containing more than 400 million publicly available tweets posted by more than 100,000 users who have announced their pregnancy on Twitter. Two annotators independently annotated 8109 (one random tweet per user) of the 22,912 retrieved tweets, distinguishing those reporting that the user has personally experienced the outcome ("outcome" tweets) from those that merely mention the outcome ("non-outcome" tweets). Inter-annotator agreement was κ = 0.90 (Cohen's kappa). We used the annotated tweets to train and evaluate feature-engineered and deep learning-based classifiers. We further annotated 7512 (of the 8109) tweets to develop a generalizable, rule-based module designed to filter out reported speech (that is, posts containing what was said by others) prior to automatic classification. We performed an extrinsic evaluation assessing whether the reported speech filter could improve the detection of women reporting adverse pregnancy outcomes on Twitter.
RESULTS The tweets annotated as "outcome" include 1632 women reporting miscarriage, 119 stillbirth, 749 preterm birth or premature labor, 217 low birthweight, 558 NICU admission, and 458 fetal/infant loss in general. A deep neural network, BERT-based classifier achieved the highest overall F1-score (0.88) for automatically detecting "outcome" tweets (precision = 0.87, recall = 0.89), with an F1-score of at least 0.82 and a precision of at least 0.84 for each of the adverse pregnancy outcomes. Our reported speech filter significantly (P < 0.05) improved the accuracy of Logistic Regression (from 78.0% to 80.8%) and majority voting-based ensemble (from 81.1% to 82.9%) classifiers. Although the filter did not improve the F1-score of the BERT-based classifier, it did improve precision, a trade-off against recall that may be acceptable for automated case selection of more prevalent outcomes. Without the filter, reported speech is one of the main sources of errors for the BERT-based classifier. CONCLUSION This study demonstrates that (1) women do report their adverse pregnancy outcomes on Twitter, (2) our NLP pipeline can automatically identify users from which to select cases for large-scale observational studies, and (3) our reported speech filter would reduce the cost of annotating health-related social media data and can significantly improve the overall performance of feature-based classifiers.
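The inter-annotator agreement quoted in this abstract (κ = 0.90) is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The sketch below shows one way the statistic could be computed; the function name `cohens_kappa` and the six toy labels are illustrative assumptions, not the study's code or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions,
    # summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical): the annotators disagree on one of six tweets.
annotator_1 = ["outcome", "outcome", "non-outcome",
               "non-outcome", "outcome", "non-outcome"]
annotator_2 = ["outcome", "non-outcome", "non-outcome",
               "non-outcome", "outcome", "non-outcome"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on five of six items (0.833), but because half of that agreement would be expected by chance, kappa is noticeably lower than the raw agreement.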
Affiliation(s)
- Ari Z Klein
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
- Haitao Cai
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
- Davy Weissenbacher
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
- Lisa D Levine
- Maternal and Child Health Research Center, Department of Obstetrics and Gynecology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
- Graciela Gonzalez-Hernandez
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
24
Klein AZ, Sarker A, Weissenbacher D, Gonzalez-Hernandez G. Towards scaling Twitter for digital epidemiology of birth defects. NPJ Digit Med 2019; 2:96. [PMID: 31583284 PMCID: PMC6773753 DOI: 10.1038/s41746-019-0170-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 05/22/2019] [Accepted: 08/12/2019] [Indexed: 11/13/2022]
Abstract
Social media has recently been used to identify and study a small cohort of Twitter users whose pregnancies with birth defect outcomes (the leading cause of infant mortality) could be observed via their publicly available tweets. In this study, we exploit social media on a larger scale by developing natural language processing (NLP) methods to automatically detect, among thousands of users, a cohort of mothers reporting that their child has a birth defect. We used 22,999 annotated tweets to train and evaluate supervised machine learning algorithms (feature-engineered and deep learning-based classifiers) that automatically distinguish tweets referring to the user's pregnancy outcome from tweets that merely mention birth defects. Because 90% of the tweets merely mention birth defects, we experimented with under-sampling and over-sampling approaches to address this class imbalance. An SVM classifier achieved the best performance for the two positive classes: an F1-score of 0.65 for the "defect" class and 0.51 for the "possible defect" class. We deployed the classifier on 20,457 unlabeled tweets that mention birth defects, which helped identify 542 additional users for potential inclusion in our cohort. Contributions of this study include (1) NLP methods for automatically detecting tweets by users reporting their birth defect outcomes, (2) findings that an SVM classifier can outperform a deep neural network-based classifier for highly imbalanced social media data, (3) evidence that automatic classification can be used to identify additional users for potential inclusion in our cohort, and (4) a publicly available corpus for training and evaluating supervised machine learning algorithms.
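The abstract mentions under-sampling as one way to address the 90% majority class but does not state the sampling ratio or procedure used. As a rough illustration only, the sketch below shows plain random under-sampling, which keeps every minority example and a random subset of the majority class of the same size. The function name `undersample`, the label strings, and the 1:1 target ratio are assumptions made for this example.

```python
import random


def undersample(examples, labels, majority_label, seed=0):
    """Randomly drop majority-class items until they match the minority count."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    # Keep all minority items and an equally sized random majority subset.
    kept_majority = rng.sample(majority, min(len(majority), len(minority)))
    keep = sorted(minority + kept_majority)
    return [examples[i] for i in keep], [labels[i] for i in keep]


# Toy data (hypothetical): 9 of 10 tweets merely mention a birth defect.
tweets = [f"tweet {i}" for i in range(10)]
tags = ["mention"] * 9 + ["defect"]
balanced_tweets, balanced_tags = undersample(tweets, tags, "mention")
print(balanced_tags)
```

Under-sampling discards training data, which is one reason it is usually compared against over-sampling, as the study did.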
Affiliation(s)
- Ari Z. Klein
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Abeed Sarker
- Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA, USA
- Davy Weissenbacher
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Graciela Gonzalez-Hernandez
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
25
Magge A, Weissenbacher D, Sarker A, Scotch M, Gonzalez-Hernandez G. Deep neural networks and distant supervision for geographic location mention extraction. Bioinformatics 2019; 34:i565-i573. [PMID: 29950020 PMCID: PMC6022665 DOI: 10.1093/bioinformatics/bty273] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/03/2022]
Abstract
Motivation Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER. Results Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER’s capability to embed external features to further boost the system’s performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.
Affiliation(s)
- Arjun Magge
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, USA; Biodesign Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, AZ, USA
- Davy Weissenbacher
- Department of Biostatistics, Epidemiology, and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Abeed Sarker
- Department of Biostatistics, Epidemiology, and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Matthew Scotch
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, USA; Biodesign Center for Environmental Health Engineering, Biodesign Institute, Arizona State University, Tempe, AZ, USA
- Graciela Gonzalez-Hernandez
- Department of Biostatistics, Epidemiology, and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
26
Tahsin T, Weissenbacher D, O'Connor K, Magge A, Scotch M, Gonzalez-Hernandez G. GeoBoost: accelerating research involving the geospatial metadata of virus GenBank records. Bioinformatics 2019; 34:1606-1608. [PMID: 29240889 DOI: 10.1093/bioinformatics/btx799] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 08/04/2017] [Accepted: 12/11/2017] [Indexed: 11/13/2022]
Abstract
Summary GeoBoost is a command-line software package developed to address sparse or incomplete metadata in GenBank sequence records that relate to the location of the infected host (LOIH) of viruses. Given a set of GenBank accession numbers corresponding to virus GenBank records, GeoBoost extracts, integrates and normalizes geographic information reflecting the LOIH of the viruses using integrated information from GenBank metadata and related full-text publications. In addition, to facilitate probabilistic geospatial modeling, GeoBoost assigns probability scores for each possible LOIH. Availability and implementation Binaries and resources required for running GeoBoost are packed into a single zipped file and freely available for download at https://tinyurl.com/geoboost. A video tutorial is included to help users quickly and easily install and run the software. The software is implemented in Java 1.8, and supported on MS Windows and Linux platforms. Contact gragon@upenn.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Tasnia Tahsin
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA
- Davy Weissenbacher
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA
- Karen O'Connor
- Institute of Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Arjun Magge
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA
- Matthew Scotch
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA; Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA
- Graciela Gonzalez-Hernandez
- Institute of Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
27
Scotch M, Tahsin T, Weissenbacher D, O'Connor K, Magge A, Vaiente M, Suchard MA, Gonzalez-Hernandez G. Incorporating sampling uncertainty in the geospatial assignment of taxa for virus phylogeography. Virus Evol 2019; 5:vey043. [PMID: 30838129 PMCID: PMC6395475 DOI: 10.1093/ve/vey043] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 01/04/2023]
Abstract
Discrete phylogeography using software such as BEAST considers the sampling location of each taxon as fixed, often to a single location without uncertainty. When studying viruses, this implies that there is no possibility that the location of the infected host for that taxon is somewhere else. Here, we relaxed this strong assumption and allowed for analytic integration of uncertainty for discrete virus phylogeography. We used automatic language processing methods to find and assign uncertainty to alternative potential locations. We considered two influenza case studies: H5N1 in Egypt; H1N1 pdm09 in North America. For each, we implemented scenarios in which 25 per cent of the taxa had different amounts of sampling uncertainty including 10, 30, and 50 per cent uncertainty and varied how it was distributed for each taxon. This includes scenarios that: (i) placed a specific amount of uncertainty on one location while uniformly distributing the remaining amount across all other candidate locations (correspondingly labeled 10, 30, and 50); (ii) assigned the remaining uncertainty to just one other location; thus ‘splitting’ the uncertainty among two locations (i.e. 10/90, 30/70, and 50/50); and (iii) eliminated uncertainty via two predefined heuristic approaches: assignment to a centroid location (CNTR) or the largest population in the country (POP). We compared all scenarios to a reference standard (RS) in which all taxa had known (absolutely certain) locations. From this, we implemented five random selections of 25 per cent of the taxa and used these for specifying uncertainty. We performed posterior analyses for each scenario, including: (a) virus persistence, (b) migration rates, (c) trunk rewards, and (d) the posterior probability of the root state. The scenarios with sampling uncertainty were closer to the RS than CNTR and POP.
For H5N1, the absolute error of virus persistence had a median range of 0.005–0.047 for scenarios with sampling uncertainty—(i) and (ii) above—versus a range of 0.063–0.075 for CNTR and POP. Persistence for the pdm09 case study followed a similar trend as did our analyses of migration rates across scenarios (i) and (ii). When considering the posterior probability of the root state, we found all but one of the H5N1 scenarios with sampling uncertainty had agreement with the RS on the origin of the outbreak whereas both CNTR and POP disagreed. Our results suggest that assigning geospatial uncertainty to taxa benefits estimation of virus phylogeography as compared to ad-hoc heuristics. We also found that, in general, there was limited difference in results regardless of how the sampling uncertainty was assigned; uniform distribution or split between two locations did not greatly impact posterior results. This framework is available in BEAST v.1.10. In future work, we will explore viruses beyond influenza. We will also develop a web interface for researchers to use our language processing methods to find and assign uncertainty to alternative potential locations for virus phylogeography.
Affiliation(s)
- Matthew Scotch
- College of Health Solutions, Arizona State University, 550 N. 3rd St., Phoenix, AZ, USA; Biodesign Center for Environmental Health Engineering, Arizona State University, 727 E. Tyler St, Tempe, AZ, USA
- Tasnia Tahsin
- College of Health Solutions, Arizona State University, 550 N. 3rd St., Phoenix, AZ, USA
- Davy Weissenbacher
- Department of Biostatistics, Epidemiology, and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 423 Guardian Drive, Philadelphia, PA, USA
- Karen O'Connor
- Department of Biostatistics, Epidemiology, and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 423 Guardian Drive, Philadelphia, PA, USA
- Arjun Magge
- College of Health Solutions, Arizona State University, 550 N. 3rd St., Phoenix, AZ, USA; Biodesign Center for Environmental Health Engineering, Arizona State University, 727 E. Tyler St, Tempe, AZ, USA
- Matteo Vaiente
- College of Health Solutions, Arizona State University, 550 N. 3rd St., Phoenix, AZ, USA; Biodesign Center for Environmental Health Engineering, Arizona State University, 727 E. Tyler St, Tempe, AZ, USA
- Marc A Suchard
- Department of Biomathematics, David Geffen School of Medicine, University of California, Los Angeles, 621 Charles E. Young Dr. South, Los Angeles, CA, USA; Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, 695 Charles E. Young Dr. South, Los Angeles, CA, USA; Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, 650 Charles E Young Dr. South, Los Angeles, CA, USA
- Graciela Gonzalez-Hernandez
- Department of Biostatistics, Epidemiology, and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 423 Guardian Drive, Philadelphia, PA, USA
28
Magge A, Weissenbacher D, Sarker A, Scotch M, Gonzalez-Hernandez G. Bi-directional Recurrent Neural Network Models for Geographic Location Extraction in Biomedical Literature. Pac Symp Biocomput 2019; 24:100-111. [PMID: 30864314 PMCID: PMC6417823] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Abstract
Phylogeography research involving virus spread and tree reconstruction relies on accurate geographic locations of infected hosts. The insufficient level of geographic information in nucleotide sequence repositories such as GenBank motivates the use of natural language processing methods for extracting geographic location names (toponyms) from the scientific article associated with the sequence, and disambiguating the locations to their coordinates. In this paper, we present an extensive study of multiple recurrent neural network architectures for the task of extracting geographic locations and their effective contribution to the disambiguation task using population heuristics. The methods presented in this paper achieve a strict detection F1 score of 0.94, a disambiguation accuracy of 91%, and an overall resolution F1 score of 0.88, all significantly higher than previously developed methods, improving our capability to find the location of infected hosts and enrich metadata information.
Affiliation(s)
- Arjun Magge
- College of Health Solutions, Arizona State University, Tempe, AZ 85281, USA
- Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA
- Davy Weissenbacher
- Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Abeed Sarker
- Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Matthew Scotch
- College of Health Solutions, Arizona State University, Tempe, AZ 85281, USA
- Biodesign Center for Environmental Health Engineering, Arizona State University, Tempe, AZ 85281, USA
- Graciela Gonzalez-Hernandez
- Department of Biostatistics, Epidemiology and Informatics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
29
Klein AZ, Sarker A, Cai H, Weissenbacher D, Gonzalez-Hernandez G. Social media mining for birth defects research: A rule-based, bootstrapping approach to collecting data for rare health-related events on Twitter. J Biomed Inform 2018; 87:68-78. [PMID: 30292855 PMCID: PMC6295660 DOI: 10.1016/j.jbi.2018.10.001] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 04/12/2018] [Revised: 09/26/2018] [Accepted: 10/03/2018] [Indexed: 10/28/2022]
Abstract
BACKGROUND Although birth defects are the leading cause of infant mortality in the United States, methods for observing human pregnancies with birth defect outcomes are limited. OBJECTIVE The primary objectives of this study were (i) to assess whether rare health-related events (in this case, birth defects) are reported on social media, (ii) to design and deploy a natural language processing (NLP) approach for collecting such sparse data from social media, and (iii) to utilize the collected data to discover a cohort of women whose pregnancies with birth defect outcomes could be observed on social media for epidemiological analysis. METHODS To assess whether birth defects are mentioned on social media, we mined 432 million tweets posted by 112,647 users who were automatically identified via their public announcements of pregnancy on Twitter. To retrieve tweets that mention birth defects, we developed a rule-based, bootstrapping approach, which relies on a lexicon, lexical variants generated from the lexicon entries, regular expressions, post-processing, and manual analysis guided by distributional properties. To identify users whose pregnancies with birth defect outcomes could be observed for epidemiological analysis, inclusion criteria were (i) tweets indicating that the user's child has a birth defect, and (ii) accessibility to the user's tweets during pregnancy. We conducted a semi-automatic evaluation to estimate the recall of the tweet-collection approach, and performed a preliminary assessment of the prevalence of selected birth defects among the pregnancy cohort derived from Twitter. RESULTS We manually annotated 16,822 retrieved tweets, distinguishing tweets indicating that the user's child has a birth defect (true positives) from tweets that merely mention birth defects (false positives). Inter-annotator agreement was substantial: κ = 0.79 (Cohen's kappa).
Analyzing the timelines of the 646 users whose tweets were true positives resulted in the discovery of 195 users that met the inclusion criteria. Congenital heart defects are the most common type of birth defect reported on Twitter, consistent with findings in the general population. Based on an evaluation of 4169 tweets retrieved using alternative text mining methods, the recall of the tweet-collection approach was 0.95. CONCLUSIONS Our contributions include (i) evidence that rare health-related events are indeed reported on Twitter, (ii) a generalizable, systematic NLP approach for collecting sparse tweets, (iii) a semi-automatic method to identify undetected tweets (false negatives), and (iv) a collection of publicly available tweets by pregnant users with birth defect outcomes, which could be used for future epidemiological analysis. In future work, the annotated tweets could be used to train machine learning algorithms to automatically identify users reporting birth defect outcomes, enabling the large-scale use of social media mining as a complementary method for such epidemiological research.
Affiliation(s)
- Ari Z Klein
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
- Abeed Sarker
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
- Haitao Cai
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
- Davy Weissenbacher
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
- Graciela Gonzalez-Hernandez
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.
30
Weissenbacher D, Sarker A, Tahsin T, Scotch M, Gonzalez G. Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods. AMIA Jt Summits Transl Sci Proc 2017; 2017:114-122. [PMID: 28815119 PMCID: PMC5543364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]
Abstract
The field of phylogeography allows researchers to model the spread and evolution of viral genetic sequences. Phylogeography plays a major role in infectious disease surveillance, viral epidemiology and vaccine design. When conducting viral phylogeographic studies, researchers require the location of the infected host of the virus, which is often present in public databases such as GenBank. However, the geographic metadata in most GenBank records is not precise enough for many phylogeographic studies; therefore, researchers often need to search the articles linked to the records for more information, which can be a tedious process. Here, we describe two approaches for automatically detecting geographic location mentions in articles pertaining to virus-related GenBank records: a supervised sequence labeling approach with innovative features and a distant-supervision approach with novel noise-reduction methods. Evaluated on a manually annotated gold standard, our supervised sequence labeling and distant supervision approaches attained F-scores of 0.81 and 0.66, respectively.
Affiliation(s)
- Abeed Sarker
- University of Pennsylvania, Philadelphia, Pennsylvania, USA
31
Tahsin T, Weissenbacher D, Jones-Shargani D, Magee D, Vaiente M, Gonzalez G, Scotch M. Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research. Database (Oxford) 2017; 2017:4781736. [PMID: 30412219 PMCID: PMC6225896 DOI: 10.1093/database/bax093] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 06/03/2017] [Revised: 11/20/2017] [Accepted: 11/21/2017] [Indexed: 02/06/2023]
Abstract
Database URL: https://zodo.asu.edu/zoophydb/.
Collapse
Affiliation(s)
- Tasnia Tahsin
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Davy Weissenbacher
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Biodesign Center for Environmental Health Engineering, Arizona State University, 781 E Terrace Mall, Tempe, AZ 85281, USA
- Demetrius Jones-Shargani
- Biodesign Center for Environmental Health Engineering, Arizona State University, 781 E Terrace Mall, Tempe, AZ 85281, USA
- Daniel Magee
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Biodesign Center for Environmental Health Engineering, Arizona State University, 781 E Terrace Mall, Tempe, AZ 85281, USA
- Matteo Vaiente
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Biodesign Center for Environmental Health Engineering, Arizona State University, 781 E Terrace Mall, Tempe, AZ 85281, USA
- Graciela Gonzalez
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Institute of Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, PA 19104, USA
- Matthew Scotch
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Biodesign Center for Environmental Health Engineering, Arizona State University, 781 E Terrace Mall, Tempe, AZ 85281, USA
32
Tahsin T, Weissenbacher D, Rivera R, Beard R, Firago M, Wallstrom G, Scotch M, Gonzalez G. A high-precision rule-based extraction system for expanding geospatial metadata in GenBank records. J Am Med Inform Assoc 2016; 23:934-41. [PMID: 26911818 PMCID: PMC4997033 DOI: 10.1093/jamia/ocv172] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 07/21/2015] [Revised: 10/22/2015] [Accepted: 10/22/2015] [Indexed: 01/09/2023] Open
Abstract
OBJECTIVE The metadata reflecting the location of the infected host (LOIH) of virus sequences in GenBank often lacks specificity. This work seeks to enhance this metadata by extracting more specific geographic information from related full-text articles and mapping it to latitude/longitude coordinates using knowledge derived from external geographical databases. MATERIALS AND METHODS We developed a rule-based information extraction framework for linking GenBank records to the latitude/longitude of the LOIH. Our system first extracts existing geospatial metadata from GenBank records and attempts to improve it by seeking additional, relevant geographic information from text and tables in related full-text PubMed Central articles. The final extracted locations of the records, based on data assimilated from these sources, are then disambiguated and mapped to their respective geo-coordinates. We evaluated our approach on a manually annotated dataset comprising 5728 GenBank records for the influenza A virus. RESULTS We found the precision, recall, and f-measure of our system for linking GenBank records to the latitude/longitude of their LOIH to be 0.832, 0.967, and 0.894, respectively. DISCUSSION Our system had a high level of accuracy for linking GenBank records to the geo-coordinates of the LOIH. However, it can be further improved by expanding our database of geospatial data, incorporating spell correction, and enhancing the rules used for extraction. CONCLUSION Our system performs reasonably well for linking GenBank records for the influenza A virus to the geo-coordinates of their LOIH based on record metadata and information extracted from related full-text articles.
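The three scores reported in this abstract can be cross-checked against each other. The sketch below assumes the f-measure is the balanced F1 (the harmonic mean of precision and recall); the abstract does not say which variant was used, and the function name `f_measure` is only illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are 0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
reported_precision = 0.832
reported_recall = 0.967

f1 = f_measure(reported_precision, reported_recall)
print(round(f1, 3))  # 0.894, matching the reported f-measure
```

Under that assumption, the reported precision and recall reproduce the reported f-measure to three decimal places.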
Affiliation(s)
- Tasnia Tahsin
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Davy Weissenbacher
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Robert Rivera
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Rachel Beard
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Mari Firago
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Garrick Wallstrom
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Matthew Scotch
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
- Graciela Gonzalez
- Department of Biomedical Informatics, Arizona State University, 13212 E Shea Blvd, Scottsdale, AZ 85259, USA
33
Weissenbacher D, Tahsin T, Beard R, Figaro M, Rivera R, Scotch M, Gonzalez G. Knowledge-driven geospatial location resolution for phylogeographic models of virus migration. Bioinformatics 2015; 31:i348-56. [PMID: 26072502 PMCID: PMC4542781 DOI: 10.1093/bioinformatics/btv259] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 11/21/2022] Open
Abstract
Summary: Diseases caused by zoonotic viruses (viruses transmittable between humans and animals) are a major threat to public health throughout the world. By studying virus migration and mutation patterns, the field of phylogeography provides a valuable tool for improving their surveillance. A key component in phylogeographic analysis of zoonotic viruses involves identifying the specific locations of relevant viral sequences. This is usually accomplished by querying public databases such as GenBank and examining the geospatial metadata in the record. When sufficient detail is not available, a logical next step is for the researcher to conduct a manual survey of the corresponding published articles. Motivation: In this article, we present a system for detection and disambiguation of locations (toponym resolution) in full-text articles to automate the retrieval of sufficient metadata. Our system has been tested on a manually annotated corpus of journal articles related to phylogeography using integrated heuristics for location disambiguation including a distance heuristic, a population heuristic and a novel heuristic utilizing knowledge obtained from GenBank metadata (i.e. a 'metadata heuristic'). Results: For detecting and disambiguating locations, our system performed best using the metadata heuristic (0.54 Precision, 0.89 Recall and 0.68 F-score). Precision reaches 0.88 when examining only the disambiguation of location names. Our error analysis showed that a noticeable increase in the accuracy of toponym resolution is possible by improving the geospatial location detection. By improving these fundamental automated tasks, our system can be a useful resource to phylogeographers that rely on geospatial metadata of GenBank sequences. Contact: davy.weissenbacher@asu.edu
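Of the disambiguation heuristics named in this abstract, the population heuristic is the simplest to illustrate. The sketch below assumes the common formulation in which an ambiguous place name resolves to its most populous candidate; the abstract does not describe the authors' exact rule. The `Candidate` type, the function name, and all names and population values are hypothetical placeholders, not real gazetteer data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One possible referent for an ambiguous place name (hypothetical type)."""
    name: str
    country: str
    population: int  # placeholder value, not real gazetteer data


def resolve_by_population(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the most populous candidate, or None if there are no candidates."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.population)


# Three made-up places sharing one name.
candidates = [
    Candidate("Exampleville", "Country A", 1_000),
    Candidate("Exampleville", "Country B", 250_000),
    Candidate("Exampleville", "Country C", 40_000),
]

best = resolve_by_population(candidates)
print(best.country)  # Country B
```

A heuristic of this kind ignores document context entirely, which may be one reason a context-aware rule such as the metadata heuristic performed best in the reported evaluation.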
Affiliation(s)
- Davy Weissenbacher
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
- Tasnia Tahsin
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
- Rachel Beard
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
- Mari Figaro
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
- Robert Rivera
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
- Matthew Scotch
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
- Graciela Gonzalez
- Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ 85259, USA and Center for Environmental Security, Biodesign Institute, Arizona State University, Tempe, AZ 85287-5904, USA
34
Ananiadou S, Thompson P, Thomas J, Mu T, Oliver S, Rickinson M, Sasaki Y, Weissenbacher D, McNaught J. Supporting the education evidence portal via text mining. Philos Trans A Math Phys Eng Sci 2010; 368:3829-3844. [PMID: 20643679 PMCID: PMC2981997 DOI: 10.1098/rsta.2010.0152] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/29/2023]
Abstract
The UK Education Evidence Portal (eep) provides a single, searchable, point of access to the contents of the websites of 33 organizations relating to education, with the aim of revolutionizing work practices for the education community. Use of the portal alleviates the need to spend time searching multiple resources to find relevant information. However, the combined content of the websites of interest is still very large (over 500,000 documents and growing). This means that searches using the portal can produce very large numbers of hits. As users often have limited time, they would benefit from enhanced methods of performing searches and viewing results, allowing them to drill down to information of interest more efficiently, without having to sift through potentially long lists of irrelevant documents. The Joint Information Systems Committee (JISC)-funded ASSIST project has produced a prototype web interface to demonstrate the applicability of integrating a number of text-mining tools and methods into the eep, to facilitate an enhanced searching, browsing and document-viewing experience. New features include automatic classification of documents according to a taxonomy, automatic clustering of search results according to similar document content, and automatic identification and highlighting of key terms within documents.
Affiliation(s)
- Sophia Ananiadou
- School of Computer Science and National Centre for Text Mining, University of Manchester, 131 Princess Street, Manchester M1 7DN, UK.