1
Logan R, Wehe AW, Woods DC, Tilly J, Khrapko K. Interpreting Sequence-Levenshtein distance for determining error type and frequency between two embedded sequences of equal length. ArXiv 2023; arXiv:2310.12833v1. [PMID: 37904736; PMCID: PMC10614987]
Abstract
Levenshtein distance is a commonly used edit distance metric, typically applied in language processing, and to a lesser extent, in molecular biology analysis. Biological nucleic acid sequences are often embedded in longer sequences and are subject to insertion and deletion errors that introduce frameshift during sequencing. These frameshift errors are due to string context and should not be counted as true biological errors. Sequence-Levenshtein distance is a modification to Levenshtein distance that is permissive of frameshift error without additional penalty. However, in a biological context Levenshtein distance needs to accommodate both frameshift and weighted errors, which Sequence-Levenshtein distance cannot do. Errors are weighted when they are associated with a numerical cost that corresponds to their frequency of appearance. Here, we describe a modification that allows the use of Levenshtein distance and Sequence-Levenshtein distance to appropriately accommodate penalty-free frameshift between embedded sequences and correctly weight specific error types.
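The distinction between the two metrics can be sketched in code. The following is an illustrative implementation, not the authors' own: `levenshtein` builds the standard weighted dynamic-programming matrix, and `sequence_levenshtein` takes the minimum over the matrix border (following Buschmann and Bystrykh's Sequence-Levenshtein construction), so a frameshifted suffix of an embedded sequence incurs no extra penalty. The `sub`/`ins`/`dele` cost parameters stand in for the weighted error types discussed in the abstract.

```python
def levenshtein(a, b, sub=1, ins=1, dele=1):
    # Standard weighted Levenshtein distance; returns the full DP matrix
    # so that the border minimum can be taken afterwards.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + dele
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # deletion
                          d[i][j - 1] + ins,       # insertion
                          d[i - 1][j - 1] + cost)  # match/substitution
    return d

def sequence_levenshtein(a, b, **weights):
    # Sequence-Levenshtein: minimum over the last row and last column,
    # so trailing frameshift between embedded sequences is penalty-free.
    d = levenshtein(a, b, **weights)
    last_row = d[-1]
    last_col = [row[-1] for row in d]
    return min(min(last_row), min(last_col))
```

For example, `"ACG"` against `"ACGT"` has ordinary Levenshtein distance 1, but Sequence-Levenshtein distance 0, since the shorter string is an unmutated prefix of the longer one.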
Affiliation(s)
- Robert Logan
- Science and Technology Division, Biology and Bioinformatics Department, Eastern Nazarene College, Quincy, MA 02170
- Amy Wangsness Wehe
- Health and Natural Sciences Division, Mathematics Department, Fitchburg State University, Fitchburg, MA 01420-2697
- Dori C Woods
- College of Science, Department of Biology, Northeastern University, 330 Huntington Ave, Boston, MA 02115
- Jon Tilly
- College of Science, Department of Biology, Northeastern University, 330 Huntington Ave, Boston, MA 02115
- Konstantin Khrapko
- College of Science, Department of Biology, Northeastern University, 330 Huntington Ave, Boston, MA 02115
2
Speights Atkins M, Bailey DJ, Seals CD. Implementation of an automated grading tool for phonetic transcription training. Clin Linguist Phon 2023; 37:242-257. [PMID: 35380914; DOI: 10.1080/02699206.2022.2048314]
Abstract
Clinical phonetic transcription is regarded as a highly specialised skill requiring hours of practice to master. Although this skill is a critical part of students' clinical preparation to become speech-language pathologists, students often report feeling unprepared to apply it in clinical practice. Previous studies suggest that increased opportunities for practice and timely feedback on transcriptions are needed to develop skill confidence. However, providing more opportunities for practice can be impeded by the limited resources available to grade additional assignments. The purpose of this study is to show the implementation of a web-based learning management system (LMS) designed in our labs for phonetics instruction. The Automated Phonetic Transcription Grading Tool (APTgt LMS) was developed to provide a platform for assignment delivery and automated grading of transcription assignments. The APTgt LMS has three embedded IPA keyboards (basic, advanced, and full IPA) and an automated edit distance algorithm modified by phonetic alignment principles, which allows for individualised scoring and visual course-level feedback in an interactive online environment. For pilot testing, student confidence was queried before and after practice opportunities using APTgt. A concurrent mixed methods research design was used to analyse four Likert-scale and three open-ended questions. Student confidence in transcribing disordered speech was found to increase significantly (p < 0.001) following additional practice. Students reported concerns about accurately transcribing disordered speech and noted that additional practice is still needed. Tools like APTgt can aid in facilitating student learning and increasing student confidence in applied transcription.
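A minimal sketch of edit-distance-based transcription scoring, assuming a plain character-level Levenshtein distance: APTgt's grading algorithm additionally modifies the alignment by phonetic principles, which this sketch does not model, and the function names are illustrative.

```python
def edit_distance(a, b):
    # Iterative Levenshtein distance using two rows of the DP matrix.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match/substitution
        prev = cur
    return prev[-1]

def transcription_score(target, attempt):
    # Similarity in [0, 1]; 1.0 means the attempt matches the
    # target transcription exactly.
    if not target and not attempt:
        return 1.0
    return 1 - edit_distance(target, attempt) / max(len(target), len(attempt))
```

A one-symbol error in a three-symbol transcription (e.g. "kat" for "kæt") would score 2/3 under this scheme; phonetically informed weighting could penalise near-miss symbols less than arbitrary ones.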
Affiliation(s)
- Marisha Speights Atkins
- Roxelyn and Richard Pepper Department of Communication Sciences and Disorders, Northwestern University, Evanston, Illinois, USA
- Dallin J Bailey
- Department of Speech, Language, and Hearing Sciences, Auburn University, Auburn, Alabama, USA
- Cheryl D Seals
- Department of Computer Science and Software Engineering, Auburn University, Auburn, Alabama, USA
3
Soliman A, Rajasekaran S. FIRLA: A Fast Incremental Record Linkage Algorithm. J Biomed Inform 2022:104094. [PMID: 35550929; DOI: 10.1016/j.jbi.2022.104094]
Abstract
Record linkage is an important problem studied widely in many domains, including biomedical informatics. A standard version of this problem is to cluster records from several datasets such that each cluster has records pertinent to just one individual. Because datasets are typically huge, existing record linkage algorithms take a very long time, so it is essential to develop novel fast algorithms for record linkage. The incremental version of this problem is to link previously clustered records with new records added to the input datasets. A novel algorithm has been created to efficiently perform standard and incremental record linkage. This algorithm leverages a set of efficient techniques that significantly restrict the number of record pair comparisons and distance computations. Our algorithm shows an average speed-up of 2.4x (up to 4x) for the standard linkage problem compared to the state-of-the-art, with no drop in linkage performance. On average, our algorithm can incrementally link records in just 33% of the time required to link them from scratch. Our algorithm achieves comparable or superior linkage performance and outperforms the state-of-the-art in terms of linking time in all cases where the number of comparison attributes is greater than two, which is quite common in practice. The proposed algorithm is very efficient and could be used in practice for record linkage applications, especially when records are added over time and linkage output needs to be updated frequently.
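FIRLA's specific comparison-restriction techniques are not detailed in the abstract; the following sketches the general blocking idea that fast linkage algorithms build on, with a hypothetical blocking key (the `last`/`dob` fields and key construction are assumptions for illustration only).

```python
from collections import defaultdict
from itertools import combinations

def block_key(record):
    # Hypothetical blocking key: first two letters of the last name
    # plus birth year. Records that disagree here are never compared.
    return (record["last"][:2].lower(), record["dob"][:4])

def candidate_pairs(records):
    # Group records by blocking key and compare only within groups,
    # avoiding the full O(n^2) set of record-pair comparisons.
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for group in blocks.values():
        yield from combinations(group, 2)
```

Only the surviving candidate pairs would then go through the expensive distance computations, which is where schemes like FIRLA obtain their speed-ups.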
4
Bobroske K, Larish C, Cattrell A, Bjarnadóttir MV, Huan L. The bird's-eye view: A data-driven approach to understanding patient journeys from claims data. J Am Med Inform Assoc 2021; 27:1037-1045. [PMID: 32521006; DOI: 10.1093/jamia/ocaa052]
Abstract
OBJECTIVE In preference-sensitive conditions such as back pain, there can be high levels of variability in the trajectory of patient care. We sought to develop a methodology that extracts a realistic and comprehensive understanding of the patient journey using medical and pharmaceutical insurance claims data. MATERIALS AND METHODS We processed a sample of 10 000 patient episodes (comprised of 113 215 back pain-related claims) into strings of characters, where each letter corresponds to a distinct encounter with the healthcare system. We customized the Levenshtein edit distance algorithm to evaluate the level of similarity between each pair of episodes based on both their content (types of events) and ordering (sequence of events). We then used clustering to extract the main variations of the patient journey. RESULTS The algorithm resulted in 12 comprehensive and clinically distinct patterns (clusters) of patient journeys that represent the main ways patients are diagnosed and treated for back pain. We further characterized demographic and utilization metrics for each cluster and observed clear differentiation between the clusters in terms of both clinical content and patient characteristics. DISCUSSION Despite being a complex and often noisy data source, administrative claims provide a unique longitudinal overview of patient care across multiple service providers and locations. This methodology leverages claims to capture a data-driven understanding of how patients traverse the healthcare system. CONCLUSIONS When tailored to various conditions and patient settings, this methodology can provide accurate overviews of patient journeys and facilitate a shift toward high-quality practice patterns.
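As a sketch of the encoding-and-comparison step, assuming a hypothetical five-letter event alphabet and a plain (unweighted) Levenshtein distance in place of the paper's customized content-and-ordering metric:

```python
# Hypothetical event alphabet; the paper's actual encounter types differ.
ENCODING = {"primary_care": "P", "imaging": "I", "physical_therapy": "T",
            "injection": "J", "surgery": "S"}

def levenshtein(a, b):
    # Plain Levenshtein distance, two-row iteration.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def encode_episode(events):
    # One character per healthcare encounter, in chronological order.
    return "".join(ENCODING[e] for e in events)

def distance_matrix(episodes):
    # Pairwise dissimilarities, ready to feed a conventional
    # clustering step (the paper's final stage).
    strings = [encode_episode(e) for e in episodes]
    n = len(strings)
    d = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = levenshtein(strings[i], strings[j])
    return d
```

The resulting matrix captures both which events occurred and in what order, which is what lets clustering separate, say, conservative-care journeys from surgical ones.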
Affiliation(s)
- Katherine Bobroske
- Cambridge Centre for Health and Leadership Enterprise, University of Cambridge, Cambridge, United Kingdom
- Christine Larish
- Research and Development, Evolent Health, Arlington, Virginia, USA
- Anita Cattrell
- Research and Development, Evolent Health, Arlington, Virginia, USA
- Lawrence Huan
- Cambridge Centre for Health and Leadership Enterprise, University of Cambridge, Cambridge, United Kingdom
5
Kate RJ. Clinical Term Normalization Using Learned Edit Patterns and Subconcept Matching: System Development and Evaluation. JMIR Med Inform 2021; 9:e23104. [PMID: 33443483; PMCID: PMC7843202; DOI: 10.2196/23104]
Abstract
BACKGROUND Clinical terms mentioned in clinical text are often not in the standardized forms listed in clinical terminologies because of linguistic and stylistic variations. However, many automated downstream applications require clinical terms to be mapped to their corresponding concepts in clinical terminologies, necessitating the task of clinical term normalization. OBJECTIVE In this paper, a system for clinical term normalization is presented that utilizes edit patterns to convert clinical terms into their normalized forms. METHODS The edit patterns are automatically learned from the Unified Medical Language System (UMLS) Metathesaurus as well as from the given training data. They are generalized sequences of edits derived from edit distance computations, operate at both the character and the word level, and are learned separately for different semantic types. In addition to these edit patterns, the system also normalizes clinical terms through the subconcepts mentioned within them. RESULTS The system was evaluated as part of the 2019 n2c2 Track 3 shared task of clinical term normalization and obtained 80.79% accuracy on the standard test data. This paper includes ablation studies to evaluate the contributions of different components of the system. A challenging part of the task was disambiguation when a clinical term could be normalized to multiple concepts. CONCLUSIONS The learned edit patterns led the system to perform well on the normalization task. Because the system is based on patterns, it is human interpretable and can also give insights about common variations of clinical terms in clinical text that differ from their standardized forms.
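One simplified way to picture a learned edit pattern is a suffix-rewrite rule derived from a (variant, normalized) training pair. The paper's patterns are richer (character and word level, generalized, per semantic type), so this is only an illustrative sketch with hypothetical helper names.

```python
def suffix_rule(term, normalized):
    # Derive a suffix-rewrite rule from one (variant, normalized) pair
    # by stripping their longest common prefix, e.g.
    # ("anemias", "anemia") -> ("s", "").
    i = 0
    while i < min(len(term), len(normalized)) and term[i] == normalized[i]:
        i += 1
    return (term[i:], normalized[i:])

def apply_rules(term, rules):
    # Apply learned suffix rules to an unseen term; each matching rule
    # yields one candidate normalized form for later disambiguation.
    candidates = []
    for old, new in rules:
        if old and term.endswith(old):
            candidates.append(term[: len(term) - len(old)] + new)
    return candidates
```

A rule learned from one pair then generalizes to unseen terms sharing the same variation, which is what makes the pattern-based approach human interpretable.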
Affiliation(s)
- Rohit J Kate
- Department of Computer Science, University of Wisconsin-Milwaukee, Milwaukee, WI, United States
6
Johns H, Hearne J, Bernhardt J, Churilov L. Clustering clinical and health care processes using a novel measure of dissimilarity for variable-length sequences of ordinal states. Stat Methods Med Res 2020; 29:3059-3075. [PMID: 32297567; DOI: 10.1177/0962280220917174]
Abstract
Clinical and health care processes are often summarised through sequences of ordinal data describing a patient's state over time. Identifying patterns in these sequences can provide valuable insights into patient progression trajectories for the purposes of clinical monitoring and quality assurance. However, both the variation in the length of each sequence and the ordinal nature of observable states present challenges to pattern identification. In this paper, we address these challenges by presenting a novel measure of dissimilarity for comparing two or more variable-length ordinal sequences that can be used in conjunction with conventional clustering methods to identify patterns in patient progression trajectories. We provide practical guidance on how this can be achieved, and demonstrate it in the context of identifying patterns in post-stroke recovery trajectories.
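The measure itself is not given in the abstract. As one illustrative choice (not the authors' measure), an edit distance whose substitution cost scales with the ordinal gap between states addresses both challenges at once: insertions and deletions absorb length differences, and graded substitution costs respect state ordering.

```python
def ordinal_dissimilarity(a, b, n_states):
    # Weighted edit distance over ordinal states 0..n_states-1:
    # substituting state i for state j costs |i - j| / (n_states - 1),
    # insertions and deletions cost 1. An illustrative sketch only.
    m, n = len(a), len(b)
    prev = [float(j) for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [float(i)] + [0.0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + abs(a[i - 1] - b[j - 1]) / (n_states - 1)
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[n]
```

Under this scheme, substituting adjacent states costs little while swapping extreme states costs as much as a full insertion, so sequences tracing similar recovery shapes stay close even when their lengths differ.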
Affiliation(s)
- Hannah Johns
- Center for Research Excellence in Stroke Rehabilitation, Florey Institute of Neuroscience and Mental Health, Heidelberg, Australia; School of Science, RMIT University, Melbourne, Australia
- John Hearne
- School of Science, RMIT University, Melbourne, Australia
- Julie Bernhardt
- Center for Research Excellence in Stroke Rehabilitation, Florey Institute of Neuroscience and Mental Health, Heidelberg, Australia
- Leonid Churilov
- Center for Research Excellence in Stroke Rehabilitation, Florey Institute of Neuroscience and Mental Health, Heidelberg, Australia; Melbourne Medical School, University of Melbourne, Melbourne, Australia
7
Abstract
Problems of genome rearrangement are central in both evolution and cancer. Most evolutionary scenarios have been studied under the assumption that the genome contains a single copy of each gene. In contrast, tumor genomes undergo deletions and duplications, and thus, the number of copies of genes varies. The number of copies of each segment along a chromosome is called its copy number profile (CNP). Understanding CNP changes can assist in predicting disease progression and treatment. To date, questions related to distances between CNPs gained little scientific attention. Here we focus on the following fundamental problem, introduced by Schwarz et al.: given two CNPs, u and v, compute the minimum number of operations transforming u into v, where the edit operations are segmental deletions and amplifications. We establish the computational complexity of this problem, showing that it is solvable in linear time and constant space.
Affiliation(s)
- Ron Zeira
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
- Meirav Zehavi
- Department of Informatics, University of Bergen, Bergen, Norway
- Ron Shamir
- Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
8
Brejová B, Kravec M, Landau GM, Vinař T. Fast computation of a string duplication history under no-breakpoint-reuse. Philos Trans A Math Phys Eng Sci 2014; 372:20130133. [PMID: 24751867; PMCID: PMC3996574; DOI: 10.1098/rsta.2013.0133]
Abstract
In this paper, we provide an O(n log² n log log n log* n) algorithm to compute a duplication history of a string under the no-breakpoint-reuse condition. The motivation for this problem stems from computational biology, in particular from the analysis of complex gene clusters. The problem is also related to computing edit distance with block operations, but in our scenario the start of the history is not fixed; it is chosen to minimize the distance measure.
Affiliation(s)
- Broňa Brejová
- Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynská dolina, 842 48 Bratislava, Slovakia
- Martin Kravec
- Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynská dolina, 842 48 Bratislava, Slovakia
- Gad M. Landau
- Department of Computer Science, University of Haifa, Haifa 31905, Israel
- Department of Computer Science and Engineering, NYU-Poly, Six MetroTech Center, Brooklyn, NY 11201-3840, USA
- Tomáš Vinař
- Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynská dolina, 842 48 Bratislava, Slovakia