1
Hughes LD, Tsueng G, DiGiovanna J, Horvath TD, Rasmussen LV, Savidge TC, Stoeger T, Turkarslan S, Wu Q, Wu C, Su AI, Pache L. Addressing barriers in FAIR data practices for biomedical data. Sci Data 2023; 10:98. [PMID: 36823198 PMCID: PMC9950056 DOI: 10.1038/s41597-023-01969-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 10/12/2022] [Accepted: 01/13/2023] [Indexed: 02/25/2023] Open
Affiliation(s)
- Laura D Hughes
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Ginger Tsueng
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Jack DiGiovanna
  - Velsera, 529 Main St, Suite 6610, Charlestown, MA, 02129, USA
- Thomas D Horvath
  - Department of Pathology & Immunology, Baylor College of Medicine, Houston, TX, 77030, USA
  - Texas Children's Microbiome Center, Department of Pathology, Texas Children's Hospital, Houston, TX, 77030, USA
- Luke V Rasmussen
  - Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Tor C Savidge
  - Department of Pathology & Immunology, Baylor College of Medicine, Houston, TX, 77030, USA
  - Texas Children's Microbiome Center, Texas Children's Hospital, Houston, TX, 77030, USA
- Thomas Stoeger
  - Department of Chemical and Biological Engineering, McCormick School of Engineering, Evanston, IL, 60208, USA
- Qinglong Wu
  - Department of Pathology & Immunology, Baylor College of Medicine, Houston, TX, 77030, USA
  - Texas Children's Microbiome Center, Texas Children's Hospital, Houston, TX, 77030, USA
- Chunlei Wu
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
  - Scripps Research Translational Institute, La Jolla, CA, 92037, USA
  - Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Andrew I Su
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
  - Scripps Research Translational Institute, La Jolla, CA, 92037, USA
  - Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Lars Pache
  - Infectious and Inflammatory Disease Center, Immunity and Pathogenesis Program, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, 92037, USA
2
Tsueng G, Cano MAA, Bento J, Czech C, Kang M, Pache L, Rasmussen LV, Savidge TC, Starren J, Wu Q, Xin J, Yeaman MR, Zhou X, Su AI, Wu C, Brown L, Shabman RS, Hughes LD. Developing a standardized but extendable framework to increase the findability of infectious disease datasets. Sci Data 2023; 10:99. [PMID: 36823157 PMCID: PMC9950378 DOI: 10.1038/s41597-023-01968-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/12/2022] [Accepted: 01/13/2023] [Indexed: 02/25/2023] Open
Abstract
Biomedical datasets are increasing in size, stored in many repositories, and face challenges in FAIRness (findability, accessibility, interoperability, reusability). As a Consortium of infectious disease researchers from 15 Centers, we aim to adopt open science practices to promote transparency, encourage reproducibility, and accelerate research advances through data reuse. To improve FAIRness of our datasets and computational tools, we evaluated metadata standards across established biomedical data repositories. The vast majority do not adhere to a single standard, such as Schema.org, which is widely adopted by generalist repositories. Consequently, datasets in these repositories are not findable in aggregation projects like Google Dataset Search. We alleviated this gap by creating a reusable metadata schema based on Schema.org and catalogued nearly 400 datasets and computational tools we collected. The approach is easily reusable to create schemas that are interoperable with community standards but customized to a particular context. Our approach enabled data discovery, increased the reusability of datasets from a large research consortium, and accelerated research. Lastly, we discuss ongoing challenges with FAIRness beyond discoverability.
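As a rough illustration of what a Schema.org-based dataset description can look like, the sketch below builds a JSON-LD record of type `Dataset` in Python. The property names (`name`, `description`, `url`, `keywords`, `funder`) are standard Schema.org terms, but the example values, the `make_dataset_record` helper, and the `REQUIRED_FIELDS` check are hypothetical and are not taken from the consortium's actual schema.

```python
import json

# Hypothetical minimal set of required properties; the consortium's real
# schema is not described in the abstract and may differ.
REQUIRED_FIELDS = ("name", "description", "url")


def make_dataset_record(name, description, url, keywords, funder_name):
    """Build a Schema.org Dataset record as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": list(keywords),
        "funder": {"@type": "Organization", "name": funder_name},
    }


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Illustrative values only; this is not a real dataset.
record = make_dataset_record(
    name="Example infectious disease dataset",
    description="Hypothetical record used only to illustrate the structure.",
    url="https://example.org/datasets/demo",
    keywords=["infectious disease", "metadata"],
    funder_name="Example Funder",
)

print(json.dumps(record, indent=2))
print(missing_fields(record))
```

A record like this can be embedded in a dataset landing page as JSON-LD so that crawlers that understand Schema.org markup can index it.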
Affiliation(s)
- Ginger Tsueng
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Marco A Alvarado Cano
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- José Bento
  - Department of Computer Science, Boston College, 245 Beacon St, Chestnut Hill, MA, 02467, USA
- Candice Czech
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Mengjia Kang
  - Division of Pulmonary and Critical Care, Feinberg School of Medicine, Northwestern University, Chicago, IL, 60611, USA
- Lars Pache
  - Infectious and Inflammatory Disease Center, Immunity and Pathogenesis Program, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, 92037, USA
- Luke V Rasmussen
  - Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Tor C Savidge
  - Texas Children's Microbiome Center & Department of Pathology & Immunology, Baylor College of Medicine, Houston, TX, 77030, USA
- Justin Starren
  - Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Qinglong Wu
  - Texas Children's Microbiome Center & Department of Pathology & Immunology, Baylor College of Medicine, Houston, TX, 77030, USA
- Jiwen Xin
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Michael R Yeaman
  - Department of Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Divisions of Molecular Medicine and Infectious Diseases, Harbor-UCLA Medical Center, Torrance, CA, 90502, USA
  - Lundquist Institute for Infection & Immunity at Harbor-UCLA Medical Center, Torrance, CA, 90502, USA
- Xinghua Zhou
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Andrew I Su
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
  - Scripps Research Translational Institute, La Jolla, CA, 92037, USA
  - Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Chunlei Wu
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
  - Scripps Research Translational Institute, La Jolla, CA, 92037, USA
  - Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA, 92037, USA
- Liliana Brown
  - Office of Genomics and Advanced Technologies, National Institute of Allergy and Infectious Diseases, Rockville, MD, 20852, USA
- Reed S Shabman
  - Office of Genomics and Advanced Technologies, National Institute of Allergy and Infectious Diseases, Rockville, MD, 20852, USA
- Laura D Hughes
  - Department of Integrative, Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, 92037, USA
3
Li X, Zhang Y, Jin J, Sun F, Li N, Liang S. A model of integrating convolution and BiGRU dual-channel mechanism for Chinese medical text classifications. PLoS One 2023; 18:e0282824. [PMID: 36928266 PMCID: PMC10019650 DOI: 10.1371/journal.pone.0282824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Accepted: 02/23/2023] [Indexed: 03/18/2023] Open
Abstract
Many Chinese patients now seek advice on treatment plans through social networking platforms, and the resulting Chinese medical text is rich in information, including a large number of medical terms and symptom descriptions. Building an intelligent model that automatically classifies the text submitted by patients and recommends the correct department is therefore very important. To address the problems of insufficient feature extraction from Chinese medical text and low accuracy, this paper proposes a dual-channel Chinese medical text classification model. The model extracts features of Chinese medical text at different granularities, obtains effective feature information comprehensively and accurately, and finally recommends departments for patients according to the text classification. One channel of the model focuses on medical nomenclature, symptoms and other words related to hospital departments, assigns them different weights, calculates the corresponding feature vectors with convolution kernels of different sizes, and then obtains a local text representation. The other channel uses a BiGRU network and an attention mechanism to obtain a text representation that highlights the important information of the whole sentence, that is, a global text representation. Finally, the model uses a fully connected layer to combine the representation vectors of the two channels and a Softmax classifier for classification. The experimental results show that the accuracy, recall and F1-score of the model are improved by 10.65%, 8.94% and 11.62%, respectively, on average compared with the baseline models, which proves that our model has better performance and robustness.
Affiliation(s)
- Xiaoli Li
  - School of Software, Henan University, Kaifeng, China
- Yuying Zhang
  - School of Software, Henan University, Kaifeng, China
- Jiangyong Jin
  - School of Software, Henan University, Kaifeng, China
- Fuqi Sun
  - School of Software, Henan University, Kaifeng, China
- Na Li
  - School of Digital Arts and Communication, Shandong University of Art & Design, Jinan, China
- Shengbin Liang
  - School of Software, Henan University, Kaifeng, China
  - Institute for Data Engineering and Science, University of Saint Joseph, Macao, China
4
Zhang Z. An improved BM25 algorithm for clinical decision support in Precision Medicine based on co-word analysis and Cuckoo Search. BMC Med Inform Decis Mak 2021; 21:81. [PMID: 33653325 PMCID: PMC7927407 DOI: 10.1186/s12911-021-01454-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/03/2020] [Accepted: 02/23/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Retrieving gene and disease information from a vast collection of biomedical abstracts to provide doctors with clinical decision support is one of the important research directions of Precision Medicine. METHOD We propose a novel article retrieval method based on expanded word and co-word analyses, and use Cuckoo Search to optimize the parameters of the retrieval function. The main goal is to retrieve the abstracts of biomedical articles that refer to treatments. The methods in this manuscript use the BM25 algorithm to calculate the score of abstracts. We propose an improved version of BM25 that computes the scores of expanded words and co-words, leading to a composite retrieval function, which is then optimized using Cuckoo Search. The proposed method aims to find both disease and gene information in the abstract of the same biomedical article. This is intended to yield higher relevance, and hence higher scores, for such articles. In addition, we investigate the influence of different parameters on the retrieval algorithm and summarize how they meet various retrieval needs. RESULTS The data used in this manuscript are sourced from medical articles presented in the Text Retrieval Conference (TREC) Clinical Decision Support (CDS) Tracks of 2017, 2018, and 2019 in Precision Medicine. A total of 120 topics are tested. Three indicators are employed to compare the methods, which are selected from those based only on the BM25 algorithm and its improved version so that the experiments are comparable. The results showed that the proposed algorithm achieves better results. CONCLUSION The proposed method, an improved version of the BM25 algorithm, utilizes both co-word implementation and Cuckoo Search, and has been verified to achieve better results on a large number of experimental sets. In addition, a relatively simple query expansion method is implemented in this manuscript. Future research will focus on ontology and semantic networks to expand the query vocabulary.
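For readers unfamiliar with the baseline that this entry builds on, the following is a minimal sketch of the standard Okapi BM25 scoring function in Python. It is not the improved, co-word-based version described in the abstract. The toy documents and query are hypothetical, and `k1` and `b` are the usual BM25 free parameters, shown here with commonly used default values.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            tf = term_freq[term]
            # Length normalization controlled by b; saturation controlled by k1.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical tokenized abstracts and a gene-plus-disease query.
docs = [
    ["braf", "mutation", "melanoma", "treatment"],
    ["melanoma", "survival", "study"],
    ["diabetes", "insulin", "therapy"],
]
query = ["braf", "melanoma"]

scores = bm25_scores(query, docs)
print(scores)
```

In this toy collection the abstract that mentions both the gene and the disease scores highest, and the abstract that mentions neither scores zero. A parameter search would adjust values such as `k1` and `b` against a retrieval metric.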
Affiliation(s)
- Zicheng Zhang
- School of Information Management, Nanjing University, Nanjing, 210023, China.
- Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing, 210023, China.
5
Zhang L, Hu J, Xu Q, Li F, Rao G, Tao C. A semantic relationship mining method among disorders, genes, and drugs from different biomedical datasets. BMC Med Inform Decis Mak 2020; 20:283. [PMID: 33317518 PMCID: PMC7734713 DOI: 10.1186/s12911-020-01274-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 09/21/2020] [Accepted: 09/22/2020] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Semantic web technology has been applied widely in the biomedical informatics field. Large numbers of biomedical datasets are available online in the resource description framework (RDF) format. Semantic relationship mining among genes, disorders, and drugs is widely used in, for example, precision medicine and drug repositioning. However, most existing studies have focused on a single dataset. It is not easy to find the most current disorder-gene-drug relationships because they are distributed across heterogeneous datasets. How to mine these semantic relationships from different biomedical datasets is an important issue. METHODS First, a variety of biomedical datasets were converted into RDF triple data; then, multisource biomedical datasets were integrated into a storage system using a data integration algorithm. Second, nine query patterns among genes, disorders, and drugs from different biomedical datasets were designed. Third, the gene-disorder-drug semantic relationship mining algorithm is presented. This algorithm can query the relationships among various entities from different datasets. RESULTS AND CONCLUSIONS We focused on mining the putative and the most current disorder-gene-drug relationships about Parkinson's disease (PD). The results demonstrate that our method has significant advantages in mining and integrating multisource heterogeneous biomedical datasets. Twenty-five new relationships among the genes, disorders, and drugs were mined from four different datasets. The query results showed that most of them came from different datasets. The precision of the method increased by 2.51% compared to that of the multisource linked open data fusion method presented in the 4th International Workshop on Semantics-Powered Data Mining and Analytics (SEPDA 2019). Moreover, the number of query results increased by 7.7%, and the number of correct queries increased by 9.5%.
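The idea of integrating triples from several sources and then querying across them can be sketched in plain Python. This is only an illustration. The identifiers, predicates (`associatedWith`, `targets`), and the `match` and `drugs_for_disorder` helpers are hypothetical, and the sketch uses in-memory tuples rather than an RDF store or the nine query patterns mentioned in the abstract.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Hypothetical triples standing in for two separately curated datasets.
dataset_a = [
    ("gene:G1", "associatedWith", "disorder:D1"),
    ("gene:G2", "associatedWith", "disorder:D2"),
]
dataset_b = [
    ("drug:X1", "targets", "gene:G1"),
    ("drug:X2", "targets", "gene:G3"),
]


def drugs_for_disorder(disorder, triples):
    """Find (drug, gene) pairs linked to a disorder through a shared gene."""
    results = []
    for gene, _, _ in match(triples, predicate="associatedWith", obj=disorder):
        for drug, _, _ in match(triples, predicate="targets", obj=gene):
            results.append((drug, gene))
    return results


# Integration here is simple concatenation of the two triple lists.
integrated = dataset_a + dataset_b
print(drugs_for_disorder("disorder:D1", integrated))
```

The drug-gene-disorder link for `disorder:D1` only appears after the two sources are combined, because neither dataset contains both halves of the path on its own.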
Affiliation(s)
- Li Zhang
  - School of Economics and Management, Tianjin University of Science and Technology, Tianjin, 300457 China
- Jiamei Hu
  - School of Economics and Management, Tianjin University of Science and Technology, Tianjin, 300457 China
- Qianzhi Xu
  - School of Economics and Management, Tianjin University of Science and Technology, Tianjin, 300457 China
- Fang Li
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX 77030 USA
- Guozheng Rao
  - College of Intelligence and Computing, Tianjin University, Tianjin, 300350 China
  - Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, 300350 China
- Cui Tao
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St Suite 600, Houston, TX 77030 USA
6
Xu B, Lin H, Yang L, Xu K, Zhang Y, Zhang D, Yang Z, Wang J, Lin Y, Yin F. A supervised term ranking model for diversity enhanced biomedical information retrieval. BMC Bioinformatics 2019; 20:590. [PMID: 31787087 PMCID: PMC6886246 DOI: 10.1186/s12859-019-3080-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 01/18/2023] Open
Abstract
Background The number of biomedical research articles has increased exponentially with the advancement of biomedicine in recent years. This growth makes it very difficult for researchers to obtain the information they need. Information retrieval technologies seek to tackle the problem. However, information needs cannot be completely satisfied by directly applying existing information retrieval techniques. Therefore, biomedical information retrieval not only focuses on the relevance of search results, but also aims to promote the completeness of the results, which is referred to as diversity-oriented retrieval. Results We address the diversity-oriented biomedical retrieval task using a supervised term ranking model. The model is learned through a supervised query expansion process for term refinement. Based on the model, the most relevant and diversified terms are selected to enrich the original query. The expanded query is then fed into a second retrieval to improve the relevance and diversity of search results. To this end, we propose three diversity-oriented optimization strategies in our model: a diversified term labeling strategy, biomedical resource-based term features, and a diversity-oriented group sampling learning method. Experimental results on TREC Genomics collections demonstrate the effectiveness of the proposed model in improving the relevance and the diversity of search results. Conclusions The three proposed strategies jointly contribute to the improvement of biomedical retrieval performance. Our model yields more relevant and diversified results than the state-of-the-art baseline models. Moreover, our method provides a general framework for improving biomedical retrieval performance, and can be used as the basis for future work.
Affiliation(s)
- Bo Xu
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Linggong Road, Dalian, People's Republic of China
  - State Key Laboratory of Cognitive Intelligence, iFLYTEK, Hefei, People's Republic of China
- Hongfei Lin
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Linggong Road, Dalian, People's Republic of China
- Liang Yang
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Linggong Road, Dalian, People's Republic of China
- Kan Xu
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Linggong Road, Dalian, People's Republic of China
- Yijia Zhang
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Linggong Road, Dalian, People's Republic of China
- Dongyu Zhang
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Linggong Road, Dalian, People's Republic of China
- Zhihao Yang
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Linggong Road, Dalian, People's Republic of China
- Jian Wang
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Linggong Road, Dalian, People's Republic of China
- Yuan Lin
  - WISE Lab, School of Public Administration and Law, Dalian University of Technology, Linggong Road, Dalian, People's Republic of China
- Fuliang Yin
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Linggong Road, Dalian, People's Republic of China
7
Wang Y, Sohn S, Liu S, Shen F, Wang L, Atkinson EJ, Amin S, Liu H. A clinical text classification paradigm using weak supervision and deep representation. BMC Med Inform Decis Mak 2019; 19:1. [PMID: 30616584 PMCID: PMC6322223 DOI: 10.1186/s12911-018-0723-6] [Citation(s) in RCA: 171] [Impact Index Per Article: 28.5] [Received: 05/23/2018] [Accepted: 12/10/2018] [Indexed: 01/02/2023] Open
Abstract
BACKGROUND Automatic clinical text classification is a natural language processing (NLP) technology that unlocks information embedded in clinical narratives. Machine learning approaches have been shown to be effective for clinical text classification tasks. However, a successful machine learning model usually requires extensive human effort to create labeled training data and conduct feature engineering. In this study, we propose a clinical text classification paradigm using weak supervision and deep representation to reduce this human effort. METHODS We develop a rule-based NLP algorithm to automatically generate labels for the training data, and then use pre-trained word embeddings as deep representation features for training machine learning models. Since the machine learning models are trained on labels generated by the automatic NLP algorithm, this training process is called weak supervision. We evaluate the effectiveness of the paradigm on two institutional case studies at Mayo Clinic, smoking status classification and proximal femur (hip) fracture classification, and on one case study using a public dataset, the i2b2 2006 smoking status classification shared task. We test four widely used machine learning models, namely Support Vector Machine (SVM), Random Forest (RF), Multilayer Perceptron Neural Networks (MLPNN), and Convolutional Neural Networks (CNN), using this paradigm. Precision, recall, and F1 score are used as metrics to evaluate performance. RESULTS CNN achieves the best performance in both institutional tasks (F1 score: 0.92 for Mayo Clinic smoking status classification and 0.97 for fracture classification). We show that word embeddings significantly outperform tf-idf and topic modeling features in the paradigm, and that CNN captures additional patterns from the weak supervision compared to the rule-based NLP algorithms. We also observe two drawbacks of the proposed paradigm: CNN is more sensitive to the size of the training data, and the paradigm might not be effective for complex multiclass classification tasks. CONCLUSION The proposed clinical text classification paradigm could reduce the human effort of creating labeled training data and conducting feature engineering when applying machine learning to clinical text classification, by leveraging weak supervision and deep representation. The experiments have validated the effectiveness of the paradigm on two institutional and one shared clinical text classification tasks.
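The label-generation step of weak supervision can be illustrated with a small rule-based labeler in Python. This is a sketch under stated assumptions. The regular-expression rules, label names, and example notes below are hypothetical and far simpler than a clinical NLP algorithm, and the abstract does not describe the study's actual rules. The downstream model training on word embeddings is omitted.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("non-smoker", re.compile(r"\b(never smoked|non-?smoker|denies smoking)\b", re.I)),
    ("past smoker", re.compile(r"\b(former smoker|quit smoking|ex-smoker)\b", re.I)),
    ("current smoker", re.compile(r"\b(current smoker|smokes|smoking daily)\b", re.I)),
]


def weak_label(note):
    """Assign a smoking-status label from keyword rules, or 'unknown'."""
    for label, pattern in RULES:
        if pattern.search(note):
            return label
    return "unknown"


# Hypothetical clinical note snippets.
notes = [
    "Patient smokes one pack per day.",
    "Former smoker, quit smoking in 2010.",
    "She has never smoked.",
    "No complaints today.",
]

# Keep only notes the rules could label; these (text, label) pairs would
# then serve as training data for a machine learning classifier.
training_pairs = [
    (note, weak_label(note)) for note in notes if weak_label(note) != "unknown"
]

for note, label in training_pairs:
    print(label, "<-", note)
```

Because the labels come from rules rather than human annotators, they are noisy. The premise of the paradigm is that a model trained on them can still generalize beyond the patterns the rules encode.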
Affiliation(s)
- Yanshan Wang
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st ST SW, Rochester, MN 55905 USA
- Sunghwan Sohn
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st ST SW, Rochester, MN 55905 USA
- Sijia Liu
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st ST SW, Rochester, MN 55905 USA
- Feichen Shen
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st ST SW, Rochester, MN 55905 USA
- Liwei Wang
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st ST SW, Rochester, MN 55905 USA
- Elizabeth J. Atkinson
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st ST SW, Rochester, MN 55905 USA
- Shreyasee Amin
  - Division of Rheumatology, Department of Medicine, Mayo Clinic, 200 1st ST SW, Rochester, MN 55905 USA
  - Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, 200 1st ST SW, Rochester, MN 55905 USA
- Hongfang Liu
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, 200 1st ST SW, Rochester, MN 55905 USA
8