1
Zhou DD, Peng XY, Zhao L, Ma LL, Hu JH, Jiang ZH, He XQ, Wang W, Chen R, Kuang L. Neurophysiological biomarkers for depression classification: Utilizing microstate k-mers and a bag-of-words model. J Psychiatr Res 2023; 165:197-204. [PMID: 37517240 DOI: 10.1016/j.jpsychires.2023.07.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Revised: 06/30/2023] [Accepted: 07/16/2023] [Indexed: 08/01/2023]
Abstract
Microstates are analogous to characters in a language, and short fragments consisting of several microstates (k-mers) are analogous to words. We aimed to investigate whether microstate k-mers could be used as neurophysiological biomarkers to differentiate between depressed patients and normal controls. We utilized a bag-of-words model to process microstate sequences, using k-mers with a k range of 1-10 as terms, and the term frequency (TF) with or without inverse-document-frequency (IDF) as features. We performed nested cross-validation on Dataset 1 (27 patients and 26 controls) and Dataset 2 (34 patients and 30 controls) separately and then trained on one dataset and tested on the other. The best area under the curve (AUC) of 81.5% was achieved for the model with L1 regularization using the TF of 4-mers as features in Dataset 1, and the best AUC of 88.9% was achieved for the model with L1 regularization using the TF of 9-mers as features in Dataset 2. When Dataset 1 was used as the training set, the best AUC of predicting Dataset 2 was 74.1% for the model with L2 regularization using the TF-IDF of 9-mers as features, while the best AUC of predicting Dataset 1 was 70.2% for the model with L1 regularization using the TF of 8-mers as features. Our study provided novel insights into the potential of microstate k-mers as neurophysiological biomarkers for individual-level classification of depression. These may facilitate further exploration of microstate sequences using natural language processing techniques.
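The feature extraction described in this abstract can be illustrated with a minimal sketch. It assumes microstate sequences are stored as strings of class labels (A to D here) and uses a smoothed IDF of the form log((1 + N) / (1 + df)) + 1; neither the label alphabet nor the exact IDF variant is stated in the abstract, and the function names and toy sequences are hypothetical.

```python
from collections import Counter
import math


def kmers(sequence, k):
    """Return all overlapping k-mers of a microstate label string."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def term_frequency(sequence, k):
    """Relative frequency of each k-mer within one sequence."""
    counts = Counter(kmers(sequence, k))
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def tf_idf(sequences, k):
    """TF-IDF per sequence, using the assumed smoothed IDF
    log((1 + N) / (1 + df)) + 1, where df is the document frequency."""
    tfs = [term_frequency(s, k) for s in sequences]
    n_docs = len(sequences)
    doc_freq = Counter(term for tf in tfs for term in tf)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: v * idf[t] for t, v in tf.items()} for tf in tfs]


# Hypothetical toy recordings over four microstate classes A-D.
recordings = ["ABCDABCA", "AABBAABB", "ABABCDCD"]
tf = term_frequency(recordings[0], 2)
weighted = tf_idf(recordings, 2)
```

In this toy data the 2-mer "AB" occurs in every recording, so its IDF weight is 1 and its TF-IDF equals its TF, while "DA" occurs in only one recording and is weighted up. The resulting dictionaries would then be turned into fixed-length vectors for a regularized classifier.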
Affiliation(s)
- Dong-Dong Zhou: Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China
- Xin-Yu Peng: Department of Psychiatry, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Lin Zhao: Department of Psychiatry, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Ling-Li Ma: Department of Psychiatry, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Jin-Hui Hu: Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China
- Zheng-Hao Jiang: Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China
- Xiao-Qing He: Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China
- Wo Wang: Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China
- Ran Chen: Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China
- Li Kuang: Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China; Department of Psychiatry, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
2
Olaleye T, Abayomi-Alli A, Adesemowo K, Arogundade OT, Misra S, Kose U. SCLAVOEM: hyper parameter optimization approach to predictive modelling of COVID-19 infodemic tweets using smote and classifier vote ensemble. Soft Comput 2023; 27:3531-50. [PMID: 35309597 DOI: 10.1007/s00500-022-06940-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Accepted: 02/21/2022] [Indexed: 11/23/2022]
Abstract
Fake COVID-19 tweets are dangerous because they are misinformative or completely inaccurate and threaten efforts to flatten the pandemic curve. Thus, alongside the COVID-19 pandemic itself, fake news and myths about the virus constitute an infodemic, which must be tackled by ensuring that only valid information circulates. In this context, this study proposed the Synthetic Minority Over-Sampling Technique (SMOTE) and classifier vote ensemble (SCLAVOEM) method as a fake news classifier and a hyperparameter optimization approach for predictive modelling of COVID-19 infodemic tweets. Hyperparameter optimization variables were deployed at specific points of the proposed model, and minority oversampling of the training sets was applied where class representations were imbalanced. Experimental applications of SCLAVOEM to COVID-19 infodemic prediction returned weighted averages of 0.999 and 1.000 for F-measure and area under the curve (AUC), respectively. With SMOTE, performance increases of 3.74 and 1.11%; 5.05 and 0.29%; and 4.59 and 8.05% were seen in three different data sets. Ultimately, SCLAVOEM provided a framework for predictive detection of 'fake tweets' and three classifiers: 'positive', 'negative' and 'click-trap' (piège à clics). The model is expected to flag fake information on Twitter automatically, thereby protecting the public from inaccurate information and information overload.
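The two ingredients named in this abstract, minority oversampling and a vote ensemble, can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the SMOTE step follows the usual interpolation idea on hypothetical 2-D feature vectors, and the ensemble is assumed to use hard majority voting, which the abstract does not specify.

```python
import random
from collections import Counter


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority samples by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


def majority_vote(predictions):
    """Hard-voting ensemble (assumed rule): the most frequent label wins."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical minority-class feature vectors and classifier outputs.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
label = majority_vote(["fake", "real", "fake"])
```

Because every synthetic point lies on a segment between two real minority samples, oversampling in this way should be applied to the training split only, as the abstract describes, so that no interpolated point leaks into evaluation data.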
3
Qiu W, Lv Z, Xiao X, Shao S, Lin H. EMCBOW-GPCR: A method for identifying G-protein coupled receptors based on word embedding and wordbooks. Comput Struct Biotechnol J 2021; 19:4961-4969. [PMID: 34527200 PMCID: PMC8437786 DOI: 10.1016/j.csbj.2021.08.044] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 04/27/2021] [Revised: 08/07/2021] [Accepted: 08/27/2021] [Indexed: 11/15/2022]
Abstract
A computational method was developed to identify G-protein coupled receptors. Three word-embedding models and a bag-of-words model are used to extract original features. High accuracy was achieved by using fusion information. A powerful tool was established.
G protein-coupled receptors (GPCRs) are one of the largest membrane protein receptor families in humans and are also important targets for many drugs. Hence, it is of great significance to judge whether a protein is a GPCR or not. However, identifying GPCRs by experimental methods is very expensive and time-consuming. As more and more GPCR primary sequences accumulate, it is feasible to develop a computational model to predict GPCRs precisely and quickly. In this paper, a novel method called EMCBOW-GPCR is proposed to improve the accuracy of identifying GPCRs based on natural language processing (NLP). To represent GPCRs, three word-embedding models and a bag-of-words model are used to extract original features. The original features are then fed into a deep-learning algorithm to extract further features and reduce the dimension. Finally, the obtained features are fed into Extreme Gradient Boosting. As the comparison of results shows, the overall prediction metrics of EMCBOW-GPCR are higher than those of state-of-the-art methods. To make EMCBOW-GPCR convenient for more researchers to use, the method and source code have been released on GitHub at https://github.com/454170054/EMCBOW-GPCR, and a user-friendly web server for EMCBOW-GPCR has been established at http://www.jci-bioinfo.cn/emcbowgpcr.
Affiliation(s)
- Wangren Qiu: School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Zhe Lv: School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Xuan Xiao: School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Shuai Shao: School of Information Engineering, Jingdezhen Ceramic Institute, Jingdezhen, China
- Hao Lin: Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
4
Bhatt N, Ganatra A. Improvement of deep cross-modal retrieval by generating real-valued representation. PeerJ Comput Sci 2021; 7:e491. [PMID: 33987458 PMCID: PMC8093956 DOI: 10.7717/peerj-cs.491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2020] [Accepted: 03/24/2021] [Indexed: 06/12/2023]
Abstract
Cross-modal retrieval (CMR) has attracted much attention in the research community because it offers flexible and comprehensive retrieval. The core challenge in CMR is the heterogeneity gap, which arises from the different statistical properties of multi-modal data. The most common way to bridge the heterogeneity gap is representation learning, which generates a common sub-space. In this work, we propose a framework called "Improvement of Deep Cross-Modal Retrieval (IDCMR)", which generates real-valued representations. IDCMR preserves both intra-modal and inter-modal similarity. Intra-modal similarity is preserved by selecting an appropriate training model for the text and image modalities. Inter-modal similarity is preserved by reducing the modality-invariance loss. Mean average precision (mAP) is used as the performance measure of the CMR system. Extensive experiments show that IDCMR outperforms state-of-the-art methods by relative mAP margins of 4% and 2% in the text-to-image and image-to-text retrieval tasks on the MSCOCO and Xmedia datasets, respectively.
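Since mAP is the performance measure named in this abstract, a small worked sketch may help. It assumes each query returns a ranked list with binary relevance flags and that average precision is normalized by the number of relevant items found in that list; other normalizations exist, and the abstract does not say which one the authors used. The function names and toy rankings are hypothetical.

```python
def average_precision(relevance):
    """AP for one ranked result list; relevance[i] is True when the item
    at rank i + 1 is relevant to the query."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


# Hypothetical rankings for two queries (e.g. text queries retrieving images).
queries = [
    [True, False, True, False],   # AP = (1/1 + 2/3) / 2 = 5/6
    [False, True, False, False],  # AP = (1/2) / 1 = 1/2
]
score = mean_average_precision(queries)  # (5/6 + 1/2) / 2 = 2/3
```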
Affiliation(s)
- Nikita Bhatt: U & P U. Patel Department of Computer Engineering, Chandubhai S. Patel Institute of Technology, Charotar University of Science and Technology (CHARUSAT), Changa, India
- Amit Ganatra: Devang Patel Institute of Advance Technology and Research, Charotar University of Science and Technology (CHARUSAT), Changa, India
5
Abstract
BACKGROUND: G protein-coupled receptors (GPCRs) mediate a variety of important physiological functions, are closely related to many diseases, and constitute the most important target family of modern drugs. Research on GPCR analysis and GPCR ligand screening is therefore a hotspot of new drug development. Accurately identifying GPCR-drug interactions is one of the key steps in designing GPCR-targeted drugs. However, it is prohibitively expensive to ascertain the interaction of GPCR-drug pairs experimentally on a large scale. It is therefore of great significance to predict the interaction of GPCR-drug pairs directly from molecular sequences. With the accumulation of known GPCR-drug interaction data, it is feasible to develop sequence-based machine learning models for query GPCR-drug pairs. RESULTS: In this paper, a new sequence-based method is proposed to identify GPCR-drug interactions. For GPCRs, we use a novel bag-of-words (BoW) model to extract sequence features, which captures pattern information from low order to high order while limiting the dimension of the feature space. For drug molecules, we use the discrete Fourier transform (DFT) to extract higher-order pattern information from the original molecular fingerprints. The feature vectors of the two kinds of molecules are concatenated and input into a simple prediction engine, the distance-weighted K-nearest-neighbor (DWKNN) classifier. This basic method can easily be enhanced through ensemble learning. Tests on recently constructed GPCR-drug interaction datasets show that the proposed methods generalize better than existing sequence-based machine learning methods, including an unconventional method whose prediction performance was further improved by a post-processing procedure (PPP).
CONCLUSIONS: The proposed methods are effective for GPCR-drug interaction prediction and may also be applicable to other target-drug interaction prediction or to protein-protein interaction prediction. In addition, the newly proposed feature extraction method for GPCR sequences is a modified version of the traditional BoW model and may be useful for problems of protein classification or attribute prediction. The source code of the proposed methods is freely available for academic research at https://github.com/wp3751/GPCR-Drug-Interaction.
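The drug-side feature extraction and the prediction engine described in this abstract can be sketched roughly as follows. The sketch assumes a binary fingerprint given as a list of 0/1 values, keeps only DFT magnitudes, and weights neighbours by inverse distance; the abstract does not give the authors' exact DFT features or weighting formula, so these choices and all names below are hypothetical stand-ins.

```python
import cmath
import math
from collections import defaultdict


def dft_magnitudes(bits):
    """Magnitudes of the discrete Fourier transform of a fingerprint."""
    n = len(bits)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(bits)))
        for k in range(n)
    ]


def dwknn_predict(train, query, k=3):
    """Distance-weighted k-NN: each of the k nearest training vectors
    votes for its label with weight 1 / distance (assumed weighting)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for features, label in nearest:
        votes[label] += 1.0 / (math.dist(features, query) + 1e-9)
    return max(votes, key=votes.get)


# Hypothetical 4-bit fingerprint; its spectrum is approximately [2, 0, 2, 0].
spectrum = dft_magnitudes([1, 0, 1, 0])

# Hypothetical concatenated GPCR + drug feature vectors with labels.
train = [
    ([0.0, 0.0], "no"), ([0.0, 1.0], "no"),
    ([5.0, 5.0], "yes"), ([5.0, 6.0], "yes"), ([6.0, 5.0], "yes"),
]
prediction = dwknn_predict(train, [0.2, 0.1])
```

In a full pipeline the query vector would be the GPCR bag-of-words features concatenated with the drug's DFT features, as the abstract describes; the two-dimensional vectors here only keep the example readable.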
Affiliation(s)
- Pu Wang: Computer School, Hubei University of Arts and Science, Xiangyang, 441053 China
- Xiaotong Huang: Computer School, Hubei University of Arts and Science, Xiangyang, 441053 China
- Wangren Qiu: Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, 333403 China
- Xuan Xiao: Computer Department, Jingdezhen Ceramic Institute, Jingdezhen, 333403 China
6
Adeli E, Meng Y, Li G, Lin W, Shen D. Multi-task prediction of infant cognitive scores from longitudinal incomplete neuroimaging data. Neuroimage 2018; 185:783-792. [PMID: 29709627 DOI: 10.1016/j.neuroimage.2018.04.052] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 10/14/2017] [Revised: 03/26/2018] [Accepted: 04/23/2018] [Indexed: 01/13/2023]
Abstract
The early postnatal brain undergoes a stunning period of development. Over the past few years, research on dynamic infant brain development has received increased attention, showing how important the early stages of a child's life are for brain development. To chart early brain developmental trajectories precisely, longitudinal studies with data acquired over a long enough period of infants' early life are essential. In practice, however, missing data at one or more time points during data gathering is often inevitable. This leads to an incomplete set of longitudinal data, which poses a major challenge for such studies. In this paper, the prediction of multiple future cognitive scores from incomplete longitudinal imaging data is modeled in a multi-task machine learning framework. To learn this model efficiently, we account for the selection of informative features (i.e., neuroimaging morphometric measurements at different time points) while preserving the structural information and the interrelation between the multiple cognitive scores. Several experiments are conducted on a carefully acquired in-house dataset, and the results affirm that we can predict the cognitive scores measured at four years of age using imaging data from earlier time points, as early as 24 months of age, with reasonable performance (i.e., a root mean square error of 0.18).
Affiliation(s)
- Ehsan Adeli: Department of Radiology and BRIC, University of North Carolina at Chapel Hill, NC 27599, United States; Department of Psychiatry & Behavioral Sciences, Stanford University, Stanford, CA 94305, United States
- Yu Meng: Department of Radiology and BRIC, University of North Carolina at Chapel Hill, NC 27599, United States; Department of Computer Science, University of North Carolina at Chapel Hill, NC 27599, United States
- Gang Li: Department of Radiology and BRIC, University of North Carolina at Chapel Hill, NC 27599, United States
- Weili Lin: Department of Radiology and BRIC, University of North Carolina at Chapel Hill, NC 27599, United States
- Dinggang Shen: Department of Radiology and BRIC, University of North Carolina at Chapel Hill, NC 27599, United States; Department of Brain & Cognitive Eng, Korea University, Seoul, 02841, Republic of Korea
7
Mouriño García MA, Pérez Rodríguez R, Anido Rifón LE. Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach. PeerJ 2015; 3:e1279. [PMID: 26468436 PMCID: PMC4592155 DOI: 10.7717/peerj.1279] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 08/03/2015] [Accepted: 09/07/2015] [Indexed: 11/22/2022]
Abstract
Automatic classification of text documents into a set of categories has many applications. Among them, the automatic classification of biomedical literature stands out as an important one. Biomedical staff and researchers have to deal with a great deal of literature in their daily activities, so a system that gives access to documents of interest in a simple and effective way would be useful; this requires that the documents be sorted according to some criteria, that is to say, classified. Documents to be classified are usually represented following the bag-of-words (BoW) paradigm. Features are words in the text, which therefore suffer from synonymy and polysemy, and their weights are based only on their frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge, specifically Wikipedia, to create bag-of-concepts (BoC) representations of documents, where a concept is understood as a "unit of meaning", thus tackling synonymy and polysemy. In addition, the weighting of concepts is based on their semantic relevance in the text. To evaluate the proposal, empirical experiments were conducted with one of the corpora commonly used for evaluating classification and retrieval of biomedical information, OHSUMED, and also with a purpose-built corpus of MEDLINE biomedical abstracts, UVigoMED. The results show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem for the OHSUMED corpus, and by up to 122% in the single-label classification problem and up to 155% in the multi-label problem for the UVigoMED corpus.
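The contrast between the two representations can be illustrated with a toy sketch. It assumes a small hand-written term-to-concept dictionary standing in for a Wikipedia-based annotator, and it uses raw counts instead of the semantic-relevance weighting mentioned in the abstract; the dictionary, function names, and example texts are all hypothetical, and only synonymy (not polysemy) is addressed.

```python
from collections import Counter

# Hypothetical mapping from surface words to concept identifiers; a real
# system would obtain concepts from a Wikipedia-based annotator instead.
TERM_TO_CONCEPT = {
    "tumor": "Neoplasm",
    "tumour": "Neoplasm",
    "neoplasm": "Neoplasm",
    "heart": "Heart",
    "cardiac": "Heart",
}


def bag_of_words(text):
    """Count lower-cased, whitespace-separated words."""
    return Counter(text.lower().split())


def bag_of_concepts(text):
    """Count concepts instead of surface words; unmapped words are dropped."""
    return Counter(
        TERM_TO_CONCEPT[w] for w in text.lower().split() if w in TERM_TO_CONCEPT
    )


doc_a = "cardiac tumor"
doc_b = "heart neoplasm"
shared_words = set(bag_of_words(doc_a)) & set(bag_of_words(doc_b))
shared_concepts = set(bag_of_concepts(doc_a)) & set(bag_of_concepts(doc_b))
```

The two toy documents share no words but map to the same two concepts, which is the synonymy effect the abstract attributes to the BoC representation.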
8
Wang J, Sun X, Nahavandi S, Kouzani A, Wu Y, She M. Multichannel biomedical time series clustering via hierarchical probabilistic latent semantic analysis. Comput Methods Programs Biomed 2014; 117:238-246. [PMID: 25023531 DOI: 10.1016/j.cmpb.2014.06.014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/08/2013] [Revised: 06/20/2014] [Accepted: 06/22/2014] [Indexed: 06/03/2023]
Abstract
Biomedical time series clustering that automatically groups a collection of time series according to their internal similarity is of importance for medical record management and inspection such as bio-signals archiving and retrieval. In this paper, a novel framework that automatically groups a set of unlabelled multichannel biomedical time series according to their internal structural similarity is proposed. Specifically, we treat a multichannel biomedical time series as a document and extract local segments from the time series as words. We extend a topic model, i.e., the Hierarchical probabilistic Latent Semantic Analysis (H-pLSA), which was originally developed for visual motion analysis to cluster a set of unlabelled multichannel time series. The H-pLSA models each channel of the multichannel time series using a local pLSA in the first layer. The topics learned in the local pLSA are then fed to a global pLSA in the second layer to discover the categories of multichannel time series. Experiments on a dataset extracted from multichannel Electrocardiography (ECG) signals demonstrate that the proposed method performs better than previous state-of-the-art approaches and is relatively robust to the variations of parameters including length of local segments and dictionary size. Although the experimental evaluation used the multichannel ECG signals in a biometric scenario, the proposed algorithm is a universal framework for multichannel biomedical time series clustering according to their structural similarity, which has many applications in biomedical time series management.
Affiliation(s)
- Jin Wang: School of Computer Science & Software Engineering, The University of Western Australia, Australia; Center for Intelligent Systems Research, Deakin University, Australia
- Xiangping Sun: Center for Intelligent Systems Research, Deakin University, Australia; School of Engineering, Deakin University, Australia
- Saeid Nahavandi: Center for Intelligent Systems Research, Deakin University, Australia
- Yuchuan Wu: School of Mechanical Engineering and Automation, Wuhan Textile University, Wuhan, PR China
- Mary She: Center for Intelligent Systems Research, Deakin University, Australia; School of Engineering, Deakin University, Australia
9
Keraudren K, Kuklisova-Murgasova M, Kyriakopoulou V, Malamateniou C, Rutherford MA, Kainz B, Hajnal JV, Rueckert D. Automated fetal brain segmentation from 2D MRI slices for motion correction. Neuroimage 2014; 101:633-43. [PMID: 25058899 DOI: 10.1016/j.neuroimage.2014.07.023] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.3] [Received: 02/07/2014] [Revised: 06/05/2014] [Accepted: 07/15/2014] [Indexed: 01/18/2023]
Abstract
Motion correction is a key element for imaging the fetal brain in-utero using Magnetic Resonance Imaging (MRI). Maternal breathing can introduce motion, but a larger effect is frequently due to fetal movement within the womb. Consequently, imaging is frequently performed slice-by-slice using single shot techniques, which are then combined into volumetric images using slice-to-volume reconstruction methods (SVR). For successful SVR, a key preprocessing step is to isolate fetal brain tissues from maternal anatomy before correcting for the motion of the fetal head. This has hitherto been a manual or semi-automatic procedure. We propose an automatic method to localize and segment the brain of the fetus when the image data is acquired as stacks of 2D slices with anatomy misaligned due to fetal motion. We combine this segmentation process with a robust motion correction method, enabling the segmentation to be refined as the reconstruction proceeds. The fetal brain localization process uses Maximally Stable Extremal Regions (MSER), which are classified using a Bag-of-Words model with Scale-Invariant Feature Transform (SIFT) features. The segmentation process is a patch-based propagation of the MSER regions selected during detection, combined with a Conditional Random Field (CRF). The gestational age (GA) is used to incorporate prior knowledge about the size and volume of the fetal brain into the detection and segmentation process. The method was tested in a ten-fold cross-validation experiment on 66 datasets of healthy fetuses whose GA ranged from 22 to 39 weeks. In 85% of the tested cases, our proposed method produced a motion corrected volume of a relevant quality for clinical diagnosis, thus removing the need for manually delineating the contours of the brain before motion correction. Our method automatically generated as a side-product a segmentation of the reconstructed fetal brain with a mean Dice score of 93%, which can be used for further processing.
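The Dice score reported at the end of this abstract measures the overlap between an automatic and a reference segmentation. A minimal sketch follows, assuming each segmentation is represented as a set of voxel coordinates; the representation, function name, and toy masks are hypothetical and chosen only to make the formula concrete.

```python
def dice_score(predicted, reference):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two sets of voxels."""
    if not predicted and not reference:
        return 1.0  # two empty segmentations agree perfectly
    overlap = len(predicted & reference)
    return 2 * overlap / (len(predicted) + len(reference))


# Hypothetical 2-D masks given as sets of (row, column) coordinates.
auto = {(0, 0), (0, 1), (1, 0), (1, 1)}
manual = {(0, 0), (0, 1), (1, 1), (1, 2)}
score = dice_score(auto, manual)  # 2 * 3 / (4 + 4) = 0.75
```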