1
Borah K, Das HS, Seth S, Mallick K, Rahaman Z, Mallik S. A review on advancements in feature selection and feature extraction for high-dimensional NGS data analysis. Funct Integr Genomics 2024; 24:139. [PMID: 39158621 DOI: 10.1007/s10142-024-01415-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2024] [Revised: 07/30/2024] [Accepted: 08/01/2024] [Indexed: 08/20/2024]
Abstract
Recent advancements in biomedical technologies and the proliferation of high-dimensional Next Generation Sequencing (NGS) datasets have led to significant growth in the bulk and density of data. High-dimensional NGS data, characterized by a large number of genomics, transcriptomics, proteomics, and metagenomics features relative to the number of biological samples, poses significant challenges for data analysis, including increased computational burden, potential overfitting, and difficulty in interpreting results. Feature selection and feature extraction are two pivotal techniques employed to address these challenges by reducing the dimensionality of the data, thereby enhancing model performance, interpretability, and computational efficiency. Both can be categorized into statistical and machine learning methods. The present study conducts a comprehensive and comparative review of statistical, machine learning, and deep learning-based feature selection and extraction techniques tailored to the interpretation of human NGS and microarray data. A thorough literature search was performed to gather information on these techniques, focusing on array-based and NGS data analysis. The surveyed techniques, including deep learning architectures, machine learning algorithms, and statistical methods, have been explored for microarray, bulk RNA-Seq, and single-cell RNA-Seq (scRNA-Seq) datasets. The study provides an overview of these techniques, highlighting their applications, advantages, and limitations in the context of high-dimensional NGS data. This review aims to help readers apply feature selection and feature extraction techniques to enhance the performance of predictive models, uncover underlying biological patterns, and gain deeper insights into massive and complex NGS and microarray data.
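As a rough illustration of the distinction this abstract draws, the sketch below contrasts feature selection (keeping a subset of the original columns) with feature extraction (building new columns from all of them). The variance filter, the PCA-by-SVD projection, the toy matrix `X`, and the function names are illustrative assumptions, not methods taken from the review.

```python
import numpy as np

# Toy expression matrix (assumption): 6 samples x 50 features, so features >> samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))

def select_top_variance(x, k):
    """Feature selection: keep the k original columns with the largest variance."""
    idx = np.argsort(x.var(axis=0))[::-1][:k]
    return x[:, idx], idx

def extract_pca(x, k):
    """Feature extraction: project centered data onto the top-k principal axes."""
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

selected, kept_idx = select_top_variance(X, k=3)
components = extract_pca(X, k=3)
print(selected.shape, components.shape)
```

Selected columns remain interpretable as individual measured features, whereas each extracted component is a mixture of all features.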
Affiliation(s)
- Kasmika Borah
  - Department of Computer Science and Information Technology, Cotton University, Panbazar, Guwahati, 781001, Assam, India
- Himanish Shekhar Das
  - Department of Computer Science and Information Technology, Cotton University, Panbazar, Guwahati, 781001, Assam, India
- Soumita Seth
  - Department of Computer Science and Engineering, Future Institute of Engineering and Management, Narendrapur, Kolkata, 700150, West Bengal, India
- Koushik Mallick
  - Department of Computer Science and Engineering, RCC Institute of Information Technology, Canal S Rd, Beleghata, Kolkata, 700015, West Bengal, India
- Saurav Mallik
  - Department of Environmental Health, Harvard T H Chan School of Public Health, Boston, MA, 02115, USA
  - Department of Pharmacology & Toxicology, University of Arizona, Tucson, AZ, 85721, USA
2
Chafai N, Bonizzi L, Botti S, Badaoui B. Emerging applications of machine learning in genomic medicine and healthcare. Crit Rev Clin Lab Sci 2024; 61:140-163. [PMID: 37815417 DOI: 10.1080/10408363.2023.2259466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Accepted: 09/12/2023] [Indexed: 10/11/2023]
Abstract
The integration of artificial intelligence technologies has propelled the progress of clinical and genomic medicine in recent years. The significant increase in computing power has facilitated the ability of artificial intelligence models to analyze and extract features from extensive medical data and images, thereby contributing to the advancement of intelligent diagnostic tools. Artificial intelligence (AI) models have been utilized in the field of personalized medicine to integrate clinical data and genomic information of patients. This integration allows for the identification of customized treatment recommendations, ultimately leading to enhanced patient outcomes. Notwithstanding the notable advancements, the application of artificial intelligence (AI) in the field of medicine is impeded by various obstacles such as the limited availability of clinical and genomic data, the diversity of datasets, ethical implications, and the inconclusive interpretation of AI models' results. In this review, a comprehensive evaluation of multiple machine learning algorithms utilized in the fields of clinical and genomic medicine is conducted. Furthermore, we present an overview of the implementation of artificial intelligence (AI) in the fields of clinical medicine, drug discovery, and genomic medicine. Finally, a number of constraints pertaining to the implementation of artificial intelligence within the healthcare industry are examined.
Affiliation(s)
- Narjice Chafai
  - Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
- Luigi Bonizzi
  - Department of Biomedical, Surgical and Dental Science, University of Milan, Milan, Italy
- Sara Botti
  - PTP Science Park, Via Einstein - Loc. Cascina Codazza, Lodi, Italy
- Bouabid Badaoui
  - Laboratory of Biodiversity, Ecology, and Genome, Faculty of Sciences, Department of Biology, Mohammed V University in Rabat, Rabat, Morocco
  - African Sustainable Agriculture Research Institute (ASARI), Mohammed VI Polytechnic University (UM6P), Laâyoune, Morocco
3
Zhang X. Emotional Intervention and Education System Construction for Rural Children Based on Semantic Analysis. Occup Ther Int 2022; 2022:1073717. [PMID: 35874601 PMCID: PMC9273381 DOI: 10.1155/2022/1073717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2022] [Revised: 06/14/2022] [Accepted: 06/17/2022] [Indexed: 11/29/2022] Open
Abstract
Objective: Against the background of policies that support the healthy growth of left-behind children, this study examines common negative emotional problems of left-behind children in rural areas, focusing on how these negative emotions can be guided, and analyzes their causes through observation and research. Method: A platform for acquiring emotional semantic data of scene images in an open behavioral experimental environment is designed, which removes the limitations of time and place and thus acquires a large amount of emotional semantic data of scene images; principal component analysis is then used to evaluate the validity of the data analysis. Psychological testing was used to measure parent-child affinity, adversity beliefs, and positive/negative emotions in three groups: children whose parents went out, children whose fathers went out, and non-left-behind children. The characteristics of these measures in the three groups were examined, along with the direct predictive effects of parent-child affinity and adversity beliefs on the children's positive/negative emotions. Results/Discussion: Adversity beliefs played a partial mediating role between children's parent-child bonding and positive emotions. The predictive effect of adversity beliefs on children's emotional adaptation differs by emotional type. The main effects of the left-behind category were significant for both positive and negative emotions. The main effect of gender on negative emotion was significant, and the negative emotion level of girls was significantly higher than that of boys. The main effect of the left-behind category on adversity beliefs was significant, and the adversity belief levels of children whose parents went out were significantly lower than those of children whose fathers went out and non-left-behind children. The negative emotions of left-behind children in rural areas were channeled and, to a certain extent, alleviated. Emotional counseling for the rural left-behind children at the research site can improve the emotions of those served, promote their mental health and healthy growth, and provide a theoretical basis for establishing a local care system for left-behind children.
Affiliation(s)
- Xiaobo Zhang
  - School of Education Science, Xinyang Normal University, Xinyang, Henan 464000, China
4
5
García-Pedrajas N, Cerruela-García G. MABUSE: A margin optimization based feature subset selection algorithm using boosting principles. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109529] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022]
6
Xu D, Xu H, Zhang Y, Gao R. Novel Collaborative Weighted Non-negative Matrix Factorization Improves Prediction of Disease-Associated Human Microbes. Front Microbiol 2022; 13:834982. [PMID: 35369503 PMCID: PMC8965656 DOI: 10.3389/fmicb.2022.834982] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/14/2021] [Accepted: 01/19/2022] [Indexed: 12/14/2022] Open
Abstract
Extensive clinical and biomedical studies have shown that the microbiome plays a prominent role in human health. Identifying potential microbe–disease associations (MDAs) can help reveal the pathological mechanism of human diseases and be useful for the prevention, diagnosis, and treatment of human diseases. Therefore, it is necessary to develop effective computational models and reduce the cost and time of biological experiments. Here, we developed a novel machine learning-based joint framework called CWNMF-GLapRLS for human MDA prediction using the proposed collaborative weighted non-negative matrix factorization (CWNMF) technique and graph Laplacian regularized least squares. Specifically, to fuse more similarity information, we calculated the functional similarity of microbes. To deal with missing values and effectively overcome the data sparsity problem, we proposed a collaborative weighted NMF technique to reconstruct the original association matrix. In addition, we developed a graph Laplacian regularized least-squares method for prediction. The experimental results of fivefold and leave-one-out cross-validation demonstrated that our method achieved the best performance when compared with 5 state-of-the-art methods on the benchmark dataset. Case studies further showed that the proposed method is an effective tool to predict potential MDAs and can provide more help for biomedical researchers.
Affiliation(s)
- Da Xu
  - School of Mathematics and Statistics, Shandong University, Weihai, China
- Hanxiao Xu
  - School of Mathematics and Statistics, Shandong University, Weihai, China
- Yusen Zhang
  - School of Mathematics and Statistics, Shandong University, Weihai, China
  - *Correspondence: Yusen Zhang,
- Rui Gao
  - School of Control Science and Engineering, Shandong University, Jinan, China
  - Rui Gao,
7
Jaddi NS, Saniee Abadeh M. Cell separation algorithm with enhanced search behaviour in miRNA feature selection for cancer diagnosis. INFORM SYST 2022. [DOI: 10.1016/j.is.2021.101906] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/14/2022]
8
Li Z, Du J, Nie B, Xiong W, Xu G, Luo J. A new two-stage hybrid feature selection algorithm and its application in Chinese medicine. INT J MACH LEARN CYB 2021. [DOI: 10.1007/s13042-021-01445-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
9
Asad E, Mollah AF. Biomarker Identification From Gene Expression Based on Symmetrical Uncertainty. INTERNATIONAL JOURNAL OF INTELLIGENT INFORMATION TECHNOLOGIES 2021. [DOI: 10.4018/ijiit.289966] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
In this paper, we present an effective information-theoretic feature selection method, Symmetrical Uncertainty, to classify gene expression microarray data and detect biomarkers from it. Here, Information Gain and Symmetrical Uncertainty contribute to ranking the features. Based on the computed values of Symmetrical Uncertainty, features were sorted from most to least informative. Then, the top features from the sorted list are passed to Random Forest, Logistic Regression and other well-known classifiers with Leave-One-Out cross validation to construct the best classification model(s) and accordingly select the most important genes from microarray datasets. The results obtained in terms of classification accuracy, running time, root mean square error and other parameters computed on Leukemia and Colon cancer datasets demonstrate the effectiveness of the proposed approach. The proposed method is much faster than many other wrapper or ensemble methods.
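The ranking step described in this abstract can be sketched with the standard definition of symmetrical uncertainty, SU(X, Y) = 2 · IG(X; Y) / (H(X) + H(Y)), where IG is the information gain and H is Shannon entropy. This is a minimal sketch, not the authors' code: it assumes expression values have already been discretized (here into "high"/"low"), and the gene names, labels, and function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(feature, labels):
    """SU = 2 * IG / (H(X) + H(Y)), with IG = H(X) + H(Y) - H(X, Y)."""
    h_x = entropy(feature)
    h_y = entropy(labels)
    if h_x + h_y == 0:
        return 0.0  # both variables are constant, so nothing can be shared
    h_xy = entropy(list(zip(feature, labels)))
    info_gain = h_x + h_y - h_xy
    return 2.0 * info_gain / (h_x + h_y)

def rank_features(columns, labels):
    """Sort feature names from most to least informative by SU."""
    scores = {name: symmetrical_uncertainty(col, labels)
              for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical, already-discretized toy data (4 samples, 3 genes).
labels = ["tumor", "tumor", "normal", "normal"]
columns = {
    "gene_a": ["high", "high", "low", "low"],    # mirrors the labels exactly
    "gene_b": ["high", "low", "high", "low"],    # independent of the labels
    "gene_c": ["high", "high", "high", "low"],   # partially informative
}
ranking = rank_features(columns, labels)
print(ranking)
```

The top entries of `ranking` would then be passed to a classifier of choice; the cut-off for "top" is a tuning decision the abstract does not specify.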
10
Sheikhi G, Altınçay H. A novel dissimilarity metric based on feature‐to‐feature scatter frequencies for clustering‐based feature selection in biomedical data. Comput Intell 2021. [DOI: 10.1111/coin.12470] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Affiliation(s)
- Ghazaal Sheikhi
  - Department of Computer Engineering, Final International University, Kyrenia, North Cyprus, Turkey
- Hakan Altınçay
  - Department of Computer Engineering, Eastern Mediterranean University, Famagusta, North Cyprus, Turkey
11
Xu D, Xu H, Zhang Y, Wang M, Chen W, Gao R. MDAKRLS: Predicting human microbe-disease association based on Kronecker regularized least squares and similarities. J Transl Med 2021; 19:66. [PMID: 33579301 PMCID: PMC7881563 DOI: 10.1186/s12967-021-02732-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 10/09/2020] [Accepted: 02/01/2021] [Indexed: 01/16/2023] Open
Abstract
BACKGROUND: Microbes are closely related to human health and diseases. Identification of disease-related microbes is of great significance for revealing the pathological mechanism of human diseases and understanding the interaction mechanisms between microbes and humans, which is also useful for the prevention, diagnosis and treatment of human diseases. Because the known disease-related microbes are still insufficient, it is necessary to develop effective computational methods and reduce the time and cost of biological experiments. METHODS: In this work, we developed a novel computational method called MDAKRLS to discover potential microbe-disease associations (MDAs) based on the Kronecker regularized least squares. Specifically, we introduced the Hamming interaction profile similarity to measure the similarities of microbes and diseases, in addition to the Gaussian interaction profile kernel similarity. In addition, we introduced the Kronecker product to construct two kinds of Kronecker similarities between microbe-disease pairs. Then, we designed the Kronecker regularized least squares with different Kronecker similarities to obtain prediction scores, respectively, and calculated the final prediction scores by integrating the contributions of different similarities. RESULTS: The AUC values of global leave-one-out cross-validation and 5-fold cross-validation achieved by MDAKRLS were 0.9327 and 0.9023 ± 0.0015, respectively, which were significantly higher than those of five state-of-the-art methods used for comparison. Comparison results demonstrate that MDAKRLS has faster computing speed under two kinds of frameworks. In addition, case studies of inflammatory bowel disease (IBD) and asthma further showed that 19 (IBD) and 19 (asthma) of the top 20 predicted disease-related microbes could be verified by previously published biological or medical literature. CONCLUSIONS: All the evaluation results demonstrated that MDAKRLS has an effective and reliable prediction performance. It may be a useful tool to seek new disease-related microbes and help biomedical researchers to carry out follow-up studies.
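To make the building blocks named in this abstract concrete, the sketch below computes a Gaussian interaction profile (GIP) kernel for microbes and for diseases, combines them with a Kronecker product, and solves a plain regularized least-squares problem on the resulting pair kernel. It is not the MDAKRLS implementation: the Hamming interaction profile similarity and the integration of several Kronecker similarities are omitted, and the toy association matrix, the bandwidth parameter `gamma_prime = 1`, the regularization weight `lam = 1`, and all function names are assumptions.

```python
import numpy as np

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # Bandwidth normalized by the mean squared norm of the interaction profiles.
    gamma = gamma_prime / sq_norms.mean()
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.clip(sq_dists, 0.0, None))

def kron_rls_scores(assoc, lam=1.0):
    """Regularized least squares on the Kronecker pair kernel.

    assoc: (microbes x diseases) binary association matrix.
    Returns a score matrix of the same shape.
    """
    assoc = np.asarray(assoc, dtype=float)
    k_microbe = gip_kernel(assoc)      # rows are microbe profiles
    k_disease = gip_kernel(assoc.T)    # rows are disease profiles
    # With row-major flattening, kron(k_microbe, k_disease) indexes pairs
    # in the same order as assoc.ravel().
    k_pair = np.kron(k_microbe, k_disease)
    y = assoc.ravel()
    alpha = np.linalg.solve(k_pair + lam * np.eye(k_pair.shape[0]), y)
    return (k_pair @ alpha).reshape(assoc.shape)

# Hypothetical toy data: 4 microbes x 3 diseases.
assoc = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [0, 1, 0],
                  [0, 1, 1]], dtype=float)
scores = kron_rls_scores(assoc)
print(np.round(scores, 3))
```

Forming the full Kronecker matrix is only practical for small examples; its size grows with the product of the number of microbes and the number of diseases.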
Affiliation(s)
- Da Xu
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Hanxiao Xu
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Yusen Zhang
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Mingyi Wang
  - Department of Central Lab, Weihai Municipal Hospital, Cheeloo College of Medicine, Shandong University, Weihai, Shandong, China
- Wei Chen
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Rui Gao
  - School of Control Science and Engineering, Shandong University, Jinan, 250061, China
12
Xu H, Xu D, Zhang N, Zhang Y, Gao R. Protein-Protein Interaction Prediction Based on Spectral Radius and General Regression Neural Network. J Proteome Res 2021; 20:1657-1665. [PMID: 33555893 DOI: 10.1021/acs.jproteome.0c00871] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/30/2022]
Abstract
Protein-protein interaction (PPI) not only plays a critical role in cell life activities, but also plays an important role in discovering the mechanism of biological activity, protein function, and disease states. Developing computational methods is of great significance for PPIs prediction since experimental methods are time-consuming and laborious. In this paper, we proposed a PPI prediction algorithm called GRNN-PPI only using the amino acid sequence information based on general regression neural network and two feature extraction methods. Specifically, we designed a new feature extraction method named Mutation Spectral Radius (MSR) to extract evolutionary information by the BLOSUM62 matrix. Meanwhile, we integrated another feature extraction method, autocorrelation description, which can completely extract information on physicochemical properties and protein sequences. The principal component analysis was applied to eliminate noise, and the general regression neural network was adopted as a classifier. The prediction accuracy of the yeast, human, and Helicobacter pylori1 (H. pylori1) data sets were 97.47%, 99.63%, and 99.97%, respectively. In addition, we also conducted experiments on two important PPI networks and six independent data sets. All results were significantly higher than some state-of-the-art methods used for comparison, showing that our method is feasible and robust.
Affiliation(s)
- Hanxiao Xu
  - School of Mathematics and Statistics, Shandong University, Weihai 264209, China
- Da Xu
  - School of Mathematics and Statistics, Shandong University, Weihai 264209, China
- Naiqian Zhang
  - School of Mathematics and Statistics, Shandong University, Weihai 264209, China
- Yusen Zhang
  - School of Mathematics and Statistics, Shandong University, Weihai 264209, China
- Rui Gao
  - School of Control Science and Engineering, Shandong University, Jinan 250061, China
13
Deng F, Shen L, Wang H, Zhang L. Classify multicategory outcome in patients with lung adenocarcinoma using clinical, transcriptomic and clinico-transcriptomic data: machine learning versus multinomial models. Am J Cancer Res 2020; 10:4624-4639. [PMID: 33415023 PMCID: PMC7783755] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2020] [Accepted: 11/25/2020] [Indexed: 06/12/2023] Open
Abstract
Classification of multicategory survival-outcome is important for precision oncology. Machine learning (ML) algorithms have been used to accurately classify multi-category survival-outcome of some cancer-types, but not yet that of lung adenocarcinoma. Therefore, we compared the performances of 3 ML models (random forests, support vector machine [SVM], multilayer perceptron) and multinomial logistic regression (Mlogit) models for classifying 4-category survival-outcome of lung adenocarcinoma using the TCGA. Mlogit model overall performed similar to SVM and multilayer perceptron models (micro-average area under curve=0.82), while random forests model was inferior. Surprisingly, transcriptomic data alone and clinico-transcriptomic data appeared sufficient to accurately classify the 4-category survival-outcome in these patients, but no models using clinical data alone performed well. Notably, NDUFS5, P2RY2, PRPF18, CCL24, ZNF813, MYL6, FLJ41941, POU5F1B, and SUV420H1 were the top-ranked genes that were associated with alive without disease and inversely linked to other outcomes. Similarly, BDKRB2, TERC, DNAJA3, MRPL15, SLC16A13, CRHBP and ACSBG2 were associated with alive with progression and GAL3ST3, AD2, RAB41, HDC, and PLEKHG1 associated with dead with disease, respectively, while also inversely linked other outcomes. These cross-linked genes may be used for risk-stratification and future treatment development.
Affiliation(s)
- Fei Deng
  - School of Electrical and Electronic Engineering, Shanghai Institute of Technology, Shanghai, China
- Lanlan Shen
  - Department of Pediatrics, Baylor College of Medicine, USDA/ARS Children’s Nutrition Research Center, Houston, TX, USA
- He Wang
  - Department of Pathology, Yale University School of Medicine, New Haven, CT, USA
- Lanjing Zhang
  - Department of Pathology, Princeton Medical Center, Plainsboro, NJ, USA
  - Department of Biological Sciences, Rutgers University, Newark, NJ
  - Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, USA
  - Department of Chemical Biology, Ernest Mario School of Pharmacy, Rutgers University, Piscataway, NJ, USA