1
Asim MN, Asif T, Hassan F, Dengel A. Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models. Database (Oxford) 2025; 2025:baaf027. [PMID: 40448683 DOI: 10.1093/database/baaf027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2024] [Revised: 02/06/2025] [Accepted: 03/26/2025] [Indexed: 06/02/2025]
Abstract
Protein sequence analysis examines the order of amino acids within protein sequences to unlock a wealth of knowledge about biological processes and genetic disorders. It helps in forecasting disease susceptibility by finding unique protein signatures, or biomarkers, that are linked to particular disease states. Protein sequence analysis through wet-lab experiments is expensive, time-consuming, and error-prone. To facilitate large-scale proteomics sequence analysis, the biological community is striving to use AI to transition from wet-lab to computer-aided applications. However, proteomics and AI are two distinct fields, and developing AI-driven protein sequence analysis applications requires knowledge of both domains. To bridge the gap between the two fields, various review articles have been written. However, these articles focus on a few individual tasks or specific applications rather than providing a comprehensive overview of a wide range of tasks and applications. To meet the need for a comprehensive review that presents a holistic view of a wide array of tasks and applications, this manuscript makes several contributions. It bridges the gap between the proteomics and AI fields by presenting a comprehensive array of AI-driven applications for 63 distinct protein sequence analysis tasks. It equips AI researchers with the biological foundations of 63 protein sequence analysis tasks. It supports the development of AI-driven protein sequence analysis applications by providing comprehensive details of 68 protein databases. It presents a rich data landscape, encompassing 627 benchmark datasets across 63 diverse protein sequence analysis tasks. It highlights the use of 25 unique word embedding methods and 13 language models in AI-driven protein sequence analysis applications. It accelerates the development of AI-driven applications by reporting current state-of-the-art performance across 63 protein sequence analysis tasks.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Tayyaba Asif
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
- Faiza Hassan
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
- Andreas Dengel
  - German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
2
Er AG, Ding DY, Er B, Uzun M, Cakmak M, Sadee C, Durhan G, Ozmen MN, Tanriover MD, Topeli A, Aydin Son Y, Tibshirani R, Unal S, Gevaert O. Multimodal data fusion using sparse canonical correlation analysis and cooperative learning: a COVID-19 cohort study. NPJ Digit Med 2024; 7:117. [PMID: 38714751 PMCID: PMC11076490 DOI: 10.1038/s41746-024-01128-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Accepted: 04/25/2024] [Indexed: 05/10/2024] Open
Abstract
Through technological innovations, patient cohorts can be examined from multiple views with high-dimensional, multiscale biomedical data to classify clinical phenotypes and predict outcomes. Here, we aim to present our approach for analyzing multimodal data using unsupervised and supervised sparse linear methods in a COVID-19 patient cohort. This prospective cohort study of 149 adult patients was conducted in a tertiary care academic center. First, we used sparse canonical correlation analysis (CCA) to identify and quantify relationships across different data modalities, including viral genome sequencing, imaging, clinical data, and laboratory results. Then, we used cooperative learning to predict the clinical outcome of COVID-19 patients: Intensive care unit admission. We show that serum biomarkers representing severe disease and acute phase response correlate with original and wavelet radiomics features in the LLL frequency channel (cor(Xu1, Zv1) = 0.596, p value < 0.001). Among radiomics features, histogram-based first-order features reporting the skewness, kurtosis, and uniformity have the lowest negative, whereas entropy-related features have the highest positive coefficients. Moreover, unsupervised analysis of clinical data and laboratory results gives insights into distinct clinical phenotypes. Leveraging the availability of global viral genome databases, we demonstrate that the Word2Vec natural language processing model can be used for viral genome encoding. It not only separates major SARS-CoV-2 variants but also allows the preservation of phylogenetic relationships among them. Our quadruple model using Word2Vec encoding achieves better prediction results in the supervised task. The model yields area under the curve (AUC) and accuracy values of 0.87 and 0.77, respectively. 
Our study illustrates that sparse CCA and cooperative learning are powerful techniques for handling high-dimensional, multimodal data to investigate multivariate associations in unsupervised and supervised tasks.
Affiliation(s)
- Ahmet Gorkem Er
  - Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, Stanford, CA, 94305, USA
  - Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, 06800, Ankara, Turkey
  - Department of Infectious Diseases and Clinical Microbiology, Hacettepe University Faculty of Medicine, 06230, Ankara, Turkey
- Daisy Yi Ding
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, 94305, USA
- Berrin Er
  - Department of Internal Medicine, Division of Intensive Care Medicine, Hacettepe University Faculty of Medicine, 06230, Ankara, Turkey
- Mertcan Uzun
  - Department of Infectious Diseases and Clinical Microbiology, Hacettepe University Faculty of Medicine, 06230, Ankara, Turkey
- Mehmet Cakmak
  - Department of Internal Medicine, Hacettepe University Faculty of Medicine, 06230, Ankara, Turkey
- Christoph Sadee
  - Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, Stanford, CA, 94305, USA
- Gamze Durhan
  - Department of Radiology, Hacettepe University Faculty of Medicine, 06230, Ankara, Turkey
- Mustafa Nasuh Ozmen
  - Department of Radiology, Hacettepe University Faculty of Medicine, 06230, Ankara, Turkey
- Mine Durusu Tanriover
  - Department of Internal Medicine, Hacettepe University Faculty of Medicine, 06230, Ankara, Turkey
- Arzu Topeli
  - Department of Internal Medicine, Division of Intensive Care Medicine, Hacettepe University Faculty of Medicine, 06230, Ankara, Turkey
- Yesim Aydin Son
  - Department of Health Informatics, Graduate School of Informatics, Middle East Technical University, 06800, Ankara, Turkey
- Robert Tibshirani
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, 94305, USA
  - Department of Statistics, Stanford University, Stanford, CA, 94305, USA
- Serhat Unal
  - Department of Infectious Diseases and Clinical Microbiology, Hacettepe University Faculty of Medicine, 06230, Ankara, Turkey
- Olivier Gevaert
  - Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University, Stanford, CA, 94305, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, 94305, USA
3
Dhibar S, Jana B. Accurate Prediction of Antifreeze Protein from Sequences through Natural Language Text Processing and Interpretable Machine Learning Approaches. J Phys Chem Lett 2023; 14:10727-10735. [PMID: 38009833 DOI: 10.1021/acs.jpclett.3c02817] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Antifreeze proteins (AFPs) bind to growing ice planes owing to their structural complementarity, thereby inhibiting ice-crystal growth by thermal hysteresis. Classification of AFPs from sequence is a difficult task due to their low sequence similarity, and therefore the usual sequence similarity algorithms, like BLAST and PSI-BLAST, are not efficient. Here, a method combining n-gram feature vectors and machine learning models to accelerate the identification of potential AFPs from sequences is proposed. All these n-gram features are extracted using the k-mer counting method. The comparative analysis reveals that, among different machine learning models, XGBoost outperforms the others in predicting AFPs from sequence when pentamers are used as the feature vector. When tested on an independent dataset, our method performed better than other existing ones, with sensitivity of 97.50%, recall of 98.30%, and f1 score of 99.10%. Further, we used the SHAP method, which provides important insight into the functional activity of AFPs.
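The k-mer counting step described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy sequences, the function names, and the choice to build the vocabulary only from observed pentamers are assumptions made for illustration. The resulting count vectors are the kind of input a classifier such as XGBoost could be trained on.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count overlapping k-mers (n-grams of residues) in one protein sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def build_vocabulary(sequences, k: int):
    """Collect every k-mer observed in the corpus, in sorted order."""
    observed = set()
    for seq in sequences:
        observed.update(kmer_counts(seq, k))
    return sorted(observed)


def feature_vector(sequence: str, vocabulary, k: int):
    """Map a sequence to a fixed-length vector of k-mer counts."""
    counts = kmer_counts(sequence, k)
    return [counts.get(kmer, 0) for kmer in vocabulary]


# Hypothetical toy sequences, not taken from any AFP dataset.
sequences = ["MKTAYIAKQR", "MKTAYLLKQR"]
K = 5  # pentamers, the k-mer size the abstract reports as working best

vocabulary = build_vocabulary(sequences, K)
vectors = [feature_vector(seq, vocabulary, K) for seq in sequences]
```

Restricting the vocabulary to observed k-mers keeps the vectors small; enumerating all 20^5 possible pentamers would give 3.2 million columns, most of them zero.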
Affiliation(s)
- Saikat Dhibar
  - School of Chemical Sciences, Indian Association for the Cultivation of Science, Jadavpur, Kolkata 700032, India
- Biman Jana
  - School of Chemical Sciences, Indian Association for the Cultivation of Science, Jadavpur, Kolkata 700032, India
4
Samy SS, Karthick S, Ghosal M, Singh S, Sudarsan JS, Nithiyanantham S. Adoption of machine learning algorithm for predicting the length of stay of patients (construction workers) during COVID pandemic. INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY : AN OFFICIAL JOURNAL OF BHARATI VIDYAPEETH'S INSTITUTE OF COMPUTER APPLICATIONS AND MANAGEMENT 2023; 15:1-9. [PMID: 37360312 PMCID: PMC10250170 DOI: 10.1007/s41870-023-01296-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2022] [Accepted: 05/15/2023] [Indexed: 06/28/2023]
Abstract
The construction sector in a rapidly developing country like India is a very unorganized sector. A large number of workers were affected and hospitalized during the pandemic. This situation is costing the sector heavily in several respects. This research study was conducted as part of an effort to use machine learning algorithms to improve construction company health and safety policies. Length of stay (LOS) measures how long a patient will stay in a hospital. Predicting LOS is very useful not only for hospitals, but also for construction companies to measure resources and reduce costs. Predicting LOS has become an important step in most hospitals before admitting patients. In this study, we used the Medical Information Mart for Intensive Care (MIMIC-III) dataset and applied four different machine learning algorithms: decision tree classifier, random forest, artificial neural network (ANN), and logistic regression. First, we performed data pre-processing to clean up the dataset. In the next step, we performed feature selection using the SelectKBest algorithm with chi2 as the scoring function, and applied one-hot encoding. We then split the data into training and test sets and applied the machine learning algorithms. The metric used for comparison was accuracy. After implementing the algorithms, the accuracy was compared. Random forest was found to perform best at 89%. Afterwards, we performed hyperparameter tuning using a grid search algorithm on the random forest to obtain higher accuracy. The final accuracy is 90%. This kind of research can help improve health security policies by introducing modern computational techniques, and can also help optimize resources.
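The chi-squared feature selection step mentioned in this abstract can be sketched in plain Python. This is an illustrative reconstruction under stated assumptions, not the authors' pipeline: the abstract gives no implementation details, the function names and toy data are hypothetical, and the score shown is one common form of the statistic for non-negative (for example one-hot encoded) features, comparing per-class feature totals with the totals expected if feature and class were independent.

```python
from collections import defaultdict


def chi2_scores(X, y):
    """Chi-squared score of each non-negative feature against the class labels."""
    n_samples = len(X)
    n_features = len(X[0])
    # Total of each feature over all samples.
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]

    class_sizes = defaultdict(int)
    observed = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(X, y):
        class_sizes[label] += 1
        for j, value in enumerate(row):
            observed[label][j] += value

    scores = [0.0] * n_features
    for label, size in class_sizes.items():
        prior = size / n_samples  # class frequency
        for j in range(n_features):
            expected = prior * feature_totals[j]
            if expected > 0:
                scores[j] += (observed[label][j] - expected) ** 2 / expected
    return scores


def select_k_best(scores, k):
    """Return the column indices of the k highest-scoring features."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy data: column 0 tracks the label, column 1 does not.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]

scores = chi2_scores(X, y)
selected = select_k_best(scores, k=1)
```

In this toy case the informative column receives a positive score and the uninformative one scores zero, so keeping the top-scoring column retains column 0.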
Affiliation(s)
- S. Selvakumara Samy
  - Department of Computational Intelligence, SRM Institute of Science and Technology, Kattankulathur, Tamilnadu 603203 India
- S. Karthick
  - Department of Computational Intelligence, SRM Institute of Science and Technology, Kattankulathur, Tamilnadu 603203 India
- Meghna Ghosal
  - Department of Computational Intelligence, SRM Institute of Science and Technology, Kattankulathur, Tamilnadu 603203 India
- Sameer Singh
  - Department of Computational Intelligence, SRM Institute of Science and Technology, Kattankulathur, Tamilnadu 603203 India
- J. S. Sudarsan
  - School of Energy and Environment, NICMAR University, 25/1, Balewadi, Pune, 411045 India
- S. Nithiyanantham
  - Department of Physics, (Ultrasonic/NDT and Bio-Physics Divisions), Thiru. Vi. Kalyanasundaram Government Arts and Science College (Affiliated to Bharathidasan University, Thiruchirapalli), Thiruvarur, Tamilnadu 610003 India
5
Sinwar D, Dhaka VS, Tesfaye BA, Raghuwanshi G, Kumar A, Maakar SK, Agrawal S. Artificial Intelligence and Deep Learning Assisted Rapid Diagnosis of COVID-19 from Chest Radiographical Images: A Survey. CONTRAST MEDIA & MOLECULAR IMAGING 2022; 2022:1306664. [PMID: 36304775 PMCID: PMC9581633 DOI: 10.1155/2022/1306664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 09/06/2022] [Accepted: 09/27/2022] [Indexed: 01/26/2023]
Abstract
Artificial Intelligence (AI) has been applied successfully in many real-life domains for solving complex problems. With the invention of Machine Learning (ML) paradigms, it has become convenient for researchers to predict outcomes based on past data. ML is now acting as a major weapon against the COVID-19 pandemic by detecting symptomatic cases at an early stage and warning people about its future effects. COVID-19 spread so widely across the globe in a short period because of the shortage of testing facilities and delays in test reports. To address this challenge, AI can be effectively applied to produce fast as well as cost-effective solutions. Many researchers have come up with AI-based solutions for preliminary diagnosis using chest CT images, respiratory sound analysis, voice analysis comparing symptomatic with asymptomatic persons, and so forth. Some AI-based applications claim good accuracy in predicting the chances of being COVID-19-positive. Within a short period, a large body of research has been published on the identification of COVID-19. This paper has carefully examined and presented a comprehensive survey of more than 110 papers that came from various reputed sources, that is, Springer, IEEE, Elsevier, MDPI, arXiv, and medRxiv. Most of the papers selected for this survey presented candid work to detect and classify COVID-19 using deep-learning-based models from chest X-rays and CT scan images. We hope that this survey covers most of the work and provides insights to the research community in proposing efficient as well as accurate solutions for fighting the pandemic.
Affiliation(s)
- Deepak Sinwar
  - Department of Computer and Communication Engineering, Manipal University Jaipur, Jaipur, India
- Vijaypal Singh Dhaka
  - Department of Computer and Communication Engineering, Manipal University Jaipur, Jaipur, India
- Biniyam Alemu Tesfaye
  - Department of Computer Science, College of Informatics, Bule Hora University, Bule Hora, Ethiopia
- Ghanshyam Raghuwanshi
  - Department of Computer and Communication Engineering, Manipal University Jaipur, Jaipur, India
- Ashish Kumar
  - Department of Mathematics and Statistics, Manipal University Jaipur, Jaipur, India
- Sunil Kr. Maakar
  - School of Computing Science & Engineering, Galgotias University, Greater Noida, India
- Sanjay Agrawal
  - Department of Electrical Engineering, Rajkiya Engineering College, Akbarpur, Ambedkar Nagar, India