1
Matinyan S, Filipcik P, Abrahams JP. Deep learning applications in protein crystallography. Acta Crystallogr A Found Adv 2024; 80:1-17. [PMID: 38189437 PMCID: PMC10833361 DOI: 10.1107/s2053273323009300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2023] [Accepted: 10/24/2023] [Indexed: 01/09/2024]
Abstract
Deep learning techniques can recognize complex patterns in noisy, multidimensional data. In recent years, researchers have started to explore the potential of deep learning in the field of structural biology, including protein crystallography. This field has some significant challenges, in particular producing high-quality and well ordered protein crystals. Additionally, collecting diffraction data with high completeness and quality, and determining and refining protein structures can be problematic. Protein crystallographic data are often high-dimensional, noisy and incomplete. Deep learning algorithms can extract relevant features from these data and learn to recognize patterns, which can improve the success rate of crystallization and the quality of crystal structures. This paper reviews progress in this field.
Affiliation(s)
- Jan Pieter Abrahams
  - Biozentrum, Basel University, Basel, Switzerland
  - Paul Scherrer Institute, Villigen, Switzerland

2
Fakhar AZ, Liu J, Pajerowska-Mukhtar KM, Mukhtar MS. The Lost and Found: Unraveling the Functions of Orphan Genes. J Dev Biol 2023; 11:27. [PMID: 37367481 PMCID: PMC10299390 DOI: 10.3390/jdb11020027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Revised: 05/19/2023] [Accepted: 05/26/2023] [Indexed: 06/28/2023]
Abstract
Orphan Genes (OGs) are a mysterious class of genes that have recently gained significant attention. Despite lacking a clear evolutionary history, they are found in nearly all living organisms, from bacteria to humans, and they play important roles in diverse biological processes. The discovery of OGs was first made through comparative genomics followed by the identification of unique genes across different species. OGs tend to be more prevalent in species with larger genomes, such as plants and animals, and their evolutionary origins remain unclear but potentially arise from gene duplication, horizontal gene transfer (HGT), or de novo origination. Although their precise function is not well understood, OGs have been implicated in crucial biological processes such as development, metabolism, and stress responses. To better understand their significance, researchers are using a variety of approaches, including transcriptomics, functional genomics, and molecular biology. This review offers a comprehensive overview of the current knowledge of OGs in all domains of life, highlighting the possible role of dark transcriptomics in their evolution. More research is needed to fully comprehend the role of OGs in biology and their impact on various biological processes.
Affiliation(s)
- M. Shahid Mukhtar
  - Department of Biology, University of Alabama at Birmingham, 1300 University Blvd., Birmingham, AL 35294, USA

3
Patel CN, Mall R, Bensmail H. AI-driven drug repurposing and binding pose meta dynamics identifies novel targets for Monkeypox virus. J Infect Public Health 2023; 16:799-807. [PMID: 36966703 PMCID: PMC10014505 DOI: 10.1016/j.jiph.2023.03.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2022] [Revised: 02/28/2023] [Accepted: 03/05/2023] [Indexed: 03/17/2023]
Abstract
A monkeypox virus (MPXV) outbreak was confirmed in May 2022 and declared a global health emergency by the WHO in July 2022. MPX virions are large, enveloped, brick-shaped particles that contain a linear, double-stranded DNA genome as well as enzymes. MPXV particles bind to the host cell membrane via a variety of viral-host protein interactions. As a result, the wrapped structure is a potential therapeutic target. DeepRepurpose, an artificial intelligence-based compound-viral protein interaction framework, was used in a transfer learning setting to prioritize a set of FDA-approved and investigational drugs that can potentially inhibit MPXV viral proteins. To filter and narrow down the lead compounds from curated collections of pharmaceutical compounds, we used a rigorous computational framework that included homology modeling, molecular docking, dynamic simulations, binding free energy calculations, and binding pose metadynamics. Using this pipeline, we identified Elvitegravir as a potential inhibitor of MPXV.
Affiliation(s)
- Chirag N. Patel
  - Department of Botany, Bioinformatics and Climate Change Impacts Management, School of Science, Gujarat University, Ahmedabad-380009, India
  - Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institute of Health, Frederick, MD-21702, USA
- Raghvendra Mall
  - Department of Immunology, St. Jude Children’s Research Hospital, 262 Danny Thomas Place, Memphis, Tennessee-38105, USA
  - Biotechnology Research Center, Technology Innovation Institute, Abu Dhabi-9639, United Arab Emirates
  - Corresponding author at: Department of Immunology, St. Jude Children’s Research Hospital, 262 Danny Thomas Place, Memphis, Tennessee-38105, USA
- Halima Bensmail
  - Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha-34110, Qatar
  - Corresponding author

4
Wang PH, Zhu YH, Yang X, Yu DJ. GCmapCrys: Integrating graph attention network with predicted contact map for multi-stage protein crystallization propensity prediction. Anal Biochem 2023; 663:115020. [PMID: 36521558 DOI: 10.1016/j.ab.2022.115020] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/25/2022] [Revised: 12/05/2022] [Accepted: 12/10/2022] [Indexed: 12/14/2022]
Abstract
X-ray crystallography is the major approach for atomic-level protein structure determination. Since not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity is critical to guiding the experimental design and improving the success rate of X-ray crystallography experiments. In this work, we propose a new deep learning pipeline, GCmapCrys, for multi-stage crystallization propensity prediction that integrates a graph attention network with a predicted protein contact map. Experimental results on 1548 proteins with known crystallization records demonstrate that GCmapCrys increases the Matthews correlation coefficient by 37.0% on average compared to state-of-the-art protein crystallization propensity predictors. Detailed analyses show that the major advantage of GCmapCrys lies in the efficiency of the graph attention network with the predicted contact map, which effectively associates residue-interaction knowledge with crystallization patterns. In addition, the four designed sequence-based features are complementary and further enhance crystallization propensity prediction.
Affiliation(s)
- Peng-Hao Wang
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, PR China
- Yi-Heng Zhu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, PR China
- Xibei Yang
  - School of Computer, Jiangsu University of Science and Technology, Zhenjiang, 212100, PR China
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, PR China

5
Wang S, Zhao H. SADeepcry: a deep learning framework for protein crystallization propensity prediction using self-attention and auto-encoder networks. Brief Bioinform 2022; 23:6678422. [PMID: 36037090 DOI: 10.1093/bib/bbac352] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/22/2022] [Revised: 07/15/2022] [Accepted: 07/27/2022] [Indexed: 11/14/2022]
Abstract
The X-ray diffraction (XRD) technique based on crystallography is the main experimental method to analyze the three-dimensional structure of proteins. Producing the protein crystals on which the XRD technique relies involves multiple experimental steps, which require considerable manpower and material resources. In addition, studies have shown that not all proteins can form crystals under experimental conditions, and the final success rate of protein crystallization is below 10%. Although some protein crystallization predictors have been developed, few tools can predict multi-stage protein crystallization propensity, and the accuracy of these tools is not satisfactory. In this paper, we propose a novel deep learning framework, named SADeepcry, for predicting protein crystallization propensity. The framework can be used to estimate the outcome of the three steps (protein material production, purification and crystallization) in protein crystallization experiments, as well as the success rate of the final protein crystallization. SADeepcry uses optimized self-attention and auto-encoder modules to extract sequence, structure and physicochemical features from the proteins. Compared with other state-of-the-art protein crystallization propensity prediction models, SADeepcry can capture more complex global, long-distance dependencies in protein sequence information. Our computational results show that SADeepcry increases the Matthews correlation coefficient and area under the curve by 100.3% and 13.4%, respectively, over the DCFCrystal method on the benchmark dataset. The code of SADeepcry is available at https://github.com/zhc940702/SADeepcry.
Affiliation(s)
- Shaokai Wang
  - David R. Cheriton School of Computer Science, University of Waterloo, 200 University Ave W, Waterloo, ON N2L 3G1, Canada
- Haochen Zhao
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
  - Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China

6
Xiouras C, Cameli F, Quilló GL, Kavousanakis ME, Vlachos DG, Stefanidis GD. Applications of Artificial Intelligence and Machine Learning Algorithms to Crystallization. Chem Rev 2022; 122:13006-13042. [PMID: 35759465 DOI: 10.1021/acs.chemrev.2c00141] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/28/2022]
Abstract
Artificial intelligence and specifically machine learning applications are nowadays used in a variety of scientific applications and cutting-edge technologies, where they have a transformative impact. Such an assembly of statistical and linear algebra methods making use of large data sets is becoming more and more integrated into chemistry and crystallization research workflows. This review aims to present, for the first time, a holistic overview of machine learning and cheminformatics applications as a novel, powerful means to accelerate the discovery of new crystal structures, predict key properties of organic crystalline materials, simulate, understand, and control the dynamics of complex crystallization process systems, as well as contribute to high throughput automation of chemical process development involving crystalline materials. We critically review the advances in these new, rapidly emerging research areas, raising awareness in issues such as the bridging of machine learning models with first-principles mechanistic models, data set size, structure, and quality, as well as the selection of appropriate descriptors. At the same time, we propose future research at the interface of applied mathematics, chemistry, and crystallography. Overall, this review aims to increase the adoption of such methods and tools by chemists and scientists across industry and academia.
Affiliation(s)
- Christos Xiouras
  - Chemical Process R&D, Crystallization Technology Unit, Janssen R&D, Turnhoutseweg 30, 2340 Beerse, Belgium
- Fabio Cameli
  - Department of Chemical and Biomolecular Engineering, University of Delaware, 150 Academy Street, Newark, Delaware 19716, United States
- Gustavo Lunardon Quilló
  - Chemical Process R&D, Crystallization Technology Unit, Janssen R&D, Turnhoutseweg 30, 2340 Beerse, Belgium
  - Chemical and BioProcess Technology and Control, Department of Chemical Engineering, Faculty of Engineering Technology, KU Leuven, Gebroeders de Smetstraat 1, 9000 Ghent, Belgium
- Mihail E Kavousanakis
  - School of Chemical Engineering, National Technical University of Athens, Heroon Polytechniou 9, 15780 Zografou, Greece
- Dionisios G Vlachos
  - Department of Chemical and Biomolecular Engineering, University of Delaware, 150 Academy Street, Newark, Delaware 19716, United States
- Georgios D Stefanidis
  - School of Chemical Engineering, National Technical University of Athens, Heroon Polytechniou 9, 15780 Zografou, Greece
  - Laboratory for Chemical Technology, Ghent University; Tech Lane Ghent Science Park 125, B-9052 Ghent, Belgium

7
Zhang X, Xuan J, Yao C, Gao Q, Wang L, Jin X, Li S. A deep learning approach for orphan gene identification in moso bamboo (Phyllostachys edulis) based on the CNN + Transformer model. BMC Bioinformatics 2022; 23:162. [PMID: 35513802 PMCID: PMC9069780 DOI: 10.1186/s12859-022-04702-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/14/2021] [Accepted: 04/28/2022] [Indexed: 12/02/2022]
Abstract
Background: Orphan genes play an important role in the response to environmental stresses in many species, and their identification is a critical step in understanding biological functions. Moso bamboo has high ecological, economic and cultural value. Studies have shown that the growth of moso bamboo is influenced by various stresses. Traditional identification methods are time-consuming and inefficient. Hence, the development of efficient and high-accuracy computational methods for predicting orphan genes is of great significance. Results: In this paper, we propose a novel deep learning model (CNN + Transformer) for identifying orphan genes in moso bamboo. It uses a convolutional neural network in combination with a transformer neural network to capture k-mer amino acids and features between k-mer amino acids in protein sequences. The experimental results show that the average balanced accuracy of CNN + Transformer on the moso bamboo dataset reaches 0.875, and the average Matthews Correlation Coefficient (MCC) reaches 0.471. For the same testing set, the Balanced Accuracy (BA), Geometric Mean (GM), Bookmaker Informedness (BM), and MCC values of the recurrent neural network, long short-term memory, gated recurrent unit, and transformer models are all lower than those of CNN + Transformer, indicating that the model is well suited to OG identification in moso bamboo. Conclusions: The CNN + Transformer model is feasible and yields credible predictions. It may also provide a valuable reference for other related research. To our knowledge, this is the first model to adopt deep learning techniques for identifying orphan genes in plants. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04702-1.
Affiliation(s)
- Xiaodan Zhang
  - Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, 230001, China
  - College of Information and Computer Science, Anhui Agricultural University, Hefei, 230001, China
- Jinxiang Xuan
  - Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, 230001, China
  - College of Information and Computer Science, Anhui Agricultural University, Hefei, 230001, China
- Chensong Yao
  - Graduate School, Anhui Agricultural University, Hefei, 230036, China
- Qijuan Gao
  - Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, 230001, China
- Lianglong Wang
  - Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, 230001, China
  - College of Information and Computer Science, Anhui Agricultural University, Hefei, 230001, China
- Xiu Jin
  - Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, 230001, China
  - College of Information and Computer Science, Anhui Agricultural University, Hefei, 230001, China
- Shaowen Li
  - Anhui Province Key Laboratory of Smart Agricultural Technology and Equipment, Anhui Agriculture University, Hefei, 230001, China
  - College of Information and Computer Science, Anhui Agricultural University, Hefei, 230001, China

8
Lin K, Quan X, Jin C, Shi Z, Yang J. An Interpretable Double-Scale Attention Model for Enzyme Protein Class Prediction Based on Transformer Encoders and Multi-Scale Convolutions. Front Genet 2022; 13:885627. [PMID: 35432476 PMCID: PMC9012241 DOI: 10.3389/fgene.2022.885627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2022] [Accepted: 03/07/2022] [Indexed: 12/01/2022]
Abstract
Background: Classification and annotation of enzyme proteins are fundamental for enzyme research on biological metabolism. Enzyme Commission (EC) numbers provide a standard for hierarchical enzyme class prediction, for which several computational methods have been proposed. However, most of these methods depend on prior distribution information, and none explicitly quantifies amino-acid-level relations and the possible contribution of sub-sequences. Methods: In this study, we propose a double-scale attention enzyme class prediction model named DAttProt with high reusability and interpretability. DAttProt encodes sequences with self-supervised Transformer encoders in pre-training and gathers local features with multi-scale convolutions in fine-tuning. Specifically, a probabilistic double-scale attention weight matrix is designed to aggregate multi-scale features and positional prediction scores. Finally, a fully connected linear classifier performs the final inference from the aggregated features and prediction scores. Results: On the DEEPre and ECPred datasets, DAttProt is competitive with the compared methods on level 0 and outperforms them on deeper task levels, reaching 0.788 accuracy on level 2 of DEEPre and 0.967 macro-F1 on level 1 of ECPred. Moreover, through a case study, we demonstrate that the double-scale attention matrix learns to discover and focus on the positions and scales of bio-functional sub-sequences in the protein. Conclusion: DAttProt provides an effective and interpretable method for enzyme class prediction. It can predict enzyme protein classes accurately and, furthermore, discover enzymatic functional sub-sequences such as protein motifs from both positional and spatial scales.
Affiliation(s)
- Ken Lin
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Xiongwen Quan
  - College of Artificial Intelligence, Nankai University, Tianjin, China
  - *Correspondence: Xiongwen Quan,
- Chen Jin
  - College of Computer Science, Nankai University, Tianjin, China
- Zhuangwei Shi
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Jinglong Yang
  - College of Artificial Intelligence, Nankai University, Tianjin, China

9
Jin C, Shi Z, Kang C, Lin K, Zhang H. TLCrys: Transfer Learning Based Method for Protein Crystallization Prediction. Int J Mol Sci 2022; 23:972. [PMID: 35055158 PMCID: PMC8778968 DOI: 10.3390/ijms23020972] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2021] [Revised: 01/05/2022] [Accepted: 01/14/2022] [Indexed: 11/17/2022]
Abstract
The X-ray diffraction technique is one of the most common methods of ascertaining protein structures, yet only 2-10% of proteins can produce diffraction-quality crystals. Several computational methods have been proposed so far to predict protein crystallization. Nevertheless, the current state-of-the-art computational methods are limited by the scarcity of experimental data. Thus, the prediction accuracy of existing models has not reached the ideal level. To address these problems, we propose a novel transfer-learning-based framework for protein crystallization prediction, named TLCrys. The framework proceeds in two steps: pre-training and fine-tuning. The pre-training step adopts an attention mechanism to extract both global and local information from the protein sequences. The representation learned in the pre-training step is regarded as knowledge to be transferred and fine-tuned to enhance the performance of crystallization prediction. During pre-training, TLCrys adopts a multi-task learning method, which not only improves the learning ability of protein encoding, but also enhances the robustness and generalization of the protein representation. The multi-head self-attention layer guarantees that different levels of the protein representation can be extracted by the fine-tuning step. During transfer learning, the fine-tuning strategy used by TLCrys improves the task-specialized learning ability of the network. Our method significantly outperforms all previous predictors in predicting five crystallization stages. Furthermore, the proposed methodology generalizes well to other protein sequence classification tasks.
Affiliation(s)
- Chen Jin
  - College of Computer Science, Nankai University, Tianjin 300350, China; (C.J.); (C.K.)
- Zhuangwei Shi
  - College of Artificial Intelligence, Nankai University, Tianjin 300350, China; (Z.S.); (K.L.)
- Chuanze Kang
  - College of Computer Science, Nankai University, Tianjin 300350, China; (C.J.); (C.K.)
- Ken Lin
  - College of Artificial Intelligence, Nankai University, Tianjin 300350, China; (Z.S.); (K.L.)
- Han Zhang
  - College of Artificial Intelligence, Nankai University, Tianjin 300350, China; (Z.S.); (K.L.)

10
Affiliation(s)
- Melanie Vollmar
  - Diamond Light Source Ltd., Harwell Science & Innovation Campus, Didcot, UK
- Gwyndaf Evans
  - Diamond Light Source Ltd., Harwell Science & Innovation Campus, Didcot, UK
  - Rosalind Franklin Institute, Harwell Science & Innovation Campus, Didcot, UK

11
Ghadermarzi S, Krawczyk B, Song J, Kurgan L. XRRpred: Accurate Predictor of Crystal Structure Quality from Protein Sequence. Bioinformatics 2021; 37:4366-4374. [PMID: 34247234 DOI: 10.1093/bioinformatics/btab509] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/26/2021] [Revised: 06/10/2021] [Accepted: 07/06/2021] [Indexed: 11/12/2022]
Abstract
MOTIVATION: X-ray crystallography was used to produce nearly 90% of protein structures. These efforts were supported by numerous sequence-based tools that accurately predict crystallizable proteins. However, protein structures vary widely in their quality, typically measured with resolution and R-free. This affects the ability to use these structures for some applications, including rational drug design and molecular docking, and motivates the development of methods that accurately predict structure quality. RESULTS: We introduce XRRpred, the first predictor of the resolution and R-free values from protein sequences. XRRpred relies on original sequence profiles, hand-crafted features, empirically selected and parametrized regressors, and modern resampling techniques. Using an independent test dataset, we show that XRRpred provides accurate predictions of resolution and R-free. We demonstrate that XRRpred's predictions correctly model the relationship between resolution and R-free and reproduce structure quality relations between structural classes of proteins. We also show that XRRpred significantly outperforms indirect alternative ways to predict structure quality, which include predictors of crystallization propensity and an alignment-based approach. XRRpred is available as a convenient webserver that allows batch predictions and offers informative visualization of the results. AVAILABILITY: http://biomine.cs.vcu.edu/servers/XRRPred/.
Affiliation(s)
- Sina Ghadermarzi
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
- Bartosz Krawczyk
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
- Jiangning Song
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA

12
Sequence-Based Prediction of Transmembrane Protein Crystallization Propensity. Interdiscip Sci 2021; 13:693-702. [PMID: 34143353 DOI: 10.1007/s12539-021-00448-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2020] [Revised: 05/31/2021] [Accepted: 06/04/2021] [Indexed: 10/21/2022]
Abstract
Transmembrane proteins play a vital role in cell life activities. There are several techniques to determine transmembrane protein structures, and X-ray crystallography is the primary methodology. However, due to the special properties of transmembrane proteins, it is still hard to determine their structures by X-ray crystallography. To reduce experimental consumption and improve experimental efficiency, it is of great significance to develop computational methods for predicting the crystallization propensity of transmembrane proteins. In this work, we propose a sequence-based machine learning method, namely Prediction of TransMembrane protein Crystallization propensity (PTMC), to predict the propensity of transmembrane protein crystallization. First, we obtained several general sequence features and the specific encoded features of relative solvent accessibility and hydrophobicity. Second, feature selection was employed to filter out redundant and irrelevant features; the optimal feature subset is composed of hydrophobicity, amino acid composition and relative solvent accessibility. Finally, we chose extreme gradient boosting after comparing it with several other machine learning methods. Comparative results on the independent test set indicate that PTMC outperforms state-of-the-art sequence-based methods in terms of sensitivity, specificity, accuracy, Matthews Correlation Coefficient (MCC) and Area Under the receiver operating characteristic Curve (AUC). In comparison with two competitors, Bcrystal and TMCrys, PTMC achieves improvements of 0.132 and 0.179 for sensitivity, 0.014 and 0.127 for specificity, 0.037 and 0.192 for accuracy, 0.128 and 0.362 for MCC, and 0.027 and 0.125 for AUC, respectively.
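The evaluation metrics named in this abstract can be illustrated with a minimal Python sketch that derives sensitivity, specificity, accuracy and MCC from binary confusion-matrix counts. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the PTMC paper; AUC is omitted because it requires ranked scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    # Sensitivity (recall): fraction of actual positives that were predicted positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of actual negatives that were predicted negative.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Accuracy: fraction of all predictions that were correct.
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Matthews correlation coefficient; defined as 0 here when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because MCC uses all four cells of the confusion matrix, it is less easily inflated by class imbalance than accuracy, which may be why several abstracts in this list report it.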
13
Wang Y, Li Z, Zhang Y, Ma Y, Huang Q, Chen X, Dai Z, Zou X. Performance improvement for a 2D convolutional neural network by using SSC encoding on protein-protein interaction tasks. BMC Bioinformatics 2021; 22:184. [PMID: 33845759 PMCID: PMC8042949 DOI: 10.1186/s12859-021-04111-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/22/2020] [Accepted: 03/30/2021] [Indexed: 11/18/2022]
Abstract
BACKGROUND: The interactions of proteins are determined by their sequences and affect the regulation of the cell cycle, signal transduction and metabolism, which is of extraordinary significance to modern proteomics research. Despite advances in experimental technology, it is still expensive, laborious, and time-consuming to determine protein-protein interactions (PPIs), and there is a strong demand for effective bioinformatics approaches to identify potential PPIs. Considering the large amount of PPI data, a high-performance processor can be utilized to enhance the capability of the deep learning method and directly predict protein sequences. RESULTS: We propose the Sequence-Statistics-Content protein sequence encoding format (SSC), based on information extraction from the original sequence, to further improve the performance of the convolutional neural network. The original protein sequences are encoded in a three-channel format by introducing statistical information (the second channel) and bigram encoding information (the third channel), which increases the number of unique sequence features and enhances the performance of the deep learning model. On protein-protein interaction prediction tasks, the results of the 2D convolutional neural network (2D CNN) with the SSC encoding method are better than those of the 1D CNN with one-hot encoding. The independent validation of new interactions from the HIPPIE database (version 2.1, published on July 18, 2017) and the validation of directly predicted results with a molecular docking tool indicate the effectiveness of the proposed protein encoding improvement in the CNN model. CONCLUSION: The proposed protein sequence encoding method is efficient at improving the capability of the CNN model on protein sequence-related tasks and may also be effective at enhancing the capability of other machine learning or deep learning methods. Prediction accuracy and molecular docking validation showed considerable improvement compared to the existing one-hot encoding method, indicating that the SSC encoding method may be useful for analyzing protein sequence-related tasks. The source code of the proposed methods is freely available for academic research at https://github.com/wangy496/SSC-format/.
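To make the encodings mentioned in this abstract more concrete, the following Python sketch shows a plain one-hot encoding of a protein sequence alongside a simple bigram (residue-pair) index encoding. This is not the SSC format itself, whose exact channel layout the abstract does not specify; the alphabet ordering, function names and toy sequence are assumptions chosen for illustration.

```python
# The 20 standard amino acids; this ordering is an arbitrary assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Return one row per residue, with a single 1 at that residue's index."""
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[aa]] = 1  # raises KeyError for non-standard residues
        rows.append(row)
    return rows


def bigram_ids(seq):
    """Map each adjacent residue pair to an integer id in [0, 400)."""
    n = len(AMINO_ACIDS)
    return [AA_INDEX[a] * n + AA_INDEX[b] for a, b in zip(seq, seq[1:])]


seq = "MKV"  # hypothetical toy sequence
print(one_hot(seq)[0])
print(bigram_ids(seq))
```

One-hot rows carry only the identity of each residue, whereas bigram ids also encode which residue follows it, which hints at why adding such information as extra channels could give a CNN more to work with.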
Collapse
Affiliation(s)
- Yang Wang
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
| | - Zhanchao Li
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
| | - Yanfei Zhang
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
| | - Yingjun Ma
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
| | - Qixing Huang
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
| | - Xingyu Chen
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, 510006, People's Republic of China
| | - Zong Dai
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China
- Research Institute of Sun Yat-Sen University in Shenzhen, Shenzhen, 518000, People's Republic of China
| | - Xiaoyong Zou
- School of Chemistry, Sun Yat-Sen University, Guangzhou, 510275, People's Republic of China.
- Research Institute of Sun Yat-Sen University in Shenzhen, Shenzhen, 518000, People's Republic of China.
| |
|
14
|
Xuan W, Liu N, Huang N, Li Y, Wang J. CLPred: a sequence-based protein crystallization predictor using BLSTM neural network. Bioinformatics 2021; 36:i709-i717. [PMID: 33381840 DOI: 10.1093/bioinformatics/btaa791] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Accepted: 09/07/2020] [Indexed: 01/05/2023] Open
Abstract
MOTIVATION Determining the structures of proteins is a critical step to understand their biological functions. Crystallography-based X-ray diffraction technique is the main method for experimental protein structure determination. However, the underlying crystallization process, which needs multiple time-consuming and costly experimental steps, has a high attrition rate. To overcome this issue, a series of in silico methods have been developed with the primary aim of selecting the protein sequences that are promising to be crystallized. However, the predictive performance of the current methods is modest. RESULTS We propose a deep learning model, so-called CLPred, which uses a bidirectional recurrent neural network with long short-term memory (BLSTM) to capture the long-range interaction patterns between k-mers amino acids to predict protein crystallizability. Using sequence only information, CLPred outperforms the existing deep-learning predictors and a vast majority of sequence-based diffraction-quality crystals predictors on three independent test sets. The results highlight the effectiveness of BLSTM in capturing non-local, long-range inter-peptide interaction patterns to distinguish proteins that can result in diffraction-quality crystals from those that cannot. CLPred has been steadily improved over the previous window-based neural networks, which is able to predict crystallization propensity with high accuracy. CLPred can also be improved significantly if it incorporates additional features from pre-extracted evolutional, structural and physicochemical characteristics. The correctness of CLPred predictions is further validated by the case studies of Sox transcription factor family member proteins and Zika virus non-structural proteins. AVAILABILITY AND IMPLEMENTATION https://github.com/xuanwenjing/CLPred.
Affiliation(s)
- Wenjing Xuan
- School of Computer Science and Engineering, Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China
| | - Ning Liu
- School of Computer Science and Engineering
| | - Neng Huang
- School of Computer Science and Engineering
| | - Yaohang Li
- Department of Computer Science, Old Dominion University, Norfolk, VA 23529, USA
| | - Jianxin Wang
- School of Computer Science and Engineering, Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 410083, China
| |
|
15
|
Mall R, Elbasir A, Almeer H, Islam Z, Kolatkar PR, Chawla S, Ullah E. A Modelling Framework for Embedding-based Predictions for Compound-Viral Protein Activity. Bioinformatics 2021; 37:2544-2555. [PMID: 33638345 PMCID: PMC8163000 DOI: 10.1093/bioinformatics/btab130] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/09/2020] [Revised: 02/16/2021] [Accepted: 02/24/2021] [Indexed: 11/14/2022] Open
Abstract
Motivation A global effort is underway to identify compounds for the treatment of COVID-19. Since de novo compound design is an extremely long, time-consuming, and expensive process, efforts are underway to discover existing compounds that can be repurposed for COVID-19 and new viral diseases. Model We propose a machine learning representation framework that uses deep learning induced vector embeddings of compounds and viral proteins as features to predict compound-viral protein activity. The prediction model in turn uses a consensus framework to rank approved compounds against viral proteins of interest. Results Our consensus framework achieves a high mean Pearson correlation of 0.916, mean R2 of 0.840 and a low mean squared error of 0.313 for the task of compound-viral protein activity prediction on an independent test set. As a use case, we identify a ranked list of 47 compounds common to three main proteins of SARS-CoV-2 virus (PL-PRO, 3CL-PRO and Spike protein) as potential targets including 21 antivirals, 15 anticancer, 5 antibiotics and 6 other investigational human compounds. We perform additional molecular docking simulations to demonstrate that the majority of these compounds have low binding energies and thus high binding affinity with the potential to be effective against the SARS-CoV-2 virus. Availability All the source code and data are available at: https://github.com/raghvendra5688/Drug-Repurposing and https://dx.doi.org/10.17632/8rrwnbcgmx.3. We also implemented a web-server at: https://machinelearning-protein.qcri.org/index.html. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Raghvendra Mall
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
| | - Abdurrahman Elbasir
- ICT Division, College of Science and Engineering, Hamad Bin Khalifa University, Doha, 34110, Qatar
| | - Hossam Almeer
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
| | - Zeyaul Islam
- Qatar Biomedical Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
| | - Prasanna R Kolatkar
- Qatar Biomedical Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
| | - Sanjay Chawla
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
| | - Ehsan Ullah
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
| |
|
16
|
Yu L, Liu F, Li Y, Luo J, Jing R. DeepT3_4: A Hybrid Deep Neural Network Model for the Distinction Between Bacterial Type III and IV Secreted Effectors. Front Microbiol 2021; 12:605782. [PMID: 33552038 PMCID: PMC7858263 DOI: 10.3389/fmicb.2021.605782] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/13/2020] [Accepted: 01/04/2021] [Indexed: 01/17/2023] Open
Abstract
Gram-negative bacteria can deliver secreted proteins (also known as secreted effectors) directly into host cells through type III secretion system (T3SS), type IV secretion system (T4SS), and type VI secretion system (T6SS) and cause various diseases. These secreted effectors are heavily involved in the interactions between bacteria and host cells, so their identification is crucial for the discovery and development of novel anti-bacterial drugs. It is currently challenging to accurately distinguish type III secreted effectors (T3SEs) and type IV secreted effectors (T4SEs) because neither T3SEs nor T4SEs contain N-terminal signal peptides, and some of these effectors have similar evolutionary conserved profiles and sequence motifs. To address this challenge, we develop a deep learning (DL) approach called DeepT3_4 to correctly classify T3SEs and T4SEs. We generate amino-acid character dictionary and sequence-based features extracted from effector proteins and subsequently implement these features into a hybrid model that integrates recurrent neural networks (RNNs) and deep neural networks (DNNs). After training the model, the hybrid neural network classifies secreted effectors into two different classes with an accuracy, F-value, and recall of over 80.0%. Our approach stands for the first DL approach for the classification of T3SEs and T4SEs, providing a promising supplementary tool for further secretome studies.
Affiliation(s)
- Lezheng Yu
- School of Chemistry and Materials Science, Guizhou Education University, Guiyang, China
| | - Fengjuan Liu
- School of Geography and Resources, Guizhou Education University, Guiyang, China
| | - Yizhou Li
- College of Cybersecurity, Sichuan University, Chengdu, China
| | - Jiesi Luo
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China
| | - Runyu Jing
- College of Cybersecurity, Sichuan University, Chengdu, China
| |
|
17
|
Elhadd T, Mall R, Bashir M, Palotti J, Fernandez-Luque L, Farooq F, Mohanadi DA, Dabbous Z, Malik RA, Abou-Samra AB. Artificial Intelligence (AI) based machine learning models predict glucose variability and hypoglycaemia risk in patients with type 2 diabetes on a multiple drug regimen who fast during ramadan (The PROFAST - IT Ramadan study). Diabetes Res Clin Pract 2020; 169:108388. [PMID: 32858096 DOI: 10.1016/j.diabres.2020.108388] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 05/03/2020] [Accepted: 08/19/2020] [Indexed: 10/23/2022]
Abstract
OBJECTIVE To develop a machine-based algorithm from clinical and demographic data, physical activity and glucose variability to predict hyperglycaemic and hypoglycaemic excursions in patients with type 2 diabetes on multiple glucose lowering therapies who fast during Ramadan. PATIENTS AND METHODS Thirteen patients (10 males and three females) with type 2 diabetes on 3 or more anti-diabetic medications were studied with a Fitbit-2 pedometer device and Freestyle Libre (Abbott Diagnostics) 2 weeks before and 2 weeks during Ramadan. Several machine learning techniques were trained to predict blood glucose levels in a regression framework utilising physical activity and contemporaneous blood glucose levels, comparing Ramadan to non-Ramadan days. RESULTS The median age of participants was 51 years (IQR 49-52); median BMI was 33.2 kg/m2 (IQR 33.0-35.9) and median HbA1c was 7.3% (IQR 6.7-7.8). The optimal model using physical activity achieved an R2 of 0.548 and a mean absolute error (MAE) of 30.30. The addition of electronic health record (ehr) information increased R2 to 0.636 and reduced MAE to 26.89 and the time of the day feature further increased R2 to 0.768 and reduced MAE to 20.55. Combining all the features together resulted in an optimal XGBoost model with an R2 of 0.836 and MAE of 17.47. This model accurately estimated normal glucose levels in 2584/2715 (95.2%) readings and hyperglycaemic events in 852/1031 (82.6%) readings, but fewer hypoglycaemic events (48/172 (27.9%)). The optimal XGBoost model prioritized age, gender, BMI and HbA1c followed by glucose levels and physical activity. Interestingly, the blood glucose level prediction by our model was influenced by use of SGLT2i. CONCLUSION XGBoost, a machine learning AI algorithm achieves high predictive performance for normal and hyperglycaemic excursions, but has limited predictive value for hypoglycaemia in patients on multiple therapies who fast during Ramadan.
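The two regression metrics quoted in this abstract, R2 (coefficient of determination) and mean absolute error, can be sketched in a few lines of Python. This is a generic illustration of the standard definitions, not the study's evaluation code; the function names and the glucose-like values in the usage note are made up for the example.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot  # undefined if all y_true values are equal
```

With hypothetical readings `y_true = [100, 150, 200]` and predictions `y_pred = [110, 140, 200]`, the MAE is 20/3 (about 6.67) and R2 is 0.96. MAE is expressed in the units of the target, whereas R2 is unitless.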
Affiliation(s)
| | | | | | - Joao Palotti
- Qatar Computer Research Institute (QCRI), Doha, Qatar; Hamad Medical Corporation, Doha, Qatar; CSAIL, Massachusetts Institute of Technology, USA
| | | | - Faisal Farooq
- Qatar Computer Research Institute (QCRI), Doha, Qatar
| | | | | | | | | |
|
18
|
Elbasir A, Mall R, Kunji K, Rawi R, Islam Z, Chuang GY, Kolatkar PR, Bensmail H. BCrystal: an interpretable sequence-based protein crystallization predictor. Bioinformatics 2020; 36:1429-1438. [PMID: 31603511 DOI: 10.1093/bioinformatics/btz762] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 04/01/2019] [Revised: 09/19/2019] [Accepted: 10/08/2019] [Indexed: 02/01/2023] Open
Abstract
MOTIVATION X-ray crystallography has facilitated the majority of protein structures determined to date. Sequence-based predictors that can accurately estimate protein crystallization propensities would be highly beneficial to overcome the high expenditure, large attrition rate, and to reduce the trial-and-error settings required for crystallization. RESULTS In this study, we present a novel model, BCrystal, which uses an optimized gradient boosting machine (XGBoost) on sequence, structural and physicochemical features extracted from the proteins of interest. BCrystal also provides explanations, highlighting the most important features for the predicted crystallization propensity of an individual protein using the SHAP algorithm. On three independent test sets, BCrystal outperforms state-of-the-art sequence-based methods by more than 12.5% in accuracy, 18% in recall and 0.253 in Matthews correlation coefficient, with an average accuracy of 93.7%, recall of 96.63% and Matthews correlation coefficient of 0.868. For relative solvent accessibility of exposed residues, we observed higher values to associate positively with protein crystallizability and the number of disordered regions, fraction of coils and tripeptide stretches that contain multiple histidines associate negatively with crystallizability. The higher accuracy of BCrystal enables it to accurately screen for sequence variants with enhanced crystallizability. AVAILABILITY AND IMPLEMENTATION Our BCrystal webserver is at https://machinelearning-protein.qcri.org/ and source code is available at https://github.com/raghvendra5688/BCrystal. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
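Since this abstract reports the Matthews correlation coefficient alongside accuracy and recall, a minimal Python sketch of the standard binary-classification definition may help. It is a generic illustration, not code from BCrystal; the function name and the convention of returning 0.0 when the denominator vanishes are assumptions made for this example.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion-matrix counts; ranges from -1 to +1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom
```

A perfect classifier (`tp=50, tn=50, fp=0, fn=0`) scores 1.0, and a classifier that is right exactly half the time on balanced classes (`25, 25, 25, 25`) scores 0.0. Unlike accuracy, the coefficient stays informative when the crystallizable and non-crystallizable classes are imbalanced.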
Affiliation(s)
- Abdurrahman Elbasir
- ICT Division, College of Science and Engineering, Hamad Bin Khalifa University
| | - Raghvendra Mall
- Data Analytics, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha 34110, Qatar
| | - Khalid Kunji
- Data Analytics, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha 34110, Qatar
| | - Reda Rawi
- Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
| | - Zeyaul Islam
- Diabetes Research Center, Qatar Biomedical Research Institute, Hamad Bin Khalifa University, Doha 34100, Qatar
| | - Gwo-Yu Chuang
- Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
| | - Prasanna R Kolatkar
- Diabetes Research Center, Qatar Biomedical Research Institute, Hamad Bin Khalifa University, Doha 34100, Qatar
| | - Halima Bensmail
- Data Analytics, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha 34110, Qatar
| |
|
19
|
Zhu YH, Hu J, Ge F, Li F, Song J, Zhang Y, Yu DJ. Accurate multistage prediction of protein crystallization propensity using deep-cascade forest with sequence-based features. Brief Bioinform 2020; 22:5839971. [PMID: 32436937 DOI: 10.1093/bib/bbaa076] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 02/03/2020] [Revised: 04/09/2020] [Accepted: 04/13/2020] [Indexed: 11/13/2022] Open
Abstract
X-ray crystallography is the major approach for determining atomic-level protein structures. Because not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity provides critical help in guiding experimental design and improving the success rate of X-ray crystallography experiments. This study has developed a new machine-learning-based pipeline that uses a newly developed deep-cascade forest (DCF) model with multiple types of sequence-based features to predict protein crystallization propensity. Based on the developed pipeline, two new protein crystallization propensity predictors, denoted as DCFCrystal and MDCFCrystal, have been implemented. DCFCrystal is a multistage predictor that can estimate the success propensities of the three individual steps (production of protein material, purification and production of crystals) in the protein crystallization process. MDCFCrystal is a single-stage predictor that aims to estimate the probability that a protein will pass through the entire crystallization process. Moreover, DCFCrystal is designed for general proteins, whereas MDCFCrystal is specially designed for membrane proteins, which are notoriously difficult to crystallize. DCFCrystal and MDCFCrystal were separately tested on two benchmark datasets consisting of 12 289 and 950 proteins, respectively, with known crystallization results from various experimental records. The experimental results demonstrated that DCFCrystal and MDCFCrystal increased the value of Matthews correlation coefficient by 199.7% and 77.8%, respectively, compared to the best of other state-of-the-art protein crystallization propensity predictors. 
Detailed analyses show that the major advantages of DCFCrystal and MDCFCrystal lie in the efficiency of the DCF model and the sensitivity of the sequence-based features used, especially the newly designed pseudo-predicted hybrid solvent accessibility (PsePHSA) feature, which improves crystallization recognition by incorporating sequence-order information with solvent accessibility of residues. Meanwhile, the new crystal-dataset constructions help to train the models with more comprehensive crystallization knowledge.
|
20
|
Basith S, Manavalan B, Hwan Shin T, Lee G. Machine intelligence in peptide therapeutics: A next‐generation tool for rapid disease screening. Med Res Rev 2020; 40:1276-1314. [DOI: 10.1002/med.21658] [Citation(s) in RCA: 139] [Impact Index Per Article: 34.8] [Received: 10/18/2019] [Revised: 11/26/2019] [Accepted: 12/16/2019] [Indexed: 12/12/2022]
Affiliation(s)
- Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
| | | | - Tae Hwan Shin
- Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
| | - Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Suwon, Republic of Korea
| |
|
21
|
Li F, Chen J, Leier A, Marquez-Lago T, Liu Q, Wang Y, Revote J, Smith AI, Akutsu T, Webb GI, Kurgan L, Song J. DeepCleave: a deep learning predictor for caspase and matrix metalloprotease substrates and cleavage sites. Bioinformatics 2019; 36:1057-1065. [PMID: 31566664 PMCID: PMC8215920 DOI: 10.1093/bioinformatics/btz721] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Received: 06/07/2019] [Revised: 08/13/2019] [Accepted: 09/25/2019] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION Proteases are enzymes that cleave target substrate proteins by catalyzing the hydrolysis of peptide bonds between specific amino acids. While the functional proteolysis regulated by proteases plays a central role in the 'life and death' cellular processes, many of the corresponding substrates and their cleavage sites were not found yet. Availability of accurate predictors of the substrates and cleavage sites would facilitate understanding of proteases' functions and physiological roles. Deep learning is a promising approach for the development of accurate predictors of substrate cleavage events. RESULTS We propose DeepCleave, the first deep learning-based predictor of protease-specific substrates and cleavage sites. DeepCleave uses protein substrate sequence data as input and employs convolutional neural networks with transfer learning to train accurate predictive models. High predictive performance of our models stems from the use of high-quality cleavage site features extracted from the substrate sequences through the deep learning process, and the application of transfer learning, multiple kernels and attention layer in the design of the deep network. Empirical tests against several related state-of-the-art methods demonstrate that DeepCleave outperforms these methods in predicting caspase and matrix metalloprotease substrate-cleavage sites. AVAILABILITY AND IMPLEMENTATION The DeepCleave webserver and source code are freely available at http://deepcleave.erc.monash.edu/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | | | - André Leier
- Department of Genetics, USA; Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Tatiana Marquez-Lago
- Department of Genetics, USA; Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Quanzhong Liu
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Yanze Wang
- College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Jerico Revote
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - A Ian Smith
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | | | | |
|
22
|
Park H, Mall R, Alharbi FH, Sanvito S, Tabet N, Bensmail H, El-Mellouhi F. Learn-and-Match Molecular Cations for Perovskites. J Phys Chem A 2019; 123:7323-7334. [DOI: 10.1021/acs.jpca.9b06208] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Indexed: 01/08/2023]
Affiliation(s)
- Heesoo Park
- Qatar Environment and Energy Research Institute, Hamad Bin Khalifa University, P.O. Box 34110, Doha, Qatar
| | - Raghvendra Mall
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
| | - Fahhad H. Alharbi
- Qatar Environment and Energy Research Institute, Hamad Bin Khalifa University, P.O. Box 34110, Doha, Qatar
| | - Stefano Sanvito
- School of Physics, AMBER and CRANN Institute, Trinity College, Dublin 2, Ireland
| | - Nouar Tabet
- Qatar Environment and Energy Research Institute, Hamad Bin Khalifa University, P.O. Box 34110, Doha, Qatar
| | - Halima Bensmail
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
| | - Fedwa El-Mellouhi
- Qatar Environment and Energy Research Institute, Hamad Bin Khalifa University, P.O. Box 34110, Doha, Qatar
| |
|
23
|
Palotti J, Mall R, Aupetit M, Rueschman M, Singh M, Sathyanarayana A, Taheri S, Fernandez-Luque L. Benchmark on a large cohort for sleep-wake classification with machine learning techniques. NPJ Digit Med 2019; 2:50. [PMID: 31304396 PMCID: PMC6555808 DOI: 10.1038/s41746-019-0126-9] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 01/17/2019] [Accepted: 05/06/2019] [Indexed: 11/17/2022] Open
Abstract
Accurately measuring sleep and its quality with polysomnography (PSG) is an expensive task. Actigraphy, an alternative, has been proven cheap and relatively accurate. However, the largest experiments conducted to date have had only hundreds of participants. In this work, we processed the data of the recently published Multi-Ethnic Study of Atherosclerosis (MESA) Sleep study to have both PSG and actigraphy data synchronized. We propose the adoption of this publicly available large dataset, which is at least one order of magnitude larger than any other dataset, to systematically compare existing methods for the detection of sleep-wake stages, thus fostering the creation of new algorithms. We also implemented and compared state-of-the-art methods to score sleep-wake stages, which range from the widely used traditional algorithms to recent machine learning approaches. We identified, among the traditional algorithms, two approaches that perform better than the algorithm implemented by the actigraphy device used in the MESA Sleep experiments. The performance of the machine learning algorithms, with regard to accuracy and F1 score, was also superior to the device's native algorithm and comparable to human annotation. Future research in developing new sleep-wake scoring algorithms, in particular machine learning approaches, will be highly facilitated by the cohort used here. We exemplify this potential by showing that two particular deep-learning architectures, CNN and LSTM, among the many recently created, can achieve accuracy scores significantly higher than other methods for the same tasks.
Affiliation(s)
- Joao Palotti
- Qatar Computing Research Institute, HBKU, Doha, Qatar
| | | | | | - Michael Rueschman
- Brigham and Women’s Hospital, Boston, MA USA
- Harvard University, Boston, MA USA
| | | | | | | | | |
|