1
|
Jiang J, Zhang C, Ke L, Hayes N, Zhu Y, Qiu H, Zhang B, Zhou T, Wei GW. A review of machine learning methods for imbalanced data challenges in chemistry. Chem Sci 2025:d5sc00270b. [PMID: 40271022 PMCID: PMC12013631 DOI: 10.1039/d5sc00270b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2025] [Accepted: 04/06/2025] [Indexed: 04/25/2025] Open
Abstract
Imbalanced data, where certain classes are significantly underrepresented in a dataset, is a widespread machine learning (ML) challenge across various fields of chemistry, yet it remains inadequately addressed. This data imbalance can lead to biased ML or deep learning (DL) models, which fail to accurately predict the underrepresented classes, thus limiting the robustness and applicability of these models. With the rapid advancement of ML and DL algorithms, several promising solutions to this issue have emerged, prompting the need for a comprehensive review of current methodologies. In this review, we examine the prominent ML approaches used to tackle the imbalanced data challenge in different areas of chemistry, including resampling techniques, data augmentation techniques, algorithmic approaches, and feature engineering strategies. Each of these methods is evaluated in the context of its application across various aspects of chemistry, such as drug discovery, materials science, cheminformatics, and catalysis. We also explore future directions for overcoming the imbalanced data challenge and emphasize data augmentation via physical models, large language models (LLMs), and advanced mathematics. The benefit of balanced data in new material design and production and the persistent challenges are discussed. Overall, this review aims to elucidate the prevalent ML techniques applied to mitigate the impacts of imbalanced data within the field of chemistry and offer insights into future directions for research and application.
Collapse
Affiliation(s)
- Jian Jiang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University Wuhan 430200 P R. China
- Department of Mathematics, Michigan State University East Lansing Michigan 48824 USA
| | - Chunhuan Zhang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University Wuhan 430200 P R. China
| | - Lu Ke
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University Wuhan 430200 P R. China
| | - Nicole Hayes
- Department of Mathematics, Michigan State University East Lansing Michigan 48824 USA
| | - Yueying Zhu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University Wuhan 430200 P R. China
| | - Huahai Qiu
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University Wuhan 430200 P R. China
| | - Bengong Zhang
- Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University Wuhan 430200 P R. China
| | - Tianshou Zhou
- Key Laboratory of Computational Mathematics, Guangdong Province, School of Mathematics, Sun Yat-sen University Guangzhou 510006 P R. China
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University East Lansing Michigan 48824 USA
- Department of Electrical and Computer Engineering, Michigan State University East Lansing Michigan 48824 USA
- Department of Biochemistry and Molecular Biology, Michigan State University East Lansing Michigan 48824 USA
| |
Collapse
|
2
|
Nath A, Chaube R. Mining Chemogenomic Spaces for Prediction of Drug-Target Interactions. Methods Mol Biol 2024; 2714:155-169. [PMID: 37676598 DOI: 10.1007/978-1-0716-3441-7_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/08/2023]
Abstract
The pipeline of drug discovery consists of a number of processes; drug-target interaction determination is one of the salient steps among them. Computational prediction of drug-target interactions can facilitate in reducing the search space of experimental wet lab-based verifications steps, thus considerably reducing time and other resources dedicated to the drug discovery pipeline. While machine learning-based methods are more widespread for drug-target interaction prediction, network-centric methods are also evolving. In this chapter, we focus on the process of the drug-target interaction prediction from the perspective of using machine learning algorithms and the various stages involved for developing an accurate predictor.
Collapse
Affiliation(s)
- Abhigyan Nath
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur, India
| | - Radha Chaube
- Department of Zoology, Institute of Science, Banaras Hindu University, Varanasi, India
| |
Collapse
|
3
|
Bao W, Gu Y, Chen B, Yu H. Golgi_DF: Golgi proteins classification with deep forest. Front Neurosci 2023; 17:1197824. [PMID: 37250391 PMCID: PMC10213405 DOI: 10.3389/fnins.2023.1197824] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2023] [Accepted: 04/19/2023] [Indexed: 05/31/2023] Open
Abstract
Introduction Golgi is one of the components of the inner membrane system in eukaryotic cells. Its main function is to send the proteins involved in the synthesis of endoplasmic reticulum to specific parts of cells or secrete them outside cells. It can be seen that Golgi is an important organelle for eukaryotic cells to synthesize proteins. Golgi disorders can cause various neurodegenerative and genetic diseases, and the accurate classification of Golgi proteins is helpful to develop corresponding therapeutic drugs. Methods This paper proposed a novel Golgi proteins classification method, which is Golgi_DF with the deep forest algorithm. Firstly, the classified proteins method can be converted the vector features containing various information. Secondly, the synthetic minority oversampling technique (SMOTE) is utilized to deal with the classified samples. Next, the Light GBM method is utilized to feature reduction. Meanwhile, the features can be utilized in the penultimate dense layer. Therefore, the reconstructed features can be classified with the deep forest algorithm. Results In Golgi_DF, this method can be utilized to select the important features and identify Golgi proteins. Experiments show that the well-performance than the other art-of-the state methods. Golgi_DF as a standalone tools, all its source codes publicly available at https://github.com/baowz12345/golgiDF. Discussion Golgi_DF employed reconstructed feature to classify the Golgi proteins. Such method may achieve more available features among the UniRep features.
Collapse
Affiliation(s)
- Wenzheng Bao
- School of Information Engineering, Xuzhou University of Technology, Xuzhou, China
| | - Yujian Gu
- School of Information Engineering, Xuzhou University of Technology, Xuzhou, China
| | - Baitong Chen
- Department of Stomatology, Xuzhou First People’s Hospital, Xuzhou, China
- The Affiliated Hospital of China University of Mining and Technology, Xuzhou, China
| | - Huiping Yu
- Department of Neurosurgery, The Hospital of Joint Logistic, Quanzhou, China
| |
Collapse
|
4
|
Abstract
Background:
Bioluminescence is a unique and significant phenomenon in nature.
Bioluminescence is important for the lifecycle of some organisms and is valuable in biomedical
research, including for gene expression analysis and bioluminescence imaging technology. In recent
years, researchers have identified a number of methods for predicting bioluminescent proteins
(BLPs), which have increased in accuracy, but could be further improved.
Method:
In this study, a new bioluminescent proteins prediction method, based on a voting
algorithm, is proposed. Four methods of feature extraction based on the amino acid sequence were
used. 314 dimensional features in total were extracted from amino acid composition,
physicochemical properties and k-spacer amino acid pair composition. In order to obtain the highest
MCC value to establish the optimal prediction model, a voting algorithm was then used to build the
model. To create the best performing model, the selection of base classifiers and vote counting rules
are discussed.
Results:
The proposed model achieved 93.4% accuracy, 93.4% sensitivity and
91.7% specificity in the test set, which was better than any other method. A previous prediction of
bioluminescent proteins in three lineages was also improved using the model building method,
resulting in greatly improved accuracy.
Collapse
Affiliation(s)
- Shulin Zhao
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba Science City, Japan
| | - Jun Zhang
- Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Shuguang Han
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
5
|
iBLP: An XGBoost-Based Predictor for Identifying Bioluminescent Proteins. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:6664362. [PMID: 33505515 PMCID: PMC7808816 DOI: 10.1155/2021/6664362] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/21/2020] [Revised: 12/13/2020] [Accepted: 12/28/2020] [Indexed: 02/07/2023]
Abstract
Bioluminescent proteins (BLPs) are a class of proteins that widely distributed in many living organisms with various mechanisms of light emission including bioluminescence and chemiluminescence from luminous organisms. Bioluminescence has been commonly used in various analytical research methods of cellular processes, such as gene expression analysis, drug discovery, cellular imaging, and toxicity determination. However, the identification of bioluminescent proteins is challenging as they share poor sequence similarities among them. In this paper, we briefly reviewed the development of the computational identification of BLPs and subsequently proposed a novel predicting framework for identifying BLPs based on eXtreme gradient boosting algorithm (XGBoost) and using sequence-derived features. To train the models, we collected BLP data from bacteria, eukaryote, and archaea. Then, for getting more effective prediction models, we examined the performances of different feature extraction methods and their combinations as well as classification algorithms. Finally, based on the optimal model, a novel predictor named iBLP was constructed to identify BLPs. The robustness of iBLP has been proved by experiments on training and independent datasets. Comparison with other published method further demonstrated that the proposed method is powerful and could provide good performance for BLP identification. The webserver and software package for BLP identification are freely available at http://lin-group.cn/server/iBLP.
Collapse
|
6
|
Nath A, Leier A. Improved cytokine-receptor interaction prediction by exploiting the negative sample space. BMC Bioinformatics 2020; 21:493. [PMID: 33129275 PMCID: PMC7603689 DOI: 10.1186/s12859-020-03835-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2020] [Accepted: 10/23/2020] [Indexed: 01/19/2023] Open
Abstract
Background Cytokines act by binding to specific receptors in the plasma membrane of target cells. Knowledge of cytokine–receptor interaction (CRI) is very important for understanding the pathogenesis of various human diseases—notably autoimmune, inflammatory and infectious diseases—and identifying potential therapeutic targets. Recently, machine learning algorithms have been used to predict CRIs. “Gold Standard” negative datasets are still lacking and strong biases in negative datasets can significantly affect the training of learning algorithms and their evaluation. To mitigate the unrepresentativeness and bias inherent in the negative sample selection (non-interacting proteins), we propose a clustering-based approach for representative negative sample selection. Results We used deep autoencoders to investigate the effect of different sampling approaches for non-interacting pairs on the training and the performance of machine learning classifiers. By using the anomaly detection capabilities of deep autoencoders we deduced the effects of different categories of negative samples on the training of learning algorithms. Random sampling for selecting non-interacting pairs results in either over- or under-representation of hard or easy to classify instances. When K-means based sampling of negative datasets is applied to mitigate the inadequacies of random sampling, random forest (RF) together with the combined feature set of atomic composition, physicochemical-2grams and two different representations of evolutionary information performs best. Average model performances based on leave-one-out cross validation (loocv) over ten different negative sample sets that each model was trained with, show that RF models significantly outperform the previous best CRI predictor in terms of accuracy (+ 5.1%), specificity (+ 13%), mcc (+ 0.1) and g-means value (+ 5.1). Evaluations using tenfold cv and training/testing splits confirm the competitive performance. Conclusions A comparative analysis was performed to assess the effect of three different sampling methods (random, K-means and uniform sampling) on the training of learning algorithms using different evaluation methods. Models trained on K-means sampled datasets generally show a significantly improved performance compared to those trained on random selections—with RF seemingly benefiting most in our particular setting. Our findings on the sampling are highly relevant and apply to many applications of supervised learning approaches in bioinformatics.
Collapse
Affiliation(s)
- Abhigyan Nath
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur, 492001, India.
| | - André Leier
- Department of Genetics, Department of Cell Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA.
| |
Collapse
|
7
|
Zhang D, Guan ZX, Zhang ZM, Li SH, Dao FY, Tang H, Lin H. Recent Development of Computational Predicting Bioluminescent Proteins. Curr Pharm Des 2020; 25:4264-4273. [PMID: 31696804 DOI: 10.2174/1381612825666191107100758] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2019] [Accepted: 11/04/2019] [Indexed: 12/22/2022]
Abstract
Bioluminescent Proteins (BLPs) are widely distributed in many living organisms that act as a key role of light emission in bioluminescence. Bioluminescence serves various functions in finding food and protecting the organisms from predators. With the routine biotechnological application of bioluminescence, it is recognized to be essential for many medical, commercial and other general technological advances. Therefore, the prediction and characterization of BLPs are significant and can help to explore more secrets about bioluminescence and promote the development of application of bioluminescence. Since the experimental methods are money and time-consuming for BLPs identification, bioinformatics tools have played important role in fast and accurate prediction of BLPs by combining their sequences information with machine learning methods. In this review, we summarized and compared the application of machine learning methods in the prediction of BLPs from different aspects. We wish that this review will provide insights and inspirations for researches on BLPs.
Collapse
Affiliation(s)
- Dan Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zheng-Xing Guan
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zi-Mei Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Shi-Hao Li
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Fu-Ying Dao
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hua Tang
- Department of Pathophysiology, Southwest Medical University, Luzhou 646000, China
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
Collapse
|
8
|
Yadav A, Sahu R, Nath A. A representation transfer learning approach for enhanced prediction of growth hormone binding proteins. Comput Biol Chem 2020; 87:107274. [PMID: 32416563 DOI: 10.1016/j.compbiolchem.2020.107274] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2019] [Revised: 04/19/2020] [Accepted: 04/27/2020] [Indexed: 10/24/2022]
Abstract
Growth hormone binding proteins (GHBPs) are soluble proteins that play an important role in the modulation of signaling pathways pertaining to growth hormones. GHBPs are selective and bind non-covalently with growth hormones, but their functions are still not fully understood. Identification and characterization of GHBPs are the preliminary steps for understanding their roles in various cellular processes. As wet lab based experimental methods involve high cost and labor, computational methods can facilitate in narrowing down the search space of putative GHBPs. Performance of machine learning algorithms largely depends on the quality of features that it feeds on. Informative and non-redundant features generally result in enhanced performance and for this purpose feature selection algorithms are commonly used. In the present work, a novel representation transfer learning approach is presented for prediction of GHBPs. For their accurate prediction, deep autoencoder based features were extracted and subsequently SMO-PolyK classifier is trained. The prediction model is evaluated by both leave one out cross validation (LOOCV) and hold out independent testing set. On LOOCV, the prediction model achieved 89.8%% accuracy, with 89.4% sensitivity and 90.2% specificity and accuracy of 93.5%, sensitivity of 90.2% and specificity of 96.8% is attained on the hold out testing set. Further a comparison was made between the full set of sequence-based features, top performing sequence features extracted using feature selection algorithm, deep autoencoder based features and generalized low rank model based features on the prediction accuracy. Principal component analysis of the representative features along with t-sne visualization demonstrated the effectiveness of deep features in prediction of GHBPs. The present method is robust and accurate and may complement other wet lab based methods for identification of novel GHBPs.
Collapse
Affiliation(s)
- Amisha Yadav
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur 492001, India
| | - Roopshikha Sahu
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur 492001, India
| | - Abhigyan Nath
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur 492001, India.
| |
Collapse
|
9
|
Nath A, Sahu GK. Exploiting ensemble learning to improve prediction of phospholipidosis inducing potential. J Theor Biol 2019; 479:37-47. [PMID: 31310757 DOI: 10.1016/j.jtbi.2019.07.009] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2018] [Revised: 06/19/2019] [Accepted: 07/12/2019] [Indexed: 12/15/2022]
Abstract
Phospholipidosis is characterized by the presence of excessive accumulation of phospholipids in different tissue types (lungs, liver, eyes, kidneys etc.) caused by cationic amphiphilic drugs. Electron microscopy analysis has revealed the presence of lamellar inclusion bodies as the hallmark of phospholipidosis. Some phospholipidosis causing compounds can cause tissue specific inflammatory/retrogressive changes. Reliable and accurate in silico methods could facilitate early screening of phospholipidosis inducing compounds which can subsequently speed up the pharmaceutical drug discovery pipelines. In the present work, stacking ensembles are implemented for combining a number of different base learners to develop predictive models (a total of 256 trained machine learning models were tested) for phospholipidosis inducing compounds using a wide range of molecular descriptors (ChemMine, JOELib, Open babel and RDK descriptors) and structural alerts as input features. The best model consisting of stacked ensemble of machine learning algorithms with random forest as the second level learner outperformed other base and ensemble learners. JOELib descriptors along with structural alerts performed better than the other types of descriptor sets. The best ensemble model achieved an overall accuracy of 88.23%, sensitivity of 86.27%, specificity of 90.20%, mcc of 0.765, auc of 0.896 with 88.21 g-means. To assess the robustness and stability of the best ensemble model, it is further evaluated using stratified 10×10 fold cross validation and holdout testing sets (repeated 10 times) achieving 84.83% mean accuracy with 0.708 mean mcc and 88.46% mean accuracy with 0.771 mean mcc respectively. A comparison of different meta classifiers (Generalized linear regression, Gradient boosting machines, Random forest and Deep learning neural networks) in stacking ensemble revealed that random forest is the better choice for combining multiple classification models.
Collapse
Affiliation(s)
- Abhigyan Nath
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur 492001, India.
| | - Gopal Krishna Sahu
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur 492001, India
| |
Collapse
|
10
|
Nath A. Prediction and molecular insights into fungal adhesins and adhesin like proteins. Comput Biol Chem 2019; 80:333-340. [PMID: 31078912 DOI: 10.1016/j.compbiolchem.2019.05.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2018] [Revised: 04/02/2019] [Accepted: 05/02/2019] [Indexed: 10/26/2022]
Abstract
Adhesion is the foremost step in pathogenesis and biofilm formation and is facilitated by a special class of cell wall proteins known as adhesins. Formation of biofilms in catheters and other medical devices subsequently leads to infections. As compared to bacterial adhesins, there is relatively less work for the characterization and identification of fungal adhesins. Understanding the sequence characterization of fungal adhesins may facilitate a better understanding of its role in pathogenesis. Experimental methods for investigation and characterization of fungal adhesins are labor intensive and expensive. Therefore, there is a need for fast and efficient computational methods for the identification and characterization of fungal adhesins. The aim of the current study is twofold: (i) to develop an accurate predictor for fungal adhesins, (ii) to sieve out the prominent molecular signatures present in fungal adhesins. Of the many supervised learning algorithms implemented in the current study, voting ensembles resulted in enhanced prediction accuracy. The best voting-ensemble consisting of three support vector machines with three different kernels (PolyK, RBF, PuK) achieved an accuracy of 94.9% on leave one out cross validation and 98.0% accuracy on blind testing set. A preference/avoidance list of molecular features as well as human interpretable rules are also extracted giving insights into the general sequence features of fungal adhesins. Fungal adhesins are characterized by high Threonine and Cysteine and avoidance for Phenylalanine and Methionine. They also have avoidance for average hydrophilicity. The current analysis possibly will facilitate the understanding of the mechanism of fungal adhesin function which may further help in designing methods for restricting adhesin mediated pathogenesis.
Collapse
Affiliation(s)
- Abhigyan Nath
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur, 492001, India.
| |
Collapse
|
11
|
Tiwari AK, Nath A, Subbiah K, Shukla KK. Enhanced Prediction for Observed Peptide Count in Protein Mass Spectrometry Data by Optimally Balancing the Training Dataset. INT J PATTERN RECOGN 2017. [DOI: 10.1142/s0218001417500409] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Imbalanced dataset affects the learning of classifiers. This imbalance problem is almost ubiquitous in biological datasets. Resampling is one of the common methods to deal with the imbalanced dataset problem. In this study, we explore the learning performance by varying the balancing ratios of training datasets, consisting of the observed peptides and absent peptides in the Mass Spectrometry experiment on the different machine learning algorithms. It has been observed that the ideal balancing ratio has yielded better performance than the imbalanced dataset, but it was not the best as compared to some intermediate ratio. By experimenting using Synthetic Minority Oversampling Technique (SMOTE) at different balancing ratios, we obtained the best results by achieving sensitivity of 92.1%, specificity value of 94.7%, overall accuracy of 93.4%, MCC of 0.869, and AUC of 0.982 with boosted random forest algorithm. This study also identifies the most discriminating features by applying the feature ranking algorithm. From the results of current experiments, it can be inferred that the performance of machine learning algorithms for the classification tasks can be enhanced by selecting optimally balanced training dataset, which can be obtained by suitably modifying the class distribution.
Collapse
Affiliation(s)
- Anoop Kumar Tiwari
- Department of Computer Science, Banaras Hindu University, Varanasi, India
| | - Abhigyan Nath
- Department of Computer Science, Banaras Hindu University, Varanasi, India
| | | | - Kaushal Kumar Shukla
- Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, India
| |
Collapse
|
12
|
Zhang J, Chai H, Yang G, Ma Z. Prediction of bioluminescent proteins by using sequence-derived features and lineage-specific scheme. BMC Bioinformatics 2017; 18:294. [PMID: 28583090 PMCID: PMC5460367 DOI: 10.1186/s12859-017-1709-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2016] [Accepted: 05/25/2017] [Indexed: 11/10/2022] Open
Abstract
Background Bioluminescent proteins (BLPs) widely exist in many living organisms. As BLPs are featured by the capability of emitting lights, they can be served as biomarkers and easily detected in biomedical research, such as gene expression analysis and signal transduction pathways. Therefore, accurate identification of BLPs is important for disease diagnosis and biomedical engineering. In this paper, we propose a novel accurate sequence-based method named PredBLP (Prediction of BioLuminescent Proteins) to predict BLPs. Results We collect a series of sequence-derived features, which have been proved to be involved in the structure and function of BLPs. These features include amino acid composition, dipeptide composition, sequence motifs and physicochemical properties. We further prove that the combination of four types of features outperforms any other combinations or individual features. To remove potential irrelevant or redundant features, we also introduce Fisher Markov Selector together with Sequential Backward Selection strategy to select the optimal feature subsets. Additionally, we design a lineage-specific scheme, which is proved to be more effective than traditional universal approaches. Conclusion Experiment on benchmark datasets proves the robustness of PredBLP. We demonstrate that lineage-specific models significantly outperform universal ones. We also test the generalization capability of PredBLP based on independent testing datasets as well as newly deposited BLPs in UniProt. PredBLP is proved to be able to exceed many state-of-art methods. A web server named PredBLP, which implements the proposed method, is free available for academic use. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1709-6) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Jian Zhang
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, Jilin Province, 130117, People's Republic of China.,School of Computer and Information Technology, Xinyang Normal University, Xinyang, Henan Province, 464000, People's Republic of China
| | - Haiting Chai
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, Jilin Province, 130117, People's Republic of China
| | - Guifu Yang
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, Jilin Province, 130117, People's Republic of China
| | - Zhiqiang Ma
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, Jilin Province, 130117, People's Republic of China.
| |
Collapse
|
13
|
Nath A, Karthikeyan S. Enhanced identification of β-lactamases and its classes using sequence, physicochemical and evolutionary information with sequence feature characterization of the classes. Comput Biol Chem 2017; 68:29-38. [PMID: 28231526 DOI: 10.1016/j.compbiolchem.2017.02.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2016] [Revised: 01/19/2017] [Accepted: 02/10/2017] [Indexed: 01/24/2023]
Abstract
β-lactamases provides one of the most successful means of evading the therapeutic effects of β lactam class of antibiotics by many gram positive and gram negative bacteria. On the basis of sequence identity, β-lactamases have been identified into four distinct classes- A, B, C and D. The classes A, C and D are the serine β-lactamases and class B is the metallo-lactamse. In the present study, we developed a two stage cascade classification system. The first-stage performs the classification of β-lactamases from non-β-lactamases and the second-stage performs the further classification of β-lactamases into four different β-lactamase classes. In the first-stage binary classification, we obtained an accuracy of 97.3% with a sensitivity of 89.1% and specificity of 98.0% and for the second stage multi-class classification, we obtained an accuracy of 87.3% for the class A, 91.0% for the class B, 96.3% for the class C and 96.4% for class D. A systematic statistical analysis is carried out on the sieved-out, correctly-predicted instances from the second stage classifier, which revealed some interesting patterns. We analyzed different classes of β-lactamases on the basis of sequence and physicochemical property differences between them. Among amino acid composition, H, W, Y and V showed significant differences between the different β-lactamases classes. Differences in average physicochemical properties are observed for isoelectric point, volume, flexibility, hydrophobicity, bulkiness and charge in one or more β-lactamase classes. The key differences in physicochemical property groups can be observed in small and aromatic groups. Among amino acid property group n-grams except charged n-grams, all other property group n-grams are significant in one or more classes. Statistically significant differences in dipeptide counts among different β-lactamase classes are also reported.
Collapse
Affiliation(s)
- Abhigyan Nath
- Department of Computer Science, Banaras Hindu University, Varanasi, 221005, India.
| | - S Karthikeyan
- Department of Computer Science, Banaras Hindu University, Varanasi, 221005, India.
| |
Collapse
|
14
|
Kim YJ, Heo J, Park KS, Kim S. Proposition of novel classification approach and features for improved real-time arrhythmia monitoring. Comput Biol Med 2016; 75:190-202. [PMID: 27318329 DOI: 10.1016/j.compbiomed.2016.06.009] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2016] [Revised: 06/03/2016] [Accepted: 06/07/2016] [Indexed: 12/13/2022]
Abstract
Arrhythmia refers to a group of conditions in which the heartbeat is irregular, fast, or slow due to abnormal electrical activity in the heart. Some types of arrhythmia such as ventricular fibrillation may result in cardiac arrest or death. Thus, arrhythmia detection becomes an important issue, and various studies have been conducted. Additionally, an arrhythmia detection algorithm for portable devices such as mobile phones has recently been developed because of increasing interest in e-health care. This paper proposes a novel classification approach and features, which are validated for improved real-time arrhythmia monitoring. The classification approach that was employed for arrhythmia detection is based on the concept of ensemble learning and the Taguchi method and has the advantage of being accurate and computationally efficient. The electrocardiography (ECG) data for arrhythmia detection was obtained from the MIT-BIH Arrhythmia Database (n=48). A novel feature, namely the heart rate variability calculated from 5s segments of ECG, which was not considered previously, was used. The novel classification approach and feature demonstrated arrhythmia detection accuracy of 89.13%. When the same data was classified using the conventional support vector machine (SVM), the obtained accuracy was 91.69%, 88.14%, and 88.74% for Gaussian, linear, and polynomial kernels, respectively. In terms of computation time, the proposed classifier was 5821.7 times faster than conventional SVM. In conclusion, the proposed classifier and feature showed performance comparable to those of previous studies, while the computational complexity and update interval were highly reduced.
Collapse
Affiliation(s)
- Yoon Jae Kim
- Interdisciplinary Program for Bioengineering, Graduate School, Seoul National University, Seoul 08826, Republic of Korea.
| | - Jeong Heo
- Interdisciplinary Program for Bioengineering, Graduate School, Seoul National University, Seoul 08826, Republic of Korea.
| | - Kwang Suk Park
- Department of Biomedical Engineering, Seoul National University College of Medicine, Seoul 03080, Republic of Korea; Institute of Medical and Biological Engineering, Medical Research Center, Seoul National University, Seoul 03080, Republic of Korea.
| | - Sungwan Kim
- Department of Biomedical Engineering, Seoul National University College of Medicine, Seoul 03080, Republic of Korea; Institute of Medical and Biological Engineering, Medical Research Center, Seoul National University, Seoul 03080, Republic of Korea.
| |
Collapse
|
15
|
Nath A, Subbiah K. Probing an optimal class distribution for enhancing prediction and feature characterization of plant virus-encoded RNA-silencing suppressors. 3 Biotech 2016; 6:93. [PMID: 28330163 PMCID: PMC4801844 DOI: 10.1007/s13205-016-0410-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2015] [Accepted: 03/03/2016] [Indexed: 10/28/2022] Open
Abstract
To counter the host RNA silencing defense mechanism, many plant viruses encode RNA silencing suppressor proteins. These groups of proteins share very low sequence and structural similarities among them, which consequently hamper their annotation using sequence similarity-based search methods. Alternatively the machine learning-based methods can become a suitable choice, but the optimal performance through machine learning-based methods is being affected by various factors such as class imbalance, incomplete learning, selection of inappropriate features, etc. In this paper, we have proposed a novel approach to deal with the class imbalance problem by finding the optimal class distribution for enhancing the prediction accuracy for the RNA silencing suppressors. The optimal class distribution was obtained using different resampling techniques with varying degrees of class distribution starting from natural distribution to ideal distribution, i.e., equal distribution. The experimental results support the fact that optimal class distribution plays an important role to achieve near perfect learning. The best prediction results are obtained with Sequential Minimal Optimization (SMO) learning algorithm. We could achieve a sensitivity of 98.5 %, specificity of 92.6 % with an overall accuracy of 95.3 % on a tenfold cross validation and is further validated using leave one out cross validation test. It was also observed that the machine learning models trained on oversampled training sets using synthetic minority oversampling technique (SMOTE) have relatively performed better than on both randomly undersampled and imbalanced training data sets. Further, we have characterized the important discriminatory sequence features of RNA-silencing suppressors which distinguish these groups of proteins from other protein families.
Collapse
|