1
Li H, Li X, Li Y, Gao G, Wen K, Li Z, Zhang Y, Xiong B. Exploration of Alloying Elements of High Specific Modulus Al-Li Alloy Based on Machine Learning. Materials (Basel, Switzerland) 2023; 17:92. [PMID: 38203946 PMCID: PMC10779854 DOI: 10.3390/ma17010092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Revised: 12/07/2023] [Accepted: 12/13/2023] [Indexed: 01/12/2024]
Abstract
In the aerospace sector, the development of lightweight aircraft heavily relies on the utilization of advanced aluminum-lithium alloys as primary structural materials. This study introduces an investigation aimed at optimizing the composition of an Al-2.32Li-1.44Cu-2.78Mg-0.3Ag-0.3Mn-0.1Zr alloy. The optimization process involves the selection of alloying elements through the application of machine learning techniques, with a focus on expected improvements in the specific modulus of these alloys. Expanding upon the optimization of the benchmark alloy's components, a more generalized modulus prediction model for Al-Li alloys was formulated. This model was then employed to evaluate the anticipated specific modulus of alloys within a virtual search space, encompassing substitutional elements. The study proceeded to validate six Al-Li alloys with a notably high potential for achieving an improved specific modulus. The results revealed that an alloy incorporating 0.96 wt.% of Ga as a substitutional element exhibited the most favorable microstructure. This alloy demonstrated optimal tensile strength (523 MPa) and specific modulus (31.531 GPa/(g·cm⁻³)), closely resembling that of the benchmark alloy. This research offers valuable insights into the application of compositional optimization to enhance the mechanical properties of Al-Li alloys. It emphasizes the significance of selecting alloying elements based on considerations such as their solid solubility thresholds and the expected enhancement of the specific modulus in Al-Li alloys.
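The specific modulus quoted above is the elastic modulus divided by density. A minimal sketch of that relation follows; the density value is a hypothetical placeholder chosen for illustration (the abstract reports neither the modulus nor the density separately), so the implied modulus is not a result from the paper.

```python
def specific_modulus(youngs_modulus_gpa: float, density_g_cm3: float) -> float:
    """Return the specific modulus in GPa/(g·cm^-3)."""
    if density_g_cm3 <= 0:
        raise ValueError("density must be positive")
    return youngs_modulus_gpa / density_g_cm3


# Reported in the abstract for the Ga-containing alloy.
reported_specific_modulus = 31.531  # GPa/(g·cm^-3)

# Hypothetical density, NOT taken from the paper.
assumed_density = 2.60  # g/cm^3

# Elastic modulus that would be implied under the assumed density.
implied_modulus = reported_specific_modulus * assumed_density

print(f"implied modulus under assumed density: {implied_modulus:.2f} GPa")
print(f"round trip: {specific_modulus(implied_modulus, assumed_density):.3f}")
```

Because the ratio rewards low density as much as high stiffness, a substitutional element can raise the modulus and still lower the specific modulus if it is heavy enough.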
Affiliation(s)
- Huiyu Li
  - State Key Laboratory of Nonferrous Metals and Processes, China GRINM Group Co., Ltd., Beijing 100088, China
  - GRIMAT Engineering Institute Co., Ltd., Beijing 101407, China
  - General Research Institute for Nonferrous Metals, Beijing 100088, China
- Xiwu Li
  - State Key Laboratory of Nonferrous Metals and Processes, China GRINM Group Co., Ltd., Beijing 100088, China
  - GRIMAT Engineering Institute Co., Ltd., Beijing 101407, China
  - General Research Institute for Nonferrous Metals, Beijing 100088, China
- Yanan Li
  - State Key Laboratory of Nonferrous Metals and Processes, China GRINM Group Co., Ltd., Beijing 100088, China
  - GRIMAT Engineering Institute Co., Ltd., Beijing 101407, China
  - General Research Institute for Nonferrous Metals, Beijing 100088, China
- Guanjun Gao
  - State Key Laboratory of Nonferrous Metals and Processes, China GRINM Group Co., Ltd., Beijing 100088, China
  - GRIMAT Engineering Institute Co., Ltd., Beijing 101407, China
  - General Research Institute for Nonferrous Metals, Beijing 100088, China
- Kai Wen
  - State Key Laboratory of Nonferrous Metals and Processes, China GRINM Group Co., Ltd., Beijing 100088, China
  - GRIMAT Engineering Institute Co., Ltd., Beijing 101407, China
  - General Research Institute for Nonferrous Metals, Beijing 100088, China
- Zhihui Li
  - State Key Laboratory of Nonferrous Metals and Processes, China GRINM Group Co., Ltd., Beijing 100088, China
  - General Research Institute for Nonferrous Metals, Beijing 100088, China
- Yongan Zhang
  - State Key Laboratory of Nonferrous Metals and Processes, China GRINM Group Co., Ltd., Beijing 100088, China
  - GRIMAT Engineering Institute Co., Ltd., Beijing 101407, China
  - General Research Institute for Nonferrous Metals, Beijing 100088, China
- Baiqing Xiong
  - State Key Laboratory of Nonferrous Metals and Processes, China GRINM Group Co., Ltd., Beijing 100088, China
  - General Research Institute for Nonferrous Metals, Beijing 100088, China
2
Peng C, Wu X, Yuan W, Zhang X, Zhang Y, Li Y. MGRFE: Multilayer Recursive Feature Elimination Based on an Embedded Genetic Algorithm for Cancer Classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:621-632. [PMID: 31180870 DOI: 10.1109/tcbb.2019.2921961] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Indexed: 06/09/2023]
Abstract
Microarray gene expression data have become a topic of great interest for cancer classification and for further research in the field of bioinformatics. Nonetheless, due to the "large p, small n" paradigm of limited biosamples and high-dimensional data, gene selection is becoming a demanding task, which is aimed at selecting a minimal number of discriminatory genes associated closely with a phenotype. Feature or gene selection is still a challenging problem owing to its nondeterministic polynomial time complexity and thus most of the existing feature selection algorithms utilize heuristic rules. A multilayer recursive feature elimination method based on an embedded integer-coded genetic algorithm, MGRFE, is proposed here, which is aimed at selecting the gene combination with minimal size and maximal information. On the basis of 19 benchmark microarray datasets including multiclass and imbalanced datasets, MGRFE outperforms state-of-the-art feature selection algorithms with better cancer classification accuracy and a smaller selected gene number. MGRFE could be regarded as a promising feature selection method for high-dimensional datasets especially gene expression data. Moreover, the genes selected by MGRFE have close biological relevance to cancer phenotypes. The source code of our proposed algorithm and all the 19 datasets used in this paper are available at https://github.com/Pengeace/MGRFE-GaRFE.
3
Guo P, Luo Y, Mai G, Zhang M, Wang G, Zhao M, Gao L, Li F, Zhou F. Gene expression profile based classification models of psoriasis. Genomics 2014; 103:48-55. [PMID: 24239985 DOI: 10.1016/j.ygeno.2013.11.001] [Citation(s) in RCA: 75] [Impact Index Per Article: 7.5] [Received: 03/25/2013] [Revised: 08/13/2013] [Accepted: 11/01/2013] [Indexed: 02/05/2023]
Abstract
Psoriasis is an autoimmune disease whose symptoms can significantly impair the patient's quality of life. It is mainly diagnosed through visual inspection of the lesional skin by experienced dermatologists. Currently no cure for psoriasis is available due to limited knowledge about its pathogenesis and development mechanisms. Previous studies have profiled hundreds of differentially expressed genes related to psoriasis, but no robust psoriasis prediction model is available. This study integrated the knowledge of three feature selection algorithms that revealed 21 features belonging to 18 genes as candidate markers. The final psoriasis classification model was established using the novel Incremental Feature Selection algorithm that utilizes only 3 features from 2 unique genes, IGFL1 and C10orf99. This model has demonstrated highly stable prediction accuracy (averaged at 99.81%) over three independent validation strategies. The two marker genes, IGFL1 and C10orf99, were revealed as the upstream components of the growth signal transduction pathway of psoriatic pathogenesis.
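The general idea of incremental feature selection can be sketched as follows: features are added one at a time in ranked order, each prefix is scored, and the shortest prefix with the best score is kept. This is a generic illustration, not the authors' algorithm. The nearest-centroid classifier, the leave-one-out scoring, the toy data, and all function names are assumptions made for the example.

```python
def loo_accuracy(rows, labels, feature_idx):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Assumes every class keeps at least one sample after holding one out.
    """
    correct = 0
    for held in range(len(rows)):
        centroids = {}
        for cls in set(labels):
            members = [rows[i] for i in range(len(rows))
                       if i != held and labels[i] == cls]
            centroids[cls] = [sum(r[j] for r in members) / len(members)
                              for j in feature_idx]
        point = [rows[held][j] for j in feature_idx]
        predicted = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == labels[held]
    return correct / len(rows)


def incremental_feature_selection(rows, labels, ranked_features):
    """Return the shortest ranked prefix with the best leave-one-out accuracy."""
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked_features) + 1):
        acc = loo_accuracy(rows, labels, ranked_features[:k])
        if acc > best_acc:  # strict '>' keeps the smaller subset on ties
            best_k, best_acc = k, acc
    return ranked_features[:best_k], best_acc


# Hypothetical toy data: feature 0 separates the classes, 1 and 2 are noise.
rows = [
    [0.1, 5.0, 1.0], [0.2, 1.0, 9.0], [0.0, 7.0, 3.0],
    [1.0, 6.0, 2.0], [0.9, 2.0, 8.0], [1.1, 4.0, 4.0],
]
labels = ["control", "control", "control", "case", "case", "case"]

selected, accuracy = incremental_feature_selection(rows, labels, [0, 1, 2])
print(selected, accuracy)
```

Breaking ties in favour of the shorter prefix is what drives such a procedure toward very small marker sets, like the 3 features reported in the abstract.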
Affiliation(s)
- Pi Guo
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China; Key Lab for Health Informatics, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China; Department of Public Health, Shantou University Medical College, No. 22 Xinling Road, Shantou, Guangdong 515041, PR China
- Youxi Luo
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China; Key Lab for Health Informatics, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China
- Guoqin Mai
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China; Key Lab for Health Informatics, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China
- Ming Zhang
  - Department of Epidemiology and Biostatistics, Faculty of Infectious Diseases, University of Georgia, Athens, GA 30605, USA; Institute of Bioinformatics, University of Georgia, Athens, GA 30605, USA
- Guoqing Wang
  - Department of Pathogeny Biology, Norman Bethune Medical College, Jilin University, Changchun, Jilin 130021, PR China
- Miaomiao Zhao
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China; Key Lab for Health Informatics, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China
- Liming Gao
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China; Key Lab for Health Informatics, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China
- Fan Li
  - Department of Pathogeny Biology, Norman Bethune Medical College, Jilin University, Changchun, Jilin 130021, PR China; Key Laboratory of Zoonosis Research, Ministry of Education, Jilin University, Changchun, Jilin 130021, PR China
- Fengfeng Zhou
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China; Key Lab for Health Informatics, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China
4
Chen Zhenxue, Chang Faliang, Liu Chunsheng. Chinese License Plate Recognition Based on Human Vision Attention Mechanism. Int J Pattern Recogn 2013. [DOI: 10.1142/s0218001413500249] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]
Abstract
License plate recognition (LPR) is one of the most important elements affecting intelligent transportation systems. A number of LPR techniques have been proposed. Humans are good target recognition systems. In other words, humans easily recognize common objects. In this paper, the researchers present a novel method of recognizing Chinese license plates. The method is based on the Human Vision Attention Mechanism (HVAM) and uses Chinese license plates as the targets. The research consists of three stages. The first stage involved finding and identifying license plates in videos of moving vehicles. The second stage separated each license plate into the seven characters. In the third stage, the character recognizer extracted some salient features of Chinese characters and used a multi-stage classifier to recognize each character on the license plate. In the experiment locating license plates, 1176 images taken from various scenes and conditions were employed. The method failed to identify the license plates in only 27 of the images, resulting in a license plate location success rate of 97.7%. In the experiment for identifying license characters, the 1149 images in which license plates had been successfully located were used. The method failed to identify the characters in 45 of these images, giving a success rate of 96.1%. Combining the above two rates, the overall rate of success for our LPR is 93.9%.
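The three success rates in this abstract follow directly from the reported image counts. A short worked check, using only the figures quoted above (the function and variable names are illustrative):

```python
def success_rate(total: int, failures: int) -> float:
    """Fraction of items processed without failure."""
    return (total - failures) / total


# Stage 1: plate location, 27 failures in 1176 images.
location = success_rate(1176, 27)

# Stage 2: character recognition on the 1149 located plates, 45 failures.
recognition = success_rate(1149, 45)

# The stages run in sequence, so the overall rate is their product,
# which equals 1104 fully recognized plates out of 1176 images.
overall = location * recognition

print(f"location:    {location:.1%}")
print(f"recognition: {recognition:.1%}")
print(f"overall:     {overall:.1%}")
```

The product form makes explicit that the second-stage rate is conditional on the first stage having succeeded.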
Affiliation(s)
- Zhenxue Chen
  - Shandong University, School of Control Science and Engineering, Jinan 250061, P. R. China
- Faliang Chang
  - Shandong University, School of Control Science and Engineering, Jinan 250061, P. R. China
- Chunsheng Liu
  - Shandong University, School of Control Science and Engineering, Jinan 250061, P. R. China
5
Wang J, Chen L, Wang Y, Zhang J, Liang Y, Xu D. A computational systems biology study for understanding salt tolerance mechanism in rice. PLoS One 2013; 8:e64929. [PMID: 23762267 PMCID: PMC3676415 DOI: 10.1371/journal.pone.0064929] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 01/09/2013] [Accepted: 04/19/2013] [Indexed: 01/22/2023]
Abstract
Salinity is one of the most common abiotic stresses in agriculture production. Salt tolerance of rice (Oryza sativa) is an important trait controlled by various genes. The mechanism of rice salt tolerance, currently with limited understanding, is of great interest to molecular breeding in improving grain yield. In this study, a gene regulatory network of rice salt tolerance is constructed using a systems biology approach with a number of novel computational methods. We developed an improved volcano plot method in conjunction with a new machine-learning method for gene selection based on gene expression data and applied the method to choose genes related to salt tolerance in rice. The results were then assessed by quantitative trait loci (QTL), co-expression and regulatory binding motif analysis. The selected genes were constructed into a number of network modules based on predicted protein interactions including modules of phosphorylation activity, ubiquity activity, and several proteinase activities such as peroxidase, aspartic proteinase, glucosyltransferase, and flavonol synthase. All of these discovered modules are related to the salt tolerance mechanism of signal transduction, ion pump, abscisic acid mediation, reactive oxygen species scavenging and ion sequestration. We also predicted the three-dimensional structures of some crucial proteins related to the salt tolerance QTL for understanding the roles of these proteins in the network. Our computational study sheds some new light on the mechanism of salt tolerance and provides a systems biology pipeline for studying plant traits in general.
Affiliation(s)
- Juexin Wang
  - College of Computer Science and Technology, Jilin University, Changchun, China
  - Digital Biology Laboratory, Computer Science Department, and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri, United States of America
- Liang Chen
  - College of Computer Science and Technology, Jilin University, Changchun, China
- Yan Wang
  - College of Computer Science and Technology, Jilin University, Changchun, China
- Jingfen Zhang
  - Digital Biology Laboratory, Computer Science Department, and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri, United States of America
- Yanchun Liang (corresponding author)
  - College of Computer Science and Technology, Jilin University, Changchun, China
- Dong Xu (corresponding author)
  - College of Computer Science and Technology, Jilin University, Changchun, China
  - Digital Biology Laboratory, Computer Science Department, and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri, United States of America
6
Li BK, Cong Y, Yang XG, Xue Y, Chen YZ. In silico prediction of spleen tyrosine kinase inhibitors using machine learning approaches and an optimized molecular descriptor subset generated by recursive feature elimination method. Comput Biol Med 2013; 43:395-404. [DOI: 10.1016/j.compbiomed.2013.01.015] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 06/24/2012] [Revised: 12/31/2012] [Accepted: 01/21/2013] [Indexed: 11/16/2022]
7
Song L, Langfelder P, Horvath S. Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics 2013; 14:5. [PMID: 23323760 PMCID: PMC3645958 DOI: 10.1186/1471-2105-14-5] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.9] [Received: 10/01/2012] [Accepted: 01/03/2013] [Indexed: 01/13/2023]
Abstract
BACKGROUND Ensemble predictors such as the random forest are known to have superior accuracy but their black-box predictions are difficult to interpret. In contrast, a generalized linear model (GLM) is very interpretable especially when forward feature selection is used to construct the model. However, forward feature selection tends to overfit the data and leads to low predictive accuracy. Therefore, it remains an important research goal to combine the advantages of ensemble predictors (high accuracy) with the advantages of forward regression modeling (interpretability). To address this goal several articles have explored GLM based ensemble predictors. Since limited evaluations suggested that these ensemble predictors were less accurate than alternative predictors, they have found little attention in the literature. RESULTS Comprehensive evaluations involving hundreds of genomic data sets, the UCI machine learning benchmark data, and simulations are used to give GLM based ensemble predictors a new and careful look. A novel bootstrap aggregated (bagged) GLM predictor that incorporates several elements of randomness and instability (random subspace method, optional interaction terms, forward variable selection) often outperforms a host of alternative prediction methods including random forests and penalized regression models (ridge regression, elastic net, lasso). This random generalized linear model (RGLM) predictor provides variable importance measures that can be used to define a "thinned" ensemble predictor (involving few features) that retains excellent predictive accuracy. CONCLUSION RGLM is a state of the art predictor that shares the advantages of a random forest (excellent predictive accuracy, feature importance measures, out-of-bag estimates of accuracy) with those of a forward selected generalized linear model (interpretability). These methods are implemented in the freely available R software package randomGLM.
Affiliation(s)
- Lin Song
  - Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, California, USA
8
Le Floch E, Guillemot V, Frouin V, Pinel P, Lalanne C, Trinchera L, Tenenhaus A, Moreno A, Zilbovicius M, Bourgeron T, Dehaene S, Thirion B, Poline JB, Duchesnay E. Significant correlation between a set of genetic polymorphisms and a functional brain network revealed by feature selection and sparse Partial Least Squares. Neuroimage 2012; 63:11-24. [PMID: 22781162 DOI: 10.1016/j.neuroimage.2012.06.061] [Citation(s) in RCA: 65] [Impact Index Per Article: 5.4] [Received: 12/23/2011] [Revised: 04/16/2012] [Accepted: 06/27/2012] [Indexed: 11/25/2022]
Abstract
Brain imaging is increasingly recognised as an intermediate phenotype to understand the complex path between genetics and behavioural or clinical phenotypes. In this context, a first goal is to propose methods to identify the part of genetic variability that explains some neuroimaging variability. Classical univariate approaches often ignore the potential joint effects that may exist between genes or the potential covariations between brain regions. In this paper, we propose instead to investigate an exploratory multivariate method in order to identify a set of Single Nucleotide Polymorphisms (SNPs) covarying with a set of neuroimaging phenotypes derived from functional Magnetic Resonance Imaging (fMRI). Recently, Partial Least Squares (PLS) regression or Canonical Correlation Analysis (CCA) have been proposed to analyse DNA and transcriptomics. Here, we propose to transpose this idea to the DNA vs. imaging context. However, in very high-dimensional settings like in imaging genetics studies, such multivariate methods may encounter overfitting issues. Thus we investigate the use of different strategies of regularisation and dimension reduction techniques combined with PLS or CCA to face the very high dimensionality of imaging genetics studies. We propose a comparison study of the different strategies on a simulated dataset first and then on a real dataset composed of 94 subjects, around 600,000 SNPs and 34 functional MRI lateralisation indexes computed from reading and speech comprehension contrast maps. We estimate the generalisability of the multivariate association with a cross-validation scheme and demonstrate the significance of this link, using a permutation procedure. Univariate selection appears to be necessary to reduce the dimensionality. However, the significant association uncovered by this two-step approach combining univariate filtering and L1-regularised PLS suggests that discovering meaningful genetic associations calls for a multivariate approach.
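The permutation procedure mentioned above can be illustrated with a much simpler statistic. The sketch below tests a Pearson correlation between two variables by shuffling one of them; it stands in for, and is not, the cross-validated multivariate PLS association used in the paper. The data, seed, and function names are assumptions made for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_permutations=999, seed=0):
    """Two-sided permutation p-value for the association between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real pairing between x and y
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed statistic as one permutation,
    # so the estimate can never be exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical data: y depends linearly on x with a small alternating offset.
x = [float(i) for i in range(20)]
y = [2.0 * v + (-1.0) ** i for i, v in enumerate(x)]

p_value = permutation_p_value(x, y)
print(f"permutation p-value: {p_value:.3f}")
```

In a setting like the one described, the whole pipeline, including any univariate filtering step, would need to be rerun inside each permutation so that the null distribution reflects the selection step as well.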
Affiliation(s)
- Edith Le Floch
  - Laboratoire de Neuroimagerie Assistée par Ordinateur, Neurospin Center, I2BM, DSV, CEA, Gif-sur-Yvette, France
9
10
11
Lv W, Xue Y. Prediction of acetylcholinesterase inhibitors and characterization of correlative molecular descriptors by machine learning methods. Eur J Med Chem 2010; 45:1167-72. [DOI: 10.1016/j.ejmech.2009.12.038] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 09/02/2009] [Revised: 12/15/2009] [Accepted: 12/17/2009] [Indexed: 11/28/2022]
12
Tang ZQ, Lin HH, Zhang HL, Han LY, Chen X, Chen YZ. Prediction of functional class of proteins and peptides irrespective of sequence homology by support vector machines. Bioinform Biol Insights 2009; 1:19-47. [PMID: 20066123 PMCID: PMC2789692 DOI: 10.4137/bbi.s315] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Various computational methods have been used for the prediction of protein and peptide function based on their sequences. A particular challenge is to derive functional properties from sequences that show low or no homology to proteins of known function. Recently, a machine learning method, support vector machines (SVM), has been explored for predicting the functional class of proteins and peptides from amino acid sequence-derived properties independent of sequence similarity, and has shown promising potential for a wide spectrum of protein and peptide classes including some of the low- and non-homologous proteins. This method can thus be explored as a potential tool to complement alignment-based, clustering-based, and structure-based methods for predicting protein function. This article reviews the strategies, current progress, and underlying difficulties in using SVM for predicting the functional class of proteins. The relevant software and web-servers are described. The reported prediction performances in the application of these methods are also presented.
Affiliation(s)
- Zhi Qun Tang
  - Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Hong Huang Lin
  - Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Hai Lei Zhang
  - Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Lian Yi Han
  - Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
- Xin Chen
  - Department of Biotechnology, Zhejiang University, Hang Zhou, Zhejiang Province, P. R. China, 310029
- Yu Zong Chen
  - Department of Pharmacy and Department of Computational Science, National University of Singapore, Republic of Singapore, 117543
  - Shanghai Center for Bioinformatics Technology, Shanghai, P. R. China, 201203
13
Yang XG, Chen D, Wang M, Xue Y, Chen YZ. Prediction of antibacterial compounds by machine learning approaches. J Comput Chem 2009; 30:1202-11. [DOI: 10.1002/jcc.21148] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Indexed: 02/02/2023]
14
Wang M, Yang XG, Xue Y. Identifying hERG Potassium Channel Inhibitors by Machine Learning Methods. QSAR Comb Sci 2008. [DOI: 10.1002/qsar.200810015] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022]
15
Shinoda K, Sugimoto M, Tomita M, Ishihama Y. Informatics for peptide retention properties in proteomic LC-MS. Proteomics 2008; 8:787-98. [PMID: 18214845 DOI: 10.1002/pmic.200700692] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.9] [Indexed: 11/06/2022]
Abstract
Retention times in HPLC yield valuable information for the identification of various analytes and the prediction of peptide retention is useful for the identification of peptides/proteins in LC-MS-based proteomics. Informatics methods such as artificial neural networks and support vector machines capable of solving nonlinear problems made possible the accurate modeling of quantitative structure-retention relationships of peptides (including large polymers) up to 5 kDa to which classical linear models cannot be applied, as well as the proteome-wide prediction of peptide retention. Proteome-wide retention prediction and accurate mass-information facilitate the identification of peptides in complex proteomic samples. In this review, we address recent developments in solid informatics methods and their application to peptide-retention properties in 'bottom-up' shotgun proteomics. We also describe future prospects for the standardization and application of retention times.
Affiliation(s)
- Kosaku Shinoda
  - Institute for Advanced Biosciences, Keio University, Tsuruoka, Yamagata, Japan
16
Paoli S, Jurman G, Albanese D, Merler S, Furlanello C. Integrating gene expression profiling and clinical data. Int J Approx Reason 2008. [DOI: 10.1016/j.ijar.2007.03.012] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
17
Li H, Yap CW, Ung CY, Xue Y, Li ZR, Han LY, Lin HH, Chen YZ. Machine learning approaches for predicting compounds that interact with therapeutic and ADMET related proteins. J Pharm Sci 2007; 96:2838-60. [PMID: 17786989 DOI: 10.1002/jps.20985] [Citation(s) in RCA: 46] [Impact Index Per Article: 2.7] [Indexed: 02/06/2023]
Abstract
Computational methods for predicting compounds with specific pharmacodynamic and ADMET (absorption, distribution, metabolism, excretion and toxicity) properties are useful for facilitating drug discovery and evaluation. Recently, machine learning methods such as neural networks and support vector machines have been explored for predicting inhibitors, antagonists, blockers, agonists, activators and substrates of proteins related to specific therapeutic and ADMET properties. These methods are particularly useful for compounds of diverse structures to complement QSAR methods, and for cases of unavailable receptor 3D structure to complement structure-based methods. A number of studies have demonstrated the potential of these methods for predicting such compounds as substrates of P-glycoprotein and cytochrome P450 CYP isoenzymes, inhibitors of protein kinases and CYP isoenzymes, and agonists of serotonin receptor and estrogen receptor. This article is intended to review the strategies, current progress and underlying difficulties in using machine learning methods for predicting these protein binders, and their potential as virtual screening tools. Algorithms for proper representation of the structural and physicochemical properties of compounds are also evaluated.
Affiliation(s)
- H Li
  - Bioinformatics and Drug Design Group, Department of Pharmacy and Department of Computational Science, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore 117543, Singapore
18
Lin HH, Han LY, Yap CW, Xue Y, Liu XH, Zhu F, Chen YZ. Prediction of factor Xa inhibitors by machine learning methods. J Mol Graph Model 2007; 26:505-18. [PMID: 17418603 DOI: 10.1016/j.jmgm.2007.03.003] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Received: 08/22/2006] [Revised: 02/04/2007] [Accepted: 03/07/2007] [Indexed: 01/04/2023]
Abstract
Factor Xa (FXa) inhibitors have been explored as anticoagulants for treatment and prevention of thrombotic diseases. Molecular docking, pharmacophore, quantitative structure-activity relationships, and support vector machines (SVM) have been used for computer prediction of FXa inhibitors. These methods achieve promising prediction accuracies of 69-80% for FXa inhibitors and 85-99% for non-inhibitors. Prediction performance, particularly for inhibitors, may be further improved by exploring methods applicable to a more diverse range of compounds and by using a more appropriate set of molecular descriptors. We tested the capability of several machine learning methods (C4.5 decision tree, k-nearest neighbor, probabilistic neural network, and support vector machine) by using a much more diverse set of 1098 compounds (360 inhibitors and 738 non-inhibitors) than those in other studies. A feature selection method was used for selecting molecular descriptors appropriate for distinguishing FXa inhibitors and non-inhibitors. The prediction accuracies of these methods are 89.1-97.5% for FXa inhibitors and 92.3-98.1% for non-inhibitors. In particular, compared to other studies, support vector machine gives a substantially improved accuracy of 94.6% for FXa non-inhibitors and maintains a comparable accuracy of 98.1% for inhibitors, based on a more rigorous test with a more diverse range of compounds. Our study suggests that machine learning methods such as SVM are useful for facilitating the prediction of FXa inhibitors.
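Reporting accuracy separately for inhibitors and non-inhibitors, as this abstract does, corresponds to computing sensitivity and specificity from a confusion matrix. A minimal sketch follows. The class sizes match the 360 inhibitors and 738 non-inhibitors quoted above, but the split into correct and incorrect predictions is hypothetical and does not reproduce any result from the paper.

```python
def per_class_accuracy(tp: int, fn: int, tn: int, fp: int):
    """Return (sensitivity, specificity, overall accuracy).

    tp/fn count inhibitors predicted correctly/incorrectly;
    tn/fp count non-inhibitors predicted correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)  # accuracy on inhibitors
    specificity = tn / (tn + fp)  # accuracy on non-inhibitors
    overall = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, overall


# Hypothetical prediction counts, NOT taken from the paper.
sens, spec, overall = per_class_accuracy(tp=340, fn=20, tn=720, fp=18)

print(f"inhibitor accuracy:     {sens:.1%}")
print(f"non-inhibitor accuracy: {spec:.1%}")
print(f"overall accuracy:       {overall:.1%}")
```

With roughly twice as many non-inhibitors as inhibitors, the overall figure is pulled toward the non-inhibitor accuracy, which is one reason to report the two classes separately.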
Affiliation(s)
- H H Lin
  - Bioinformatics and Drug Design Group, Department of Pharmacy, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
19
Han LY, Zheng CJ, Xie B, Jia J, Ma XH, Zhu F, Lin HH, Chen X, Chen YZ. Support vector machines approach for predicting druggable proteins: recent progress in its exploration and investigation of its usefulness. Drug Discov Today 2007; 12:304-13. [PMID: 17395090 DOI: 10.1016/j.drudis.2007.02.015] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.9] [Received: 09/26/2006] [Revised: 01/30/2007] [Accepted: 02/20/2007] [Indexed: 02/07/2023]
Abstract
Identification and validation of viable targets is an important first step in drug discovery, and new methods and integrated approaches are continuously explored to improve the discovery rate and exploration of new drug targets. An in silico machine learning method, support vector machines, has been explored as a new method for predicting druggable proteins from amino acid sequence independent of sequence similarity, thereby facilitating the prediction of druggable proteins that exhibit no or low homology to known targets.
Affiliation(s)
- Lian Yi Han
- Bioinformatics and Drug Design Group, Department of Pharmacy, National University of Singapore, Blk Soc 1, Level 7, 3 Science Drive 2, Singapore 117543
|
20
|
Quah KH, Quek C. MCES: A Novel Monte Carlo Evaluative Selection Approach for Objective Feature Selections. IEEE Trans Neural Netw 2007; 18:431-48. [PMID: 17385630 DOI: 10.1109/tnn.2006.887555] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Indexed: 11/06/2022]
Abstract
Most recent research efforts on feature selection have focused mainly on classification tasks due to their popularity in the data-mining community. However, feature selection research in nonlinear system estimation has been very limited. Hence, it is reasonable to devise a feature selection approach that is computationally efficient in the context of nonlinear system estimation. A novel feature selection approach, the Monte Carlo evaluative selection (MCES), is proposed in this paper. MCES is an objective sampling method that derives a better estimation of the relevancy measure. The algorithm is objectively designed to be applicable to both classification and nonlinear regressive tasks. The MCES method has been demonstrated to perform well with four sets of experiments, consisting of two classification and two regressive tasks. The results demonstrate that the MCES method has the following strong advantages: 1) ability to identify correlated and irrelevant features based on weight ranking, 2) application to both nonlinear system estimation and classification tasks, and 3) independence of the underlying induction algorithms used to derive the performance measures.
Affiliation(s)
- Kian Hong Quah
- Centre for Computational Intelligence, Nanyang Technological University, School of Computer Engineering, Singapore 639798, Singapore
|
21
|
Gene Selection Using Wilcoxon Rank Sum Test and Support Vector Machine for Cancer Classification. COMPUTATIONAL INTELLIGENCE AND SECURITY 2007. [DOI: 10.1007/978-3-540-74377-4_7] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.4] [Indexed: 12/23/2022]
|
22
|
Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006. [PMID: 16398926 DOI: 10.1186/1471-2105-7-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022] Open
Abstract
Background
Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.
Results
We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.
Conclusion
Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.
|
23
|
Abstract
BACKGROUND Recursive Feature Elimination is a common and well-studied method for reducing the number of attributes used for further analysis or development of prediction models. The effectiveness of the RFE algorithm is generally considered excellent, but the primary obstacle in using it is the amount of computational power required. RESULTS Here we introduce a variant of RFE which employs ideas from simulated annealing. The goal of the algorithm is to improve the computational performance of recursive feature elimination by eliminating chunks of features at a time with as little effect on the quality of the reduced feature set as possible. The algorithm has been tested on several large gene expression data sets. The RFE algorithm is implemented using a Support Vector Machine to assist in identifying the least useful gene(s) to eliminate. CONCLUSION The algorithm is simple and efficient and generates a set of attributes that is very similar to the set produced by RFE.
Affiliation(s)
- Yuanyuan Ding
- Computer & Information Science Department, The University of Mississippi, University, MS, USA
- Dawn Wilkins
- Computer & Information Science Department, The University of Mississippi, University, MS, USA
|
24
|
Han L, Cui J, Lin H, Ji Z, Cao Z, Li Y, Chen Y. Recent progresses in the application of machine learning approach for predicting protein functional class independent of sequence similarity. Proteomics 2006; 6:4023-37. [PMID: 16791826 DOI: 10.1002/pmic.200500938] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.8] [Indexed: 01/22/2023]
Abstract
A protein's sequence contains clues to its function. Functional prediction from sequence presents a challenge, particularly for proteins that have low or no sequence similarity to proteins of known function. Recently, machine learning methods have been explored for predicting the functional class of proteins from sequence-derived properties independent of sequence similarity, and they have shown promising potential for low- and non-homologous proteins. These methods can thus be explored as potential tools to complement alignment- and clustering-based methods for predicting protein function. This article reviews the strategies, current progress, and underlying difficulties in using machine learning methods for predicting the functional class of proteins. The relevant software and web-servers are described. The reported prediction performances of these methods are also presented; they need to be interpreted with caution because they depend on such factors as the datasets used and the choice of parameters.
Affiliation(s)
- Lianyi Han
- Department of Computational Science, National University of Singapore, Singapore, Singapore
|
25
|
Sangurdekar DP, Srienc F, Khodursky AB. A classification based framework for quantitative description of large-scale microarray data. Genome Biol 2006; 7:R32. [PMID: 16626502 PMCID: PMC1557986 DOI: 10.1186/gb-2006-7-4-r32] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.2] [Received: 11/11/2005] [Revised: 01/25/2006] [Accepted: 03/15/2006] [Indexed: 11/12/2022] Open
Abstract
A new classification-based framework is presented that allows quantitative description of microarray data in terms of significance of co-expression within any gene group and condition-specific gene class activity. Genome-wide surveys of transcription depend on gene classifications for the purpose of data interpretation. We propose a new information-theoretical-based method to: assess significance of co-expression within any gene group; quantitatively describe condition-specific gene-class activity; and systematically evaluate conditions in terms of gene-class activity. We applied this technique to describe microarray data tracking Escherichia coli transcriptional responses to more than 30 chemical and physiological perturbations. We correlated the nature and breadth of the responses with the nature of perturbation, identified gene group proxies for the perturbation classes and quantitatively compared closely related physiological conditions.
Affiliation(s)
- Dipen P Sangurdekar
- Department of Chemical Engineering and Materials Science, University of Minnesota, Saint Paul, MN 55108, USA
- Biotechnology Institute, University of Minnesota, Saint Paul, MN 55108, USA
- Friedrich Srienc
- Department of Chemical Engineering and Materials Science, University of Minnesota, Saint Paul, MN 55108, USA
- Biotechnology Institute, University of Minnesota, Saint Paul, MN 55108, USA
- Arkady B Khodursky
- Biotechnology Institute, University of Minnesota, Saint Paul, MN 55108, USA
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Saint Paul, MN 55108, USA
|
26
|
Merler S, Jurman G. Terminated Ramp-Support vector machines: a nonparametric data dependent kernel. Neural Netw 2006; 19:1597-611. [PMID: 16603338 DOI: 10.1016/j.neunet.2005.11.004] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 09/28/2004] [Accepted: 11/28/2005] [Indexed: 11/17/2022]
Abstract
We propose a novel algorithm, Terminated Ramp-Support Vector Machines (TR-SVM), for classification and feature ranking purposes in the family of Support Vector Machines. The main improvement is that the kernel is automatically determined by the training examples. It is built as a function of simple classifiers, generalized terminated ramp functions, obtained by separating oppositely labeled pairs of training points. The algorithm has a meaningful geometrical interpretation, and it is derived in the framework of Tikhonov regularization theory. Its only free parameter is the regularization parameter, representing a trade-off between empirical error and solution complexity. Employing the equivalence between the proposed algorithm and two-layer networks, a theoretical bound on the generalization error is also derived, together with the Vapnik-Chervonenkis dimension. Performance is tested on a number of synthetic and real data sets.
|
27
|
Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006; 7:3. [PMID: 16398926 PMCID: PMC1363357 DOI: 10.1186/1471-2105-7-3] [Citation(s) in RCA: 1114] [Impact Index Per Article: 61.9] [Received: 07/08/2005] [Accepted: 01/06/2006] [Indexed: 11/10/2022] Open
Affiliation(s)
- Ramón Díaz-Uriarte
- Bioinformatics Unit, Biotechnology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernandez Almagro 3, Madrid, 28029, Spain
- Sara Alvarez de Andrés
- Cytogenetics Unit, Biotechnology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernández Almagro 3, Madrid, 28029, Spain
|
28
|
Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006. [PMID: 16398926 DOI: 10.1186/1471-2105-7-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 12/25/2022] Open
Affiliation(s)
- Ramón Díaz-Uriarte
- Bioinformatics Unit, Biotechnology Programme, Spanish National Cancer Centre (CNIO), Melchor Fernandez Almagro 3, Madrid, 28029, Spain.
|
29
|
Li H, Yap CW, Xue Y, Li ZR, Ung CY, Han LY, Chen YZ. Statistical learning approach for predicting specific pharmacodynamic, pharmacokinetic, or toxicological properties of pharmaceutical agents. Drug Dev Res 2005. [DOI: 10.1002/ddr.20044] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 11/08/2022]
|
30
|
Mao Y, Zhou XB, Pi DY, Sun YX, Wong ST. Parameters selection in gene selection using Gaussian kernel support vector machines by genetic algorithm. J Zhejiang Univ Sci B 2005; 6:961-73. [PMID: 16187409 PMCID: PMC1390438 DOI: 10.1631/jzus.2005.b0961] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Indexed: 11/11/2022]
Abstract
In microarray-based cancer classification, gene selection is an important issue owing to the large number of variables, the small number of samples, and the non-linearity of the problem. It is difficult to get satisfactory results by using conventional linear statistical methods. Recursive feature elimination based on support vector machines (SVM RFE) is an effective algorithm for gene selection and cancer classification, which are integrated into a consistent framework. In this paper, we propose a new method for selecting the parameters of this algorithm when it is implemented with Gaussian kernel SVMs: a genetic algorithm searches for an optimal pair of parameters, as a better alternative to the common practice of selecting the apparently best parameters. Fast implementation issues for this method are also discussed for pragmatic reasons. The proposed method was tested on two representative datasets, hereditary breast cancer and acute leukaemia. The experimental results indicate that the proposed method performs well in selecting genes and achieves high classification accuracies with these genes.
Affiliation(s)
- Yong Mao
- National Laboratory of Industrial Control Technology, Institute of Modern Control Engineering, Zhejiang University, Hangzhou 310027, China
- Xiao-bo Zhou
- Harvard Center for Neurodegeneration and Repair, Harvard Medical School and Brigham and Women’s Hospital, Harvard Medical School, Harvard University, Boston, MA 02115, USA
- Dao-ying Pi
- National Laboratory of Industrial Control Technology, Institute of Modern Control Engineering, Zhejiang University, Hangzhou 310027, China
- You-xian Sun
- National Laboratory of Industrial Control Technology, Institute of Modern Control Engineering, Zhejiang University, Hangzhou 310027, China
- Stephen T.C. Wong
- Harvard Center for Neurodegeneration and Repair, Harvard Medical School and Brigham and Women’s Hospital, Harvard Medical School, Harvard University, Boston, MA 02115, USA
|
31
|
Xue Y, Li ZR, Yap CW, Sun LZ, Chen X, Chen YZ. Effect of molecular descriptor feature selection in support vector machine classification of pharmacokinetic and toxicological properties of chemical agents. J Chem Inf Comput Sci 2005; 44:1630-8. [PMID: 15446820 DOI: 10.1021/ci049869h] [Citation(s) in RCA: 116] [Impact Index Per Article: 6.1] [Indexed: 02/07/2023]
Abstract
Statistical-learning methods have been developed for facilitating the prediction of pharmacokinetic and toxicological properties of chemical agents. These methods employ a variety of molecular descriptors to characterize structural and physicochemical properties of molecules. Some of these descriptors are specifically designed for the study of a particular type of properties or agents, and their use for other properties or agents might generate noise and affect the prediction accuracy of a statistical learning system. This work examines to what extent the reduction of this noise can improve the prediction accuracy of a statistical learning system. A feature selection method, recursive feature elimination (RFE), is used to automatically select molecular descriptors for support vector machines (SVM) prediction of P-glycoprotein substrates (P-gp), human intestinal absorption of molecules (HIA), and agents that cause torsades de pointes (TdP), a rare but serious side effect. RFE significantly reduces the number of descriptors for each of these properties thereby increasing the computational speed for their classification. The SVM prediction accuracies of P-gp and HIA are substantially increased and that of TdP remains unchanged by RFE. These prediction accuracies are comparable to those of earlier studies derived from a selective set of descriptors. Our study suggests that molecular feature selection is useful for improving the speed and, in some cases, the accuracy of statistical learning methods for the prediction of pharmacokinetic and toxicological properties of chemical agents.
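As an illustration of the recursive feature elimination loop mentioned in this abstract, the following is a minimal sketch and not the authors' implementation. It assumes a plain least-squares linear scorer trained by gradient descent as a stand-in for the SVM, and it ranks descriptors by squared weight. The names `fit_linear_weights`, `rfe`, `X_demo`, and `y_demo` are hypothetical, and the toy data are invented for the example.

```python
def fit_linear_weights(X, y, epochs=200, lr=0.1):
    """Fit a least-squares linear scorer by gradient descent.

    Assumption: this simple model stands in for the linear SVM used in RFE.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(d):
                grad[j] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def rfe(X, y, n_keep):
    """Return the original indices of the n_keep features that survive."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        # Refit on the features that are still active.
        X_sub = [[row[j] for j in active] for row in X]
        w = fit_linear_weights(X_sub, y)
        # Eliminate the feature with the smallest squared weight.
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        del active[worst]
    return active


# Toy data (invented): feature 0 carries the class signal, the rest is noise.
X_demo = [
    [1.0, 0.2, -0.1],
    [0.9, -0.3, 0.2],
    [-1.0, 0.1, 0.1],
    [-0.8, -0.2, -0.2],
    [1.1, 0.0, 0.3],
    [-1.2, 0.3, -0.3],
]
y_demo = [1, 1, -1, -1, 1, -1]

print(rfe(X_demo, y_demo, n_keep=1))
```

Refitting after every elimination is what distinguishes RFE from a one-shot ranking: a descriptor's weight can change once a correlated descriptor has been removed.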
Affiliation(s)
- Y Xue
- Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543
|
32
|
Li H, Yap CW, Ung CY, Xue Y, Cao ZW, Chen YZ. Effect of Selection of Molecular Descriptors on the Prediction of Blood−Brain Barrier Penetrating and Nonpenetrating Agents by Statistical Learning Methods. J Chem Inf Model 2005; 45:1376-84. [PMID: 16180914 DOI: 10.1021/ci050135u] [Citation(s) in RCA: 112] [Impact Index Per Article: 5.9] [Indexed: 01/04/2023]
Abstract
The ability or inability of a drug to penetrate into the brain is a key consideration in drug design. Drugs for treating central nervous system (CNS) disorders need to be able to penetrate the blood-brain barrier (BBB). BBB nonpenetration is desirable for non-CNS-targeting drugs to minimize potential CNS-related side effects. Computational methods have been employed for the prediction of BBB-penetrating (BBB+) and -nonpenetrating (BBB-) agents at impressive accuracies of 75-92% and 60-80%, respectively. However, the majority of these studies give a substantially lower BBB- accuracy, and thus overall accuracy, than the BBB+ accuracy. This work examined whether proper selection of molecular descriptors can improve both the BBB- and the overall accuracies of statistical learning methods. The methods tested include logistic regression, linear discriminant analysis, k nearest neighbor, C4.5 decision tree, probabilistic neural network, and support vector machine. Molecular descriptors were selected by using a feature selection method, recursive feature elimination (RFE). Results obtained with 415 BBB+ and BBB- agents show that RFE substantially improves both the BBB- and the overall accuracy for all of the methods studied. This suggests that statistical learning methods combined with proper feature selection are potentially useful for facilitating a more balanced and improved prediction of BBB+ and BBB- agents.
Affiliation(s)
- Hu Li
- Bioinformatics and Drug Design Group, Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543
|
33
|
Furlanello C, Serafini M, Merler S, Jurman G. Semisupervised learning for molecular profiling. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2005; 2:110-8. [PMID: 17044176 DOI: 10.1109/tcbb.2005.28] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Indexed: 05/12/2023]
Abstract
Class prediction and feature selection are two learning tasks that are strictly paired in the search for molecular profiles from microarray data. Researchers have become aware of how easy it is to incur a selection bias effect, and complex validation setups are required to avoid overly optimistic estimates of the predictive accuracy of the models and incorrect gene selections. This paper describes a semisupervised pattern discovery approach that uses the by-products of complete validation studies on experimental setups for gene profiling. In particular, we introduce the study of the patterns of single sample responses (sample-tracking profiles) to the gene selection process induced by typical supervised learning tasks in microarray studies. We derive sample-tracking profiles as the aggregated off-training evaluation of SVM models of increasing gene panel sizes. Genes are ranked by E-RFE, an entropy-based variant of the recursive feature elimination for support vector machines (RFE-SVM). A Dynamic Time Warping (DTW) algorithm is then applied to define a metric between sample-tracking profiles. An unsupervised clustering based on the DTW metric allows automating the discovery of outliers and of subtypes of different molecular profiles. Applications are described on synthetic data and in two gene expression studies.
|
34
|
Xue Y, Yap CW, Sun LZ, Cao ZW, Wang JF, Chen YZ. Prediction of P-Glycoprotein Substrates by a Support Vector Machine Approach. J Chem Inf Comput Sci 2004; 44:1497-505. [PMID: 15272858 DOI: 10.1021/ci049971e] [Citation(s) in RCA: 107] [Impact Index Per Article: 5.4] [Indexed: 10/26/2022]
Abstract
P-glycoproteins (P-gp) actively transport a wide variety of chemicals out of cells and function as drug efflux pumps that mediate multidrug resistance and limit the efficacy of many drugs. Methods for the early elimination of potential P-gp substrates are useful for facilitating new drug discovery. A computational ensemble pharmacophore model has recently been used for the prediction of P-gp substrates with a promising accuracy of 63%. It is desirable to extend the prediction range beyond compounds covered by the known pharmacophore models. For this purpose, a machine learning method, support vector machine (SVM), was explored for the prediction of P-gp substrates. A set of 201 chemical compounds, including 116 substrates and 85 nonsubstrates of P-gp, was used to train and test an SVM classification system. This SVM system gave a prediction accuracy of at least 81.2% for P-gp substrates based on two different evaluation methods, a substantial improvement over that obtained from the multiple-pharmacophore model. The prediction accuracy for nonsubstrates of P-gp is 79.2% using 5-fold cross-validation. These accuracies are slightly better than those obtained from other statistical classification methods, including k-nearest neighbor (k-NN), probabilistic neural networks (PNN), and C4.5 decision tree, that use the same sets of data and molecular descriptors. Our study indicates the potential of SVM in facilitating the prediction of P-gp substrates.
Affiliation(s)
- Y Xue
- Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543
|
35
|
Furlanello C, Serafini M, Merler S, Jurman G. Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics 2003; 4:54. [PMID: 14604446 PMCID: PMC293475 DOI: 10.1186/1471-2105-4-54] [Citation(s) in RCA: 98] [Impact Index Per Article: 4.7] [Received: 04/16/2003] [Accepted: 11/06/2003] [Indexed: 11/10/2022] Open
Abstract
Background We describe the E-RFE method for gene ranking, which is useful for the identification of markers in the predictive classification of array data. The method supports a practical modeling scheme designed to avoid the construction of classification rules based on the selection of too small gene subsets (an effect known as the selection bias, in which the estimated predictive errors are too optimistic due to testing on samples already considered in the feature selection process). Results With E-RFE, we speed up the recursive feature elimination (RFE) with SVM classifiers by eliminating chunks of uninteresting genes using an entropy measure of the SVM weights distribution. An optimal subset of genes is selected according to a two-strata model evaluation procedure: modeling is replicated by an external stratified-partition resampling scheme, and, within each run, an internal K-fold cross-validation is used for E-RFE ranking. Also, the optimal number of genes can be estimated according to the saturation of Zipf's law profiles. Conclusions Without a decrease of classification accuracy, E-RFE allows a speed-up factor of 100 with respect to standard RFE, while improving on alternative parametric RFE reduction strategies. Thus, a process for gene selection and error estimation is made practical, ensuring control of the selection bias, and providing additional diagnostic indicators of gene importance.
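The idea of using an entropy measure of the weight distribution to decide when a whole chunk of features can be eliminated at once can be sketched as follows. This is a hedged illustration, not the published E-RFE procedure: the histogram binning, the threshold of half the maximum entropy, and the "drop the lowest bin" rule are all assumptions made for the example, and the function names `weight_entropy` and `features_to_drop` are hypothetical.

```python
import math


def weight_entropy(weights, n_bins=10):
    """Shannon entropy (in bits) of the histogram of normalized |weights|."""
    mags = [abs(w) for w in weights]
    lo, hi = min(mags), max(mags)
    if hi == lo:
        return 0.0  # all magnitudes identical: a single occupied bin
    counts = [0] * n_bins
    for m in mags:
        # Map each magnitude to [0, 1], then to a bin index.
        b = min(int((m - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[b] += 1
    total = len(mags)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def features_to_drop(weights, n_bins=10, entropy_threshold=None):
    """Choose the indices to eliminate in one step.

    Assumed rule: if the entropy is below the threshold, the weights are
    concentrated, so every feature in the lowest histogram bin is dropped;
    otherwise only the single smallest weight is dropped, as in standard RFE.
    """
    if entropy_threshold is None:
        entropy_threshold = 0.5 * math.log2(n_bins)
    mags = [abs(w) for w in weights]
    lo, hi = min(mags), max(mags)
    if hi > lo and weight_entropy(weights, n_bins) < entropy_threshold:
        cutoff = lo + (hi - lo) / n_bins
        chunk = [i for i, m in enumerate(mags) if m < cutoff]
        if len(chunk) < len(weights):
            return chunk
    return [min(range(len(mags)), key=mags.__getitem__)]


# Invented weights: eight near-zero features and two clearly informative ones.
concentrated = [0.01, 0.02, 0.0, 0.015, 0.005, 0.01, 0.02, 0.0, 1.0, 0.85]
print(features_to_drop(concentrated))
```

When most weights pile up near zero, one SVM fit is enough to discard many features, which is where the speed-up over one-at-a-time RFE would come from in a scheme of this kind.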
|