1
|
Ghosh D, Chakraborty S, Kodamana H, Chakraborty S. Application of machine learning in understanding plant virus pathogenesis: trends and perspectives on emergence, diagnosis, host-virus interplay and management. Virol J 2022; 19:42. [PMID: 35264189 PMCID: PMC8905280 DOI: 10.1186/s12985-022-01767-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 10/13/2021] [Accepted: 02/27/2022] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND Inclusion of high throughput technologies in the field of biology has generated massive amounts of data in recent years. Now, transforming these huge volumes of data into knowledge is the primary challenge in computational biology. The traditional methods of data analysis have failed to carry out the task. Hence, researchers are turning to machine learning based approaches for the analysis of high-dimensional big data. In machine learning, once a model is trained with a training dataset, it can be applied to an independent testing dataset. In current times, deep learning algorithms further promote the application of machine learning in several fields of biology including plant virology.
MAIN BODY Plant viruses have emerged as one of the principal global threats to food security due to their devastating impact on crops and vegetables. The emergence of new viral strains and species helps viruses to evade current preventive methods. According to a survey conducted in 2014, plant viruses are anticipated to cause a global yield loss of more than thirty billion USD per year. In order to design effective, durable and broad-spectrum management protocols, it is very important to understand the mechanistic details of viral pathogenesis. The application of machine learning enables precise diagnosis of plant viral diseases at an early stage. Furthermore, the development of several machine learning-guided bioinformatics platforms has primed plant virologists to understand the host-virus interplay better. In addition, machine learning has tremendous potential in deciphering the pattern of plant virus evolution and emergence as well as in developing viable control options.
CONCLUSIONS Considering the significant progress in the application of machine learning in understanding plant virology, this review provides an introductory note on machine learning and comprehensively discusses the trends and prospects of machine learning in the diagnosis of viral diseases, understanding host-virus interplay and emergence of plant viruses.
Affiliation(s)
- Dibyendu Ghosh
  - Molecular Virology Laboratory, School of Life Sciences, Jawaharlal Nehru University, New Delhi, 110067 India
- Srija Chakraborty
  - Department of Chemical Engineering, Indian Institute of Technology Delhi, New Delhi, 110016 India
- Hariprasad Kodamana
  - Department of Chemical Engineering, Indian Institute of Technology Delhi, New Delhi, 110016 India
  - School of Artificial Intelligence, Indian Institute of Technology Delhi, New Delhi, 110016 India
- Supriya Chakraborty
  - Molecular Virology Laboratory, School of Life Sciences, Jawaharlal Nehru University, New Delhi, 110067 India
|
2
|
Yakimovich A, Beaugnon A, Huang Y, Ozkirimli E. Labels in a haystack: Approaches beyond supervised learning in biomedical applications. Patterns (New York, N.Y.) 2021; 2:100383. [PMID: 34950904 PMCID: PMC8672145 DOI: 10.1016/j.patter.2021.100383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Recent advances in biomedical machine learning demonstrate great potential for data-driven techniques in health care and biomedical research. However, this potential has thus far been hampered by both the scarcity of annotated data in the biomedical domain and the diversity of the domain's subfields. While unsupervised learning is capable of finding unknown patterns in the data by design, supervised learning requires human annotation to achieve the desired performance through training. With the latter performing vastly better than the former, the need for annotated datasets is high, but they are costly and laborious to obtain. This review explores a family of approaches existing between the supervised and the unsupervised problem setting. The goal of these algorithms is to make more efficient use of the available labeled data. The advantages and limitations of each approach are addressed and perspectives are provided.
Affiliation(s)
- Artur Yakimovich
  - Roche Pharma International Informatics, Roche Products Limited, Welwyn Garden City, UK
- Anaël Beaugnon
  - Roche Pharma International Informatics, Roche, Boulogne-Billancourt, France
- Yi Huang
  - Roche Pharma International Informatics, Roche (China) Holding Ltd., Shanghai, China
- Elif Ozkirimli
  - Roche Pharma International Informatics, F. Hoffmann-La Roche AG, Kaiseraugst, Switzerland
|
3
|
Di Grazia L, Aminpour M, Vezzetti E, Rezania V, Marcolin F, Tuszynski JA. A new method for protein characterization and classification using geometrical features for 3D face analysis: An example of tubulin structures. Proteins 2020; 89:e25993. [PMID: 32779779 DOI: 10.1002/prot.25993] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/13/2020] [Revised: 07/22/2020] [Accepted: 07/26/2020] [Indexed: 11/12/2022]
Abstract
This article reports on the results of research aimed at translating biometric 3D face recognition concepts and algorithms into the field of protein biophysics in order to precisely and rapidly classify morphological features of protein surfaces. Both human faces and protein surfaces are free-forms, and some descriptors used in differential geometry can be used to describe them by applying the principles of feature extraction developed for computer vision and pattern recognition. The first part of this study focused on building the protein dataset using a simulation tool and performing feature extraction using novel geometrical descriptors. The second part tested the method on two examples: the first involved a classification of tubulin isotypes and the second compared tubulin with the FtsZ protein, which is its bacterial analog. An additional test involved several unrelated proteins. Different classification methodologies have been used: a classic approach with a support vector machine (SVM) classifier and an unsupervised learning approach with k-means. The best result was obtained with SVM and the radial basis function kernel. The results are significant and competitive with the state-of-the-art protein classification methods. This leads to a new methodological direction in protein structure analysis.
Affiliation(s)
- Maral Aminpour
  - Department of Physics, University of Alberta, Edmonton, Alberta, Canada
  - Department of Oncology, University of Alberta, Edmonton, Alberta, Canada
- Vahid Rezania
  - Department of Physical Sciences, MacEwan University, Edmonton, Alberta, Canada
- Jack Adam Tuszynski
  - DIGEP, Politecnico di Torino, Torino, Italy
  - Department of Physics, University of Alberta, Edmonton, Alberta, Canada
  - Department of Oncology, University of Alberta, Edmonton, Alberta, Canada
|
4
|
|
5
|
Breitman MF, Domingos FM, Bagley JC, Wiederhecker HC, Ferrari TB, Cavalcante VH, Pereira AC, Abreu TL, De-Lima AKS, Morais CJ, Prette ACD, Silva IP, Mello RD, Carvalho G, Lima TM, Silva AA, Matias CA, Carvalho GC, Pantoja JA, Monteiro Gomes I, Paschoaletto IP, Rodrigues GF, Talarico ÂNV, Barreto-Lima AF, Colli GR. A New Species of Enyalius (Squamata, Leiosauridae) Endemic to the Brazilian Cerrado. Herpetologica 2018. [DOI: 10.1655/0018-0831.355] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/16/2022]
Affiliation(s)
- Justin C. Bagley
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Tayná B. Ferrari
  - Campus I, Universidade Católica de Brasília, Águas Claras, DF 71966-700, Brazil
- André C. Pereira
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Tarcísio L.S. Abreu
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Carlos J.S. Morais
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Ana C.H. Del Prette
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Rodrigo De Mello
  - Campus I, Universidade Católica de Brasília, Águas Claras, DF 71966-700, Brazil
- Gabriela Carvalho
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Thiago M. De Lima
  - Campus I, Universidade Católica de Brasília, Águas Claras, DF 71966-700, Brazil
- Anandha A. Silva
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Gabriel C. Carvalho
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- João A.L. Pantoja
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Ângela V.C. Talarico
  - Campus I, Universidade Católica de Brasília, Águas Claras, DF 71966-700, Brazil
- Guarino R. Colli
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
|
6
|
Breitman MF, Domingos FM, Bagley JC, Wiederhecker HC, Ferrari TB, Cavalcante VH, Pereira AC, Abreu TL, De-Lima AKS, Morais CJ, del Prette AC, Silva IP, de Mello R, Carvalho G, de Lima TM, Silva AA, Matias CA, Carvalho GC, Pantoja JA, Gomes IM, Paschoaletto IP, Rodrigues GF, Talarico ÂV, Barreto-Lima AF, Colli GR. A New Species of Enyalius (Squamata, Leiosauridae) Endemic to the Brazilian Cerrado. Herpetologica 2018. [DOI: 10.1655/herpetologica-d-17-00041.1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/16/2022]
Affiliation(s)
- Fabricius M.C.B. Domingos
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
  - Instituto de Ciências Biológicas e da Saúde, Universidade Federal de Mato Grosso, Pontal do Araguaia, MT 78698-000, Brazil
- Justin C. Bagley
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
  - Departamento de Zoologia e Botânica, Universidade Estadual Paulista, São José do Rio Preto, SP 15054-000, Brazil
- Helga C. Wiederhecker
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
  - Campus I, Universidade Católica de Brasília, Águas Claras, DF 71966-700, Brazil
- Tayná B. Ferrari
  - Campus I, Universidade Católica de Brasília, Águas Claras, DF 71966-700, Brazil
- Vitor H.G.L. Cavalcante
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
  - Instituto Federal do Piauí, Teresina, PI 64000-040, Brazil
- André C. Pereira
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Tarcísio L.S. Abreu
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Carlos J.S. Morais
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Ana C.H. del Prette
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Rodrigo de Mello
  - Campus I, Universidade Católica de Brasília, Águas Claras, DF 71966-700, Brazil
- Gabriela Carvalho
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Thiago M. de Lima
  - Campus I, Universidade Católica de Brasília, Águas Claras, DF 71966-700, Brazil
- Anandha A. Silva
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Gabriel C. Carvalho
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- João A.L. Pantoja
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
- Guarino R. Colli
  - Departamento de Zoologia, Universidade de Brasília, Brasília, DF 70910-900, Brazil
|
7
|
|
8
|
Peikari M, Salama S, Nofech-Mozes S, Martel AL. A Cluster-then-label Semi-supervised Learning Approach for Pathology Image Classification. Sci Rep 2018; 8:7193. [PMID: 29739993 PMCID: PMC5940864 DOI: 10.1038/s41598-018-24876-0] [Citation(s) in RCA: 45] [Impact Index Per Article: 6.4] [Received: 11/15/2017] [Accepted: 04/11/2018] [Indexed: 01/25/2023] Open
Abstract
Completely labeled pathology datasets are often challenging and time-consuming to obtain. Semi-supervised learning (SSL) methods are able to learn from fewer labeled data points with the help of a large number of unlabeled data points. In this paper, we investigated the possibility of using clustering analysis to identify the underlying structure of the data space for SSL. A cluster-then-label method was proposed to identify high-density regions in the data space which were then used to help a supervised SVM in finding the decision boundary. We have compared our method with other supervised and semi-supervised state-of-the-art techniques using two different classification tasks applied to breast pathology datasets. We found that compared with other state-of-the-art supervised and semi-supervised methods, our SSL method is able to improve classification performance when a limited number of labeled data instances are made available. We also showed that it is important to examine the underlying distribution of the data space before applying SSL techniques to ensure semi-supervised learning assumptions are not violated by the data.
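The cluster-then-label idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract describes using the clusters to guide a supervised SVM, whereas this sketch simply clusters all points with plain k-means and gives each unlabeled point the majority label of its cluster. The function names, the toy 2D points, and the "benign"/"malignant" class names are all hypothetical.

```python
from collections import Counter

def kmeans(points, centers, iters=20):
    """Plain Lloyd's k-means on tuples; `centers` are the initial centroids."""
    assign = []
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        assign = [
            min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Recompute each centroid as the mean of its members.
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's centroid
        centers = new_centers
    return assign

def cluster_then_label(points, labels, init_centers):
    """labels[i] is a class name or None (unlabeled). Returns a filled-in label list."""
    assign = kmeans(points, list(init_centers))
    out = list(labels)
    for c in set(assign):
        votes = Counter(l for l, a in zip(labels, assign) if a == c and l is not None)
        if not votes:
            continue  # no labeled point in this cluster: its members stay unlabeled
        majority = votes.most_common(1)[0][0]
        for i, a in enumerate(assign):
            if a == c and out[i] is None:
                out[i] = majority
    return out

# Hypothetical toy data: two well-separated groups, one labeled point in each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels = ["benign", None, None, "malignant", None, None]
filled = cluster_then_label(points, labels, init_centers=[(0.0, 0.0), (5.0, 5.0)])
```

The sketch only works when clusters line up with classes, which is the assumption the abstract warns should be checked against the data before applying semi-supervised learning.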
Affiliation(s)
- Sherine Salama
  - Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada
- Sharon Nofech-Mozes
  - Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada
- Anne L Martel
  - Medical Biophysics, University of Toronto, Toronto, Canada
  - Physical Sciences, Sunnybrook Research Institute, Toronto, Canada
|
9
|
Liu F, Ma R, Tay CYA, Octavia S, Lan R, Chung HKL, Riordan SM, Grimm MC, Leong RW, Tanaka MM, Connor S, Zhang L. Genomic analysis of oral Campylobacter concisus strains identified a potential bacterial molecular marker associated with active Crohn's disease. Emerg Microbes Infect 2018; 7:64. [PMID: 29636463 PMCID: PMC5893538 DOI: 10.1038/s41426-018-0065-6] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 01/04/2018] [Revised: 03/14/2018] [Accepted: 03/20/2018] [Indexed: 02/08/2023]
Abstract
Campylobacter concisus is an oral bacterium that is associated with inflammatory bowel disease (IBD) including Crohn's disease (CD) and ulcerative colitis (UC). C. concisus consists of two genomospecies (GS) and diverse strains. This study aimed to identify molecular markers to differentiate commensal and IBD-associated C. concisus strains. The genomes of 63 oral C. concisus strains isolated from patients with IBD and healthy controls were examined, of which 38 genomes were sequenced in this study. We identified a novel secreted enterotoxin B homologue, Csep1. The csep1 gene was found in 56% of GS2 C. concisus strains, presented in the plasmid pICON or the chromosome. A six-nucleotide insertion at the position 654-659 bp in csep1 (csep1-6bpi) was found. The presence of csep1-6bpi in oral C. concisus strains isolated from patients with active CD (47%, 7/15) was significantly higher than that in strains from healthy controls (0/29, P = 0.0002), and the prevalence of csep1-6bpi positive C. concisus strains was significantly higher in patients with active CD (67%, 4/6) as compared to healthy controls (0/23, P = 0.0006). Proteomics analysis detected the Csep1 protein. A csep1 gene hot spot in the chromosome of different C. concisus strains was found. The pICON plasmid was only found in GS2 strains isolated from the two relapsed CD patients with small bowel complications. This study reports a C. concisus molecular marker (csep1-6bpi) that is associated with active CD.
Affiliation(s)
- Fang Liu
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia
- Rena Ma
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia
- Chin Yen Alfred Tay
  - Helicobacter Research Laboratory, Marshall Centre for Infectious Diseases Research and Training, School of Pathology and Laboratory Medicine, University of Western Australia, Perth, WA, Australia
- Sophie Octavia
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia
- Ruiting Lan
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia
- Heung Kit Leslie Chung
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia
- Stephen M Riordan
  - Gastrointestinal and Liver Unit, Prince of Wales Hospital, University of New South Wales, Sydney, NSW, Australia
- Michael C Grimm
  - St George and Sutherland Clinical School, University of New South Wales, Sydney, NSW, Australia
- Rupert W Leong
  - Concord Hospital, University of New South Wales, Sydney, NSW, Australia
- Mark M Tanaka
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia
- Susan Connor
  - Liverpool Hospital, University of New South Wales, Sydney, NSW, Australia
- Li Zhang
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, Australia
|
10
|
Leveraging Big Data Tools and Technologies: Addressing the Challenges of the Water Quality Sector. Sustainability 2017. [DOI: 10.3390/su9122160] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Indexed: 11/16/2022]
|
11
|
Deepthi P, Thampi SM. Predicting cancer subtypes from microarray data using semi-supervised fuzzy C-means algorithm. Journal of Intelligent & Fuzzy Systems 2017. [DOI: 10.3233/jifs-169222] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Affiliation(s)
- P.S. Deepthi
  - LBS Centre for Science and Technology, Trivandrum, Kerala, India
  - School of CS and IT, Indian Institute of Information Technology and Management – Kerala, Trivandrum, Kerala, India
- Sabu M. Thampi
  - School of CS and IT, Indian Institute of Information Technology and Management – Kerala, Trivandrum, Kerala, India
|
12
|
K. K, P. G. L, Rangarajan L, K. AK. Effective Feature Selection for Classification of Promoter Sequences. PLoS One 2016; 11:e0167165. [PMID: 27978541 PMCID: PMC5158321 DOI: 10.1371/journal.pone.0167165] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/13/2016] [Accepted: 11/09/2016] [Indexed: 11/18/2022] Open
Abstract
Exploring novel computational methods in making sense of biological data has not only been a necessity, but also productive. A part of this trend is the search for more efficient in silico methods/tools for analysis of promoters, which are parts of DNA sequences that are involved in regulation of expression of genes into other functional molecules. Promoter regions vary greatly in their function based on the sequence of nucleotides and the arrangement of protein-binding short-regions called motifs. In fact, the regulatory nature of the promoters seems to be largely driven by the selective presence and/or the arrangement of these motifs. Here, we explore computational classification of promoter sequences based on the pattern of motif distributions, as such classification can pave a new way of functional analysis of promoters and to discover the functionally crucial motifs. We make use of Position Specific Motif Matrix (PSMM) features for exploring the possibility of accurately classifying promoter sequences using some of the popular classification techniques. The classification results on the complete feature set are low, perhaps due to the huge number of features. We propose two ways of reducing features. Our test results show improvement in the classification output after the reduction of features. The results also show that decision trees outperform SVM (Support Vector Machine), KNN (K Nearest Neighbor) and ensemble classifier LibD3C, particularly with reduced features. The proposed feature selection methods outperform some of the popular feature transformation methods such as PCA and SVD. Also, the methods proposed are as accurate as MRMR (feature selection method) but much faster than MRMR. Such methods could be useful to categorize new promoters and explore regulatory mechanisms of gene expressions in complex eukaryotic species.
Affiliation(s)
- Kouser K.
  - DoS in Computer Science, Mysore, India
- Acharya Kshitish K.
  - Institute of Bioinformatics and Applied Biotechnology (IBAB), Biotech Park, Electronic City, Bengaluru (Bangalore), Karnataka state, India
  - Shodhaka Life Sciences Pvt. Ltd., IBAB, Biotech Park, Bengaluru (Bangalore), Karnataka state, India
|
13
|
Hanif M, Hafeez A, Suleman Y, Mustafa Rafique M, Butt AR, Iqbal SM. An accelerated framework for the classification of biological targets from solid-state micropore data. Computer Methods and Programs in Biomedicine 2016; 134:53-67. [PMID: 27480732 DOI: 10.1016/j.cmpb.2016.06.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2015] [Revised: 05/05/2016] [Accepted: 06/13/2016] [Indexed: 06/06/2023]
Abstract
Micro- and nanoscale systems have provided means to detect biological targets, such as DNA, proteins, and human cells, at ultrahigh sensitivity. However, these devices suffer from noise in the raw data, which continues to be significant as newer and more sensitive devices produce an increasing amount of data that needs to be analyzed. An important dimension that is often discounted in these systems is the ability to quickly process the measured data for instant feedback. Realizing and developing algorithms for the accurate detection and classification of biological targets in real time is vital. Toward this end, we describe a supervised machine-learning approach that records single cell events (pulses), computes useful pulse features, and classifies future patterns into their respective types, such as cancerous/non-cancerous cells, based on the training data. The approach detects cells with an accuracy of 70% from the raw data, followed by an accurate classification when larger training sets are employed. The parallel implementation of the algorithm on a graphics processing unit (GPU) demonstrates a three- to fourfold speedup compared to a serial implementation on an Intel Core i7 processor. This incredibly efficient GPU system is an effort to streamline the analysis of pulse data in an academic setting. This paper presents, for the first time ever, a non-commercial technique using a GPU system for real-time analysis, paired with biological cluster targeting analysis.
Affiliation(s)
- Madiha Hanif
  - Nano-Bio Lab, University of Texas at Arlington, Arlington, TX 76019
  - Department of Bioengineering, University of Texas at Arlington, Arlington, TX 76019
  - Nanotechnology Research Center, University of Texas at Arlington, Arlington, TX 76019
- Abdul Hafeez
  - Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA 24060
- Yusuf Suleman
  - Nano-Bio Lab, University of Texas at Arlington, Arlington, TX 76019
  - Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019
- Ali R Butt
  - Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA 24060
- Samir M Iqbal
  - Nano-Bio Lab, University of Texas at Arlington, Arlington, TX 76019
  - Department of Bioengineering, University of Texas at Arlington, Arlington, TX 76019
  - Nanotechnology Research Center, University of Texas at Arlington, Arlington, TX 76019
  - Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX 76019
  - Department of Urology, University of Texas Southwestern Medical Center at Dallas, Dallas, TX 75390, USA
|
14
|
Koyano H, Hayashida M, Akutsu T. Maximum margin classifier working in a set of strings. Proc Math Phys Eng Sci 2016; 472:20150551. [PMID: 27118908 PMCID: PMC4841474 DOI: 10.1098/rspa.2015.0551] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/10/2015] [Accepted: 02/02/2016] [Indexed: 11/12/2022] Open
Abstract
Numbers and numerical vectors account for a large portion of data. However, recently, the amount of string data generated has increased dramatically. Consequently, classifying string data is a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and subsequently apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion involves a loss of information and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. In this study, we approach this classification problem by constructing a classifier that works in a set of strings. To evaluate the generalization error of such a classifier theoretically, probability theory for strings is required. Therefore, we first extend a limit theorem for a consensus sequence of strings demonstrated by one of the authors and co-workers in a previous study. Using the obtained result, we then demonstrate that our learning machine classifies strings in an asymptotically optimal manner. Furthermore, we demonstrate the usefulness of our machine in practical data analysis by applying it to predicting protein-protein interactions using amino acid sequences and classifying RNAs by the secondary structure using nucleotide sequences.
Affiliation(s)
- Hitoshi Koyano
  - Laboratory of Biostatistics and Bioinformatics, Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin, Sakyo-ku, Kyoto 606-8507, Japan
- Morihiro Hayashida
  - Laboratory of Mathematical Bioinformatics, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
- Tatsuya Akutsu
  - Laboratory of Mathematical Bioinformatics, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
|
15
|
Stanescu A, Caragea D. An empirical study of ensemble-based semi-supervised learning approaches for imbalanced splice site datasets. BMC Systems Biology 2015; 9 Suppl 5:S1. [PMID: 26356316 PMCID: PMC4565116 DOI: 10.1186/1752-0509-9-s5-s1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022]
Abstract
BACKGROUND Recent biochemical advances have led to inexpensive, time-efficient production of massive volumes of raw genomic data. Traditional machine learning approaches to genome annotation typically rely on large amounts of labeled data. The process of labeling data can be expensive, as it requires domain knowledge and expert involvement. Semi-supervised learning approaches that can make use of unlabeled data, in addition to small amounts of labeled data, can help reduce the costs associated with labeling. In this context, we focus on the problem of predicting splice sites in a genome using semi-supervised learning approaches. This is a challenging problem, due to the highly imbalanced distribution of the data, i.e., small number of splice sites as compared to the number of non-splice sites. To address this challenge, we propose to use ensembles of semi-supervised classifiers, specifically self-training and co-training classifiers. RESULTS Our experiments on five highly imbalanced splice site datasets, with positive to negative ratios of 1-to-99, showed that the ensemble-based semi-supervised approaches represent a good choice, even when the amount of labeled data consists of less than 1% of all training data. In particular, we found that ensembles of co-training and self-training classifiers that dynamically balance the set of labeled instances during the semi-supervised iterations show improvements over the corresponding supervised ensemble baselines. CONCLUSIONS In the presence of limited amounts of labeled data, ensemble-based semi-supervised approaches can successfully leverage the unlabeled data to enhance supervised ensembles learned from highly imbalanced data distributions. Given that such distributions are common for many biological sequence classification problems, our work can be seen as a stepping stone towards more sophisticated ensemble-based approaches to biological sequence annotation in a semi-supervised framework.
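As a rough illustration of the self-training loop with balanced additions mentioned in this abstract, the sketch below uses a single nearest-centroid classifier as a stand-in base learner. This is an assumption made for brevity: the study itself evaluates ensembles of self-training and co-training classifiers, which are not reproduced here. The function names, the one-dimensional toy features, and the "site"/"non-site" labels are hypothetical.

```python
def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def self_train(labeled, unlabeled, per_class=1, rounds=10):
    """labeled: list of (features, label) with at least two classes;
    unlabeled: list of features.
    Each round, a nearest-centroid classifier pseudo-labels the pool and the
    `per_class` most confident predictions for EACH class are moved into the
    labeled set, so the labeled set grows in a class-balanced way."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        classes = sorted({y for _, y in labeled})
        cents = {c: centroid([x for x, y in labeled if y == c]) for c in classes}
        scored = []
        for x in pool:
            ranked = sorted(classes, key=lambda c: dist2(x, cents[c]))
            # Confidence = gap between the second-nearest and nearest centroid.
            margin = dist2(x, cents[ranked[1]]) - dist2(x, cents[ranked[0]])
            scored.append((margin, ranked[0], x))
        picked = []
        for c in classes:
            candidates = [s for s in scored if s[1] == c]
            picked.extend(sorted(candidates, key=lambda s: -s[0])[:per_class])
        if not picked:
            break
        for _, c, x in picked:
            labeled.append((x, c))
            pool.remove(x)
    return labeled

# Hypothetical toy data: one labeled example per class, four unlabeled points.
seed = [([0.0], "non-site"), ([10.0], "site")]
pool = [[1.0], [2.0], [9.0], [0.5]]
result = self_train(seed, pool)
```

Capping additions per class is one simple way to keep the majority class from flooding the labeled set; a real setup would wrap a stronger base learner in this loop and combine several such learners.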
Affiliation(s)
- Ana Stanescu
  - Department of Computing and Information Sciences, Kansas State University, Nichols Hall, Manhattan, KS, 66506, USA
- Doina Caragea
  - Department of Computing and Information Sciences, Kansas State University, Nichols Hall, Manhattan, KS, 66506, USA
|
16
|
Dai HL. Imbalanced Protein Data Classification Using Ensemble FTM-SVM. IEEE Trans Nanobioscience 2015; 14:350-359. [DOI: 10.1109/tnb.2015.2431292] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
|
17
|
|
18
|
Brayet J, Zehraoui F, Jeanson-Leh L, Israeli D, Tahi F. Towards a piRNA prediction using multiple kernel fusion and support vector machine. Bioinformatics 2015; 30:i364-70. [PMID: 25161221 PMCID: PMC4147894 DOI: 10.1093/bioinformatics/btu441] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Indexed: 12/30/2022] Open
Abstract
Motivation: Piwi-interacting RNA (piRNA) is the most recently discovered and the least investigated class of Argonaute/Piwi protein-interacting small non-coding RNAs. The piRNAs are mostly known to be involved in protecting the genome from invasive transposable elements. But recent discoveries suggest their involvement in the pathophysiology of diseases, such as cancer. Their identification is therefore an important task, and computational methods are needed. However, the lack of conserved piRNA sequences and structural elements makes this identification challenging and difficult. Results: In the present study, we propose a new modular and extensible machine learning method based on multiple kernels and a support vector machine (SVM) classifier for piRNA identification. Very few piRNA features are known to date. The use of a multiple kernels approach allows editing, adding or removing piRNA features that can be heterogeneous in a modular manner according to their relevance in a given species. Our algorithm is based on a combination of the previously identified features [sequence features (k-mer motifs and a uridine at the first position) and piRNAs cluster feature] and a new telomere/centromere vicinity feature. These features are heterogeneous, and the kernels make it possible to unify their representation. The proposed algorithm, named piRPred, gives promising results on Drosophila and Human data and outscores previously published piRNA identification algorithms. Availability and implementation: piRPred is freely available to non-commercial users on our Web server EvryRNA http://EvryRNA.ibisc.univ-evry.fr Contact: tahi@ibisc.univ-evry.fr
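The way kernels can unify heterogeneous features may be easier to see in a small sketch. It assumes two of the feature types named in the abstract (k-mer motifs and a uridine at the first position), gives each its own kernel, and fuses them with a fixed weighted sum, which is itself a valid kernel when the weights are non-negative. The kernel definitions, weights, and toy sequences are hypothetical and are not taken from piRPred, which may define and weight its kernels differently.

```python
from collections import Counter

def kmer_counts(seq, k=2):
    """Count every overlapping k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(a, b, k=2):
    """Dot product of the two k-mer count vectors (the k-spectrum kernel)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(ca[m] * cb[m] for m in ca)

def first_u_kernel(a, b):
    """1.0 only when both sequences start with uridine, else 0.0."""
    return 1.0 if a[:1] == "U" and b[:1] == "U" else 0.0

def fused_gram(seqs, w_kmer=1.0, w_first_u=0.5):
    """Gram matrix of the weighted sum of both kernels (weights are illustrative)."""
    return [
        [w_kmer * spectrum_kernel(a, b) + w_first_u * first_u_kernel(a, b) for b in seqs]
        for a in seqs
    ]

# Hypothetical toy RNA fragments.
seqs = ["UAGC", "UAGG", "CCGA"]
gram = fused_gram(seqs)
```

A Gram matrix like this could be passed to any SVM implementation that accepts precomputed kernels; adding or dropping a feature then amounts to adding or dropping one term of the sum, which is the modularity the abstract highlights.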
Affiliation(s)
- Jocelyn Brayet
  - IBISC EA 4526, UEVE/Genopole, IBGBI, 23 bv. de France, 91000 Evry, France and Genethon, 1, bis rue de l'Internationale, 91002 Evry Cedex, France
- Farida Zehraoui
  - IBISC EA 4526, UEVE/Genopole, IBGBI, 23 bv. de France, 91000 Evry, France and Genethon, 1, bis rue de l'Internationale, 91002 Evry Cedex, France
- Laurence Jeanson-Leh
  - IBISC EA 4526, UEVE/Genopole, IBGBI, 23 bv. de France, 91000 Evry, France and Genethon, 1, bis rue de l'Internationale, 91002 Evry Cedex, France
- David Israeli
  - IBISC EA 4526, UEVE/Genopole, IBGBI, 23 bv. de France, 91000 Evry, France and Genethon, 1, bis rue de l'Internationale, 91002 Evry Cedex, France
- Fariza Tahi
  - IBISC EA 4526, UEVE/Genopole, IBGBI, 23 bv. de France, 91000 Evry, France and Genethon, 1, bis rue de l'Internationale, 91002 Evry Cedex, France
19
Yu G, Rangwala H, Domeniconi C, Zhang G, Zhang Z. Predicting Protein Function Using Multiple Kernels. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:219-233. [PMID: 26357091 DOI: 10.1109/tcbb.2014.2351821] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
High-throughput experimental techniques provide a wide variety of heterogeneous proteomic data sources. To exploit the information spread across multiple sources for protein function prediction, these data sources are transformed into kernels and then integrated into a composite kernel. Several methods first optimize the weights on these kernels to produce a composite kernel, and then train a classifier on the composite kernel. As such, these approaches result in an optimal composite kernel, but not necessarily in an optimal classifier. On the other hand, some approaches optimize the loss of binary classifiers and learn weights for the different kernels iteratively. For multi-class or multi-label data, these methods have to solve the problem of optimizing weights on these kernels for each of the labels, which are computationally expensive and ignore the correlation among labels. In this paper, we propose a method called Predicting Protein Function using Multiple Kernels (ProMK). ProMK iteratively optimizes the phases of learning optimal weights and reduces the empirical loss of multi-label classifier for each of the labels simultaneously. ProMK can integrate kernels selectively and downgrade the weights on noisy kernels. We investigate the performance of ProMK on several publicly available protein function prediction benchmarks and synthetic datasets. We show that the proposed approach performs better than previously proposed protein function prediction approaches that integrate multiple data sources and multi-label multiple kernel learning methods. The codes of our proposed method are available at https://sites.google.com/site/guoxian85/promk.
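To make the idea of a weighted composite kernel concrete, here is a minimal sketch that down-weights an uninformative kernel. It does not reproduce ProMK's iterative multi-label optimization; instead it swaps in a simple kernel-target alignment heuristic on a binary toy problem. All names, matrices and labels are assumptions made for the illustration.

```python
def frobenius_inner(a, b):
    """Sum of element-wise products of two equally sized square matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def alignment(kernel, labels):
    """Kernel-target alignment between a Gram matrix and the label matrix y y^T."""
    target = [[yi * yj for yj in labels] for yi in labels]
    numerator = frobenius_inner(kernel, target)
    denominator = (frobenius_inner(kernel, kernel) * frobenius_inner(target, target)) ** 0.5
    return numerator / denominator


def composite_kernel(kernels, labels):
    """Weight each kernel by its non-negative alignment, then sum the weighted kernels."""
    scores = [max(alignment(k, labels), 0.0) for k in kernels]
    total = sum(scores)
    # Fall back to uniform weights if no kernel aligns with the labels.
    weights = [s / total for s in scores] if total > 0 else [1.0 / len(kernels)] * len(kernels)
    n = len(labels)
    combined = [[sum(w * k[i][j] for w, k in zip(weights, kernels)) for j in range(n)]
                for i in range(n)]
    return weights, combined


labels = [1, 1, -1, -1]  # toy binary labels
informative = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
uninformative = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
weights, combined = composite_kernel([informative, uninformative], labels)
```

In this toy case the block-structured kernel, which mirrors the class structure, receives the larger weight, while the identity kernel, which carries no between-sample information, is down-weighted.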
20
Chakraborty D, Maulik U. Identifying Cancer Biomarkers From Microarray Data Using Feature Selection and Semisupervised Learning. IEEE JOURNAL OF TRANSLATIONAL ENGINEERING IN HEALTH AND MEDICINE-JTEHM 2014; 2:4300211. [PMID: 27170887 PMCID: PMC4848046 DOI: 10.1109/jtehm.2014.2375820] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/08/2014] [Revised: 09/20/2014] [Accepted: 11/22/2014] [Indexed: 11/07/2022]
Abstract
Microarrays have now gone from obscurity to being almost ubiquitous in biological research. At the same time, the statistical methodology for microarray analysis has progressed from simple visual assessments of results to novel algorithms for analyzing changes in expression profiles. In a micro-RNA (miRNA) or gene-expression profiling experiment, the expression levels of thousands of genes/miRNAs are simultaneously monitored to study the effects of certain treatments, diseases, and developmental stages on their expressions. Microarray-based gene expression profiling can be used to identify genes whose expression changes in response to pathogens or other organisms by comparing gene expression in infected cells or tissues to that in uninfected ones. Recent studies have revealed that patterns of altered microarray expression profiles in cancer can serve as molecular biomarkers for tumor diagnosis, prognosis of disease-specific outcomes, and prediction of therapeutic responses. Microarray data sets containing expression profiles of a number of miRNAs or genes are used to identify biomarkers, which have dysregulation in normal and malignant tissues. However, small sample size remains a bottleneck in designing successful classification methods. On the other hand, an adequate number of microarray data sets that lack clinical knowledge can be employed as an additional source of information. In this paper, a combination of kernelized fuzzy rough set (KFRS) and semisupervised support vector machine (S³VM) is proposed for predicting cancer biomarkers from one miRNA and three gene expression data sets. Biomarkers are discovered employing three feature selection methods, including KFRS. The effectiveness of the proposed KFRS and S³VM combination on the microarray data sets is demonstrated, and the cancer biomarkers identified from miRNA data are reported. Furthermore, biological significance tests are conducted for miRNA cancer biomarkers.
21
Charuvaka A, Rangwala H. Classifying Protein Sequences Using Regularized Multi-Task Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:1087-1098. [PMID: 26357046 DOI: 10.1109/tcbb.2014.2338303] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Classification problems in which several learning tasks are organized hierarchically pose a special challenge because the hierarchical structure of the problems needs to be considered. Multi-task learning (MTL) provides a framework for dealing with such interrelated learning tasks. When two different hierarchical sources organize similar information, in principle, this combined knowledge can be exploited to further improve classification performance. We have studied this problem in the context of protein structure classification by integrating the learning process for two hierarchical protein structure classification databases, SCOP and CATH. Our goal is to accurately predict whether a given protein belongs to a particular class in these hierarchies using only the amino acid sequences. We have utilized the recent developments in multi-task learning to solve the interrelated classification problems. We have also evaluated how the various relationships between tasks affect the classification performance. Our evaluations show that learning schemes in which both the classification databases are used outperform the schemes which utilize only one of them.
22
Fuzzy Preference Based Feature Selection and Semisupervised SVM for Cancer Classification. IEEE Trans Nanobioscience 2014; 13:152-60. [DOI: 10.1109/tnb.2014.2312132] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
23
Kuksa PP. Biological sequence classification with multivariate string kernels. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1201-1210. [PMID: 24384708 DOI: 10.1109/tcbb.2013.15] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
String kernel-based machine learning methods have yielded great success in practical tasks of structured/sequential data analysis. They often exhibit state-of-the-art performance on many practical tasks of sequence analysis such as biological sequence classification, remote homology detection, or protein superfamily and fold prediction. However, typical string kernel methods rely on the analysis of discrete 1D string data (e.g., DNA or amino acid sequences). In this paper, we address the multiclass biological sequence classification problems using multivariate representations in the form of sequences of feature vectors (as in biological sequence profiles, or sequences of individual amino acid physicochemical descriptors) and a class of multivariate string kernels that exploit these representations. On three protein sequence classification tasks, the proposed multivariate representations and kernels show significant 15-20 percent improvements compared to existing state-of-the-art sequence classification methods.
24
Yu G, Rangwala H, Domeniconi C, Zhang G, Yu Z. Protein function prediction using multilabel ensemble classification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1045-57. [PMID: 24334396 DOI: 10.1109/tcbb.2013.111] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
High-throughput experimental techniques produce several kinds of heterogeneous proteomic and genomic data sets. To computationally annotate proteins, it is necessary and promising to integrate these heterogeneous data sources. Some methods transform these data sources into different kernels or feature representations. Next, these kernels are linearly (or nonlinearly) combined into a composite kernel. The composite kernel is utilized to develop a predictive model to infer the function of proteins. A protein can have multiple roles and functions (or labels). Therefore, multilabel learning methods are also adapted for protein function prediction. We develop a transductive multilabel classifier (TMC) to predict multiple functions of proteins using several unlabeled proteins. We also propose a method called transductive multilabel ensemble classifier (TMEC) for integrating the different data sources using an ensemble approach. The TMEC trains a graph-based multilabel classifier on each single data source, and then combines the predictions of the individual classifiers. We use a directed birelational graph to capture the relationships between pairs of proteins, between pairs of functions, and between proteins and functions. We evaluate the effectiveness of the TMC and TMEC to predict the functions of proteins on three benchmarks. We show that our approaches perform better than recently proposed protein function prediction methods on composite and multiple kernels. The code, data sets used in this paper and supplemental material are available at https://sites.google.com/site/guoxian85/tmec.
Affiliation(s)
- Guoxian Yu
  - Southwest University, Beibei and South China University of Technology, Guangzhou
- Guoji Zhang
  - South China University of Technology, Guangzhou
- Zhiwen Yu
  - South China University of Technology, Guangzhou
25
Hamp T, Goldberg T, Rost B. Accelerating the Original Profile Kernel. PLoS One 2013; 8:e68459. [PMID: 23825697 PMCID: PMC3688983 DOI: 10.1371/journal.pone.0068459] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2013] [Accepted: 05/31/2013] [Indexed: 11/19/2022] Open
Abstract
One of the most accurate multi-class protein classification systems continues to be the profile-based SVM kernel introduced by the Leslie group. Unfortunately, its CPU requirements render it too slow for practical applications of large-scale classification tasks. Here, we introduce several software improvements that enable significant acceleration. Using various non-redundant data sets, we demonstrate that our new implementation reaches a maximal speed-up as high as 14-fold for calculating the same kernel matrix. Some predictions are over 200 times faster and render the kernel as possibly the top contender in a low ratio of speed/performance. Additionally, we explain how to parallelize various computations and provide an integrative program that reduces creating a production-quality classifier to a single program call. The new implementation is available as a Debian package under a free academic license and does not depend on commercial software. For non-Debian based distributions, the source package ships with a traditional Makefile-based installer. Download and installation instructions can be found at https://rostlab.org/owiki/index.php/Fast_Profile_Kernel. Bugs and other issues may be reported at https://rostlab.org/bugzilla3/enter_bug.cgi?product=fastprofkernel.
Affiliation(s)
- Tobias Hamp
  - Bioinformatics & Computational Biology - I12, Department of Informatics, Technical University of Munich, Garching/Munich, Germany
- Tatyana Goldberg
  - Bioinformatics & Computational Biology - I12, Department of Informatics, Technical University of Munich, Garching/Munich, Germany
  - Center of Doctoral Studies in Informatics and Its Applications (CeDoSIA), Technical University of Munich Graduate School, Garching/Munich, Germany
- Burkhard Rost
  - Bioinformatics & Computational Biology - I12, Department of Informatics, Technical University of Munich, Garching/Munich, Germany
  - Institute of Advanced Study (TUM-IAS), Garching/Munich, Germany
  - New York Consortium on Membrane Protein Structure (NYCOMPS) and Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York, United States of America
26
Abstract
BACKGROUND Understanding the localization of proteins in cells is vital to characterizing their functions and possible interactions. As a result, identifying the (sub)cellular compartment within which a protein is located becomes an important problem in protein classification. This classification issue thus involves predicting labels in a dataset with a limited number of labeled data points available. By utilizing a graph representation of protein data, random walk techniques have performed well in sequence classification and functional prediction; however, this method has not yet been applied to protein localization. Accordingly, we propose a novel classifier in the site prediction of proteins based on random walks on a graph. RESULTS We propose a graph theory model for predicting protein localization using data generated in yeast and gram-negative (Gneg) bacteria. We tested the performance of our classifier on the two datasets, optimizing the model training parameters by varying the laziness values and the number of steps taken during the random walk. Using 10-fold cross-validation, we achieved an accuracy of above 61% for yeast data and about 93% for gram-negative bacteria. CONCLUSIONS This study presents a new classifier derived from the random walk technique and applies this classifier to investigate the cellular localization of proteins. The prediction accuracy and additional validation demonstrate an improvement over previous methods, such as support vector machine (SVM)-based classifiers.
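The abstract does not spell out the classifier, so the following is only a minimal sketch of one way a lazy random walk on a graph can assign a label: start at the unlabeled node, take a fixed number of lazy steps, and pick the class whose labeled nodes collect the most probability mass. The graph, the compartment labels, and the default `laziness` and `steps` values are hypothetical.

```python
def lazy_walk_step(dist, adjacency, laziness):
    """One step of a lazy random walk: stay with probability `laziness`,
    otherwise move to a neighbour in proportion to the edge weight."""
    n = len(adjacency)
    new = [laziness * p for p in dist]
    for i in range(n):
        degree = sum(adjacency[i])
        if degree == 0:
            new[i] += (1 - laziness) * dist[i]  # isolated node keeps its mass
            continue
        for j in range(n):
            new[j] += (1 - laziness) * dist[i] * adjacency[i][j] / degree
    return new


def classify(node, adjacency, labels, laziness=0.5, steps=3):
    """Label `node` with the class whose labeled nodes receive the most walk mass."""
    dist = [0.0] * len(adjacency)
    dist[node] = 1.0
    for _ in range(steps):
        dist = lazy_walk_step(dist, adjacency, laziness)
    scores = {}
    for i, label in labels.items():
        scores[label] = scores.get(label, 0.0) + dist[i]
    return max(scores, key=scores.get), dist


# Toy similarity graph: node 2 is the unlabeled protein.
adjacency = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
labels = {0: "cytoplasm", 1: "cytoplasm", 4: "nucleus"}
predicted, dist = classify(2, adjacency, labels)
```

Varying `laziness` and `steps` corresponds to the two training parameters the abstract says were tuned.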
Affiliation(s)
- Xiaohua Xu
  - Department of Computer Science, Yangzhou University, Yangzhou 225009, China
- Lin Lu
  - Department of Computer Science, Yangzhou University, Yangzhou 225009, China
- Ping He
  - Department of Computer Science, Yangzhou University, Yangzhou 225009, China
- Ling Chen
  - Department of Computer Science, Yangzhou University, Yangzhou 225009, China
27
Maulik U, Mukhopadhyay A, Chakraborty D. Gene-Expression-Based Cancer Subtypes Prediction Through Feature Selection and Transductive SVM. IEEE Trans Biomed Eng 2013; 60:1111-7. [DOI: 10.1109/tbme.2012.2225622] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
28
Maulik U, Sarkar A. Searching remote homology with spectral clustering with symmetry in neighborhood cluster kernels. PLoS One 2013; 8:e46468. [PMID: 23457439 PMCID: PMC3574063 DOI: 10.1371/journal.pone.0046468] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2011] [Accepted: 09/04/2012] [Indexed: 11/18/2022] Open
Abstract
Remote homology detection among proteins utilizing only the unlabelled sequences is a central problem in comparative genomics. The existing cluster kernel methods based on neighborhoods and profiles and the Markov clustering algorithms are currently the most popular methods for protein family recognition. The deviation from random walks with inflation or dependency on hard threshold in similarity measure in those methods requires an enhancement for homology detection among multi-domain proteins. We propose to combine spectral clustering with neighborhood kernels in Markov similarity for enhancing sensitivity in detecting homology independent of "recent" paralogs. The spectral clustering approach with new combined local alignment kernels more effectively exploits the unsupervised protein sequences globally, reducing inter-cluster walks. When combined with the corrections based on a modified symmetry-based proximity norm that deemphasizes outliers, the technique proposed in this article outperforms other state-of-the-art cluster kernels among all twelve implemented kernels. The comparison with the state-of-the-art string and mismatch kernels also shows the superior performance scores provided by the proposed kernels. A similar performance improvement is also found over an existing large dataset. Therefore the proposed spectral clustering framework over combined local alignment kernels with modified symmetry-based correction achieves superior performance for unsupervised remote homolog detection even in multi-domain and promiscuous domain proteins from Genolevures database families with better biological relevance. Source code available upon request. Contact: sarkar@labri.fr.
Affiliation(s)
- Ujjwal Maulik
  - Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India.
29
Wang S, Huang Q, Jiang S, Tian Q, Qin L. Nearest-neighbor method using multiple neighborhood similarities for social media data mining. Neurocomputing 2012. [DOI: 10.1016/j.neucom.2011.06.039] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
30
Abstract
In this paper we introduce online semi-supervised growing neural gas (OSSGNG), a novel online semi-supervised classification approach based on growing neural gas (GNG). Existing semi-supervised classification approaches based on GNG require that the training data is explicitly stored, as the labeling is performed a posteriori after the training phase. As our main contribution, we present an approach that relies on online labeling and prediction functions to process labeled and unlabeled data uniformly and in an online fashion, without the need to store any of the training examples explicitly. We show that using on-the-fly labeling strategies does not significantly deteriorate the performance of classifiers based on GNG, while circumventing the need to explicitly store training examples. Armed with this result, we then present a semi-supervised extension of GNG (OSSGNG) that relies on the above-mentioned online labeling functions to label unlabeled examples and incorporate them into the model on-the-fly. As an important result, we show that OSSGNG performs as well as previous semi-supervised extensions of GNG which rely on offline labeling strategies. We also show that OSSGNG compares favorably to other state-of-the-art semi-supervised learning approaches on standard benchmarking datasets.
Affiliation(s)
- Oliver Beyer
  - Semantic Computing Group, CITEC, Bielefeld University, Bielefeld, Germany
- Philipp Cimiano
  - Semantic Computing Group, CITEC, Bielefeld University, Bielefeld, Germany
31
Recursive weighted kernel regression for semi-supervised soft-sensing modeling of fed-batch processes. J Taiwan Inst Chem Eng 2012. [DOI: 10.1016/j.jtice.2011.06.002] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
32
Mutual or Unrequited Love: Identifying Stable Clusters in Social Networks with Uni- and Bi-directional Links. LECTURE NOTES IN COMPUTER SCIENCE 2012. [DOI: 10.1007/978-3-642-30541-2_9] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/09/2023]
33
Bespalov D, Qi Y, Bai B, Shokoufandeh A. Sentiment Classification with Supervised Sequence Embedding. ACTA ACUST UNITED AC 2012. [DOI: 10.1007/978-3-642-33460-3_16] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/16/2023]
34
Nguyen TP, Ho TB. Detecting disease genes based on semi-supervised learning and protein-protein interaction networks. Artif Intell Med 2011; 54:63-71. [PMID: 22000346 DOI: 10.1016/j.artmed.2011.09.003] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2009] [Revised: 05/24/2011] [Accepted: 09/01/2011] [Indexed: 11/19/2022]
Abstract
OBJECTIVE Predicting or prioritizing the human genes that cause disease, or "disease genes", is one of the emerging tasks in biomedicine informatics. Research on network-based approaches to this problem rests on the key assumption that "the network-neighbour of a disease gene is likely to cause the same or a similar disease", and mostly employs data regarding well-known disease genes, using supervised learning methods. This work aims to find an effective method to exploit the disease gene neighbourhood and the integration of several useful omics data sources, which potentially enhance disease gene predictions. METHODS We have presented a novel method to effectively predict disease genes by exploiting, in the semi-supervised learning (SSL) scheme, data regarding both disease genes and disease gene neighbours via protein-protein interaction network. Multiple proteomic and genomic data were integrated from six biological databases, including Universal Protein Resource, Interologous Interaction Database, Reactome, Gene Ontology, Pfam, and InterDom, and a gene expression dataset. RESULTS By employing a 10 times stratified 10-fold cross validation, the SSL method performs better than the k-nearest neighbour method and the support vector machines method in terms of sensitivity of 85%, specificity of 79%, precision of 81%, accuracy of 82%, and a balanced F-function of 83%. The other comparative experimental evaluations demonstrate advantages of the proposed method given a small amount of labeled data with accuracy of 78%. We have applied the proposed method to detect 572 putative disease genes, which are biologically validated in some indirect ways. CONCLUSION Semi-supervised learning improves the ability to study disease genes, especially for a specific disease for which the known disease genes (the labeled data) are very often limited. In addition to the computational improvement, the analysis of predicted disease proteins indicates that the findings are beneficial in deciphering the pathogenic mechanisms.
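For readers less familiar with the reported measures, the sketch below shows how sensitivity, specificity, precision, accuracy and the balanced F-measure follow from a confusion matrix. The counts are hypothetical, chosen only to land near the percentages quoted in the abstract; the study's actual confusion matrix is not given there.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # recall on the positive class
    specificity = tn / (tn + fp)                 # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Balanced F-measure: harmonic mean of precision and sensitivity.
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical counts for 100 disease genes and 100 non-disease genes.
metrics = classification_metrics(tp=85, fp=21, tn=79, fn=15)
```

The harmonic mean of the reported precision (81%) and sensitivity (85%) is 2 × 0.81 × 0.85 / (0.81 + 0.85) ≈ 0.83, which agrees with the reported F value.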
35
Shi M, Zhang B. Semi-supervised learning improves gene expression-based prediction of cancer recurrence. Bioinformatics 2011; 27:3017-23. [PMID: 21893520 DOI: 10.1093/bioinformatics/btr502] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022]
Abstract
MOTIVATION Gene expression profiling has shown great potential in outcome prediction for different types of cancers. Nevertheless, small sample size remains a bottleneck in obtaining robust and accurate classifiers. Traditional supervised learning techniques can only work with labeled data. Consequently, a large number of microarray data that do not have sufficient follow-up information are disregarded. To fully leverage all of the precious data in public databases, we turned to a semi-supervised learning technique, low density separation (LDS). RESULTS Using a clinically important question of predicting recurrence risk in colorectal cancer patients, we demonstrated that (i) semi-supervised classification improved prediction accuracy as compared with the state of the art supervised method SVM, (ii) performance gain increased with the number of unlabeled samples, (iii) unlabeled data from different institutes could be employed after appropriate processing and (iv) the LDS method is robust with regard to the number of input features. To test the general applicability of this semi-supervised method, we further applied LDS on human breast cancer datasets and also observed superior performance. Our results demonstrated great potential of semi-supervised learning in gene expression-based outcome prediction for cancer patients. CONTACT bing.zhang@vanderbilt.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mingguang Shi
  - Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN 37232, USA
36
Conotoxin protein classification using free scores of words and support vector machines. BMC Bioinformatics 2011; 12:217. [PMID: 21619696 PMCID: PMC3133552 DOI: 10.1186/1471-2105-12-217] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2010] [Accepted: 05/29/2011] [Indexed: 11/23/2022] Open
Abstract
Background Conotoxin has been proven to be effective in drug design and could be used to treat various disorders such as schizophrenia, neuromuscular disorders and chronic pain. With the rapidly growing interest in conotoxin, accurate conotoxin superfamily classification tools are desirable to systematize the increasing number of newly discovered sequences and structures. However, despite the significance and extensive experimental investigations on conotoxin, those tools have not been intensively explored. Results In this paper, we propose to consider suboptimal alignments of words with restricted length. We developed a scoring system based on local alignment partition functions, called free score. The scoring system plays the key role in the feature extraction step of support vector machine classification. In the classification of conotoxin proteins, our method, SVM-Freescore, features an improved sensitivity and specificity by approximately 5.864% and 3.76%, respectively, over previously reported methods. For the generalization purpose, SVM-Freescore was also applied to classify superfamilies from curated and high quality database such as ConoServer. The average computed sensitivity and specificity for the superfamily classification were found to be 0.9742 and 0.9917, respectively. Conclusions The SVM-Freescore method is shown to be a useful sequence-based analysis tool for functional and structural characterization of conotoxin proteins. The datasets and the software are available at http://faculty.uaeu.ac.ae/nzaki/SVM-Freescore.htm.
37
Gui J, Wang SL, Lei YK. Multi-step dimensionality reduction and semi-supervised graph-based tumor classification using gene expression data. Artif Intell Med 2011; 50:181-91. [PMID: 20599367 DOI: 10.1016/j.artmed.2010.05.004] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2009] [Revised: 04/28/2010] [Accepted: 05/18/2010] [Indexed: 11/29/2022]
Abstract
OBJECTIVE Both supervised methods and unsupervised methods have been widely used to solve the tumor classification problem based on gene expression profiles. This paper introduces a semi-supervised graph-based method for tumor classification. Feature extraction plays a key role in tumor classification based on gene expression profiles, and can greatly improve the performance of a classifier. In this paper we propose a novel multi-step dimensionality reduction method for extracting tumor-related features. METHODS AND MATERIALS First the Wilcoxon rank-sum test is used for gene selection. Then gene ranking and discrete cosine transform are combined with principal component analysis for feature extraction. Finally, the performance is evaluated by semi-supervised learning algorithms. RESULTS To show the validity of the proposed method, we apply it to classify four tumor datasets involving various human normal and tumor tissue samples. The experimental results show that the proposed method is efficient and feasible. Compared with other methods, our method can achieve relatively higher prediction accuracy. Particularly, it is found that semi-supervised method is superior to support vector machines in classification performance. CONCLUSIONS The proposed approach can effectively improve the performance of tumor classification based on gene expression profiles. This work is a meaningful attempt to explore and apply multi-step dimensionality reduction and semi-supervised learning methods in the field of tumor classification. Considering the high classification accuracy, there should be much room for the application of multi-step dimensionality reduction and semi-supervised learning methods to perform tumor classification.
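The first step of the pipeline, gene selection by the Wilcoxon rank-sum test, can be sketched as follows. This is an illustrative reading rather than the authors' code: it ranks genes by the absolute normal-approximation z-score of the rank sum, ignores the tie correction in the variance, and uses invented gene names and expression values.

```python
def rank_sum_statistic(group_a, group_b):
    """Sum of the ranks of group_a in the pooled sample, with average ranks for ties."""
    pooled = sorted(group_a + group_b)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i..j-1 hold equal values; their 1-based ranks average to (i + 1 + j) / 2.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(rank_of[v] for v in group_a)


def rank_sum_z(group_a, group_b):
    """Normal-approximation z-score of the rank sum (no tie correction)."""
    n_a, n_b = len(group_a), len(group_b)
    w = rank_sum_statistic(group_a, group_b)
    mean = n_a * (n_a + n_b + 1) / 2
    variance = n_a * n_b * (n_a + n_b + 1) / 12
    return (w - mean) / variance ** 0.5


def select_genes(expression, is_tumor, top=1):
    """Return the `top` genes with the largest absolute rank-sum z-score."""
    scores = {}
    for gene, values in expression.items():
        tumor = [v for v, t in zip(values, is_tumor) if t]
        normal = [v for v, t in zip(values, is_tumor) if not t]
        scores[gene] = abs(rank_sum_z(tumor, normal))
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy data: three tumor and three normal samples, two hypothetical genes.
is_tumor = [True, True, True, False, False, False]
expression = {
    "geneA": [5.1, 6.0, 5.7, 1.2, 0.9, 1.5],
    "geneB": [2.0, 3.1, 2.4, 2.2, 3.0, 2.5],
}
selected = select_genes(expression, is_tumor, top=1)
```

The selected genes would then feed the later feature-extraction steps (discrete cosine transform and principal component analysis), which are not shown here.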
Affiliation(s)
- Jie Gui
  - Intelligent Computing Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031, China.
38
39
Caragea C, Caragea D, Silvescu A, Honavar V. Semi-supervised prediction of protein subcellular localization using abstraction augmented Markov models. BMC Bioinformatics 2010; 11 Suppl 8:S6. [PMID: 21034431 PMCID: PMC2966293 DOI: 10.1186/1471-2105-11-s8-s6] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is a growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data. RESULTS In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to the semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs) (which do not take advantage of unlabeled data); (ii) an expectation maximization (EM) based approach; and (iii) a co-training based approach to semi-supervised training of MMs (both of which make use of unlabeled data). CONCLUSIONS The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance to, and in some cases outperform, the co-training based semi-supervised MMs.
Affiliation(s)
- Cornelia Caragea
- Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University, Ames, IA 50010, USA.
|
40
|
Toussaint NC, Widmer C, Kohlbacher O, Rätsch G. Exploiting physico-chemical properties in string kernels. BMC Bioinformatics 2010; 11 Suppl 8:S7. [PMID: 21034432 PMCID: PMC2966294 DOI: 10.1186/1471-2105-11-s8-s7] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 11/10/2022]
Abstract
BACKGROUND String kernels are commonly used for the classification of biological sequences, nucleotide as well as amino acid sequences. Although string kernels are already very powerful, when it comes to amino acids they have a major shortcoming. They ignore an important piece of information when comparing amino acids: the physico-chemical properties such as size, hydrophobicity, or charge. This information is very valuable, especially when training data is less abundant. There have been only very few approaches so far that aim at combining these two ideas. RESULTS We propose new string kernels that combine the benefits of physico-chemical descriptors for amino acids with those of string kernels. The benefits of the proposed kernels are assessed on two problems: MHC-peptide binding classification using position specific kernels and protein classification based on the substring spectrum of the sequences. Our experiments demonstrate that the incorporation of amino acid properties in string kernels yields improved performance compared to standard string kernels and to previously proposed non-substring kernels. CONCLUSIONS In summary, the proposed modifications, in particular the combination with the RBF substring kernel, consistently yield improvements without affecting the computational complexity. The proposed kernels therefore appear to be the kernels of choice for any protein sequence-based inference. AVAILABILITY Data sets, code and additional information are available from http://www.fml.tuebingen.mpg.de/raetsch/suppl/aask. Implementations of the developed kernels are available as part of the Shogun toolbox.
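For orientation, the "substring spectrum" baseline mentioned in this abstract can be illustrated with a minimal sketch of a standard k-mer spectrum kernel. This is not the authors' code or the Shogun implementation; the function names `spectrum` and `spectrum_kernel` and the default `k=3` are assumptions made for illustration, and the property-aware kernels the abstract proposes are not reproduced here.

```python
from collections import Counter


def spectrum(seq, k):
    """Count every length-k substring (k-mer) occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Inner product of the k-mer count vectors of sequences a and b.

    Only exact k-mer matches contribute, which is the limitation the
    abstract points out for amino acid sequences.
    """
    sa, sb = spectrum(a, k), spectrum(b, k)
    # Counter returns 0 for k-mers absent from sb, so no key check is needed.
    return sum(count * sb[kmer] for kmer, count in sa.items())
```

Because the comparison is exact, two peptides that differ only by a conservative substitution share no k-mer covering that position, which is where physico-chemical descriptors could add information.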
Affiliation(s)
- Nora C Toussaint
- Center for Bioinformatics, Eberhard-Karls-Universität, Sand 14, 72076 Tübingen, Germany.
|
41
|
Santos MA, Turinsky AL, Ong S, Tsai J, Berger MF, Badis G, Talukder S, Gehrke AR, Bulyk ML, Hughes TR, Wodak SJ. Objective sequence-based subfamily classifications of mouse homeodomains reflect their in vitro DNA-binding preferences. Nucleic Acids Res 2010; 38:7927-42. [PMID: 20705649 PMCID: PMC3001082 DOI: 10.1093/nar/gkq714] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 12/01/2022]
Abstract
Classifying proteins into subgroups with similar molecular function on the basis of sequence is an important step in deriving reliable functional annotations computationally. So far, however, available classification procedures have been evaluated against protein subgroups that are defined by experts using mainly qualitative descriptions of molecular function. Recently, in vitro DNA-binding preferences to all possible 8-nt DNA sequences have been measured for 178 mouse homeodomains using protein-binding microarrays, offering the unprecedented opportunity of evaluating the classification methods against quantitative measures of molecular function. To this end, we automatically derive homeodomain subtypes from the DNA-binding data and independently group the same domains using sequence information alone. We test five sequence-based methods, which use different sequence-similarity measures and algorithms to group sequences. Results show that methods that optimize the classification robustness reflect well the detailed functional specificity revealed by the experimental data. In some of these classifications, 73–83% of the subfamilies exactly correspond to, or are completely contained in, the function-based subtypes. Our findings demonstrate that certain sequence-based classifications are capable of yielding very specific molecular function annotations. The availability of quantitative descriptions of molecular function, such as DNA-binding data, will be a key factor in exploiting this potential in the future.
Affiliation(s)
- Miguel A Santos
- Molecular Structure and Function Program, Hospital for Sick Children, Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
|
42
|
|
43
|
Shen YQ, Lang BF, Burger G. Diversity and dispersal of a ubiquitous protein family: acyl-CoA dehydrogenases. Nucleic Acids Res 2009; 37:5619-31. [PMID: 19625492 PMCID: PMC2761260 DOI: 10.1093/nar/gkp566] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.3] [Indexed: 11/25/2022]
Abstract
Acyl-CoA dehydrogenases (ACADs), which are key enzymes in fatty acid and amino acid catabolism, form a large, pan-taxonomic protein family with at least 13 distinct subfamilies. Yet most reported ACAD members have no subfamily assigned, and little is known about the taxonomic distribution and evolution of the subfamilies. In completely sequenced genomes from approximately 210 species (eukaryotes, bacteria and archaea), we detect ACAD subfamilies by rigorous ortholog identification combining sequence similarity search with phylogeny. We then construct taxonomic subfamily-distribution profiles and build phylogenetic trees with orthologous proteins. Subfamily profiles provide unparalleled insight into the organisms’ energy sources based on genome sequence alone and further predict enzyme substrate specificity, thus generating explicit working hypotheses for targeted biochemical experimentation. Eukaryotic ACAD subfamilies are traditionally considered as mitochondrial proteins, but we found evidence that in fungi one subfamily is located in peroxisomes and participates in a distinct β-oxidation pathway. Finally, we discern horizontal transfer, duplication, loss and secondary acquisition of ACAD genes during evolution of this family. Through these unorthodox expansion strategies, the ACAD family is proficient in utilizing a large range of fatty acids and amino acids—strategies that could have shaped the evolutionary history of many other ancient protein families.
Affiliation(s)
- Yao-Qing Shen
- Robert Cedergren Center for Bioinformatics and Genomics, Biochemistry Department, Université de Montréal, 2900 Edouard-Montpetit, Montreal, QC, H3T 1J4, Canada.
|
44
|
Min R, Bonner A, Li J, Zhang Z. Learned random-walk kernels and empirical-map kernels for protein sequence classification. J Comput Biol 2009; 16:457-74. [PMID: 19254184 DOI: 10.1089/cmb.2008.0031] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
Abstract
Biological sequence classification (such as protein remote homology detection) solely based on sequence data is an important problem in computational biology, especially in the current genomics era, when large amounts of sequence data are becoming available. Support vector machines (SVMs) based on mismatch string kernels were previously applied to solve this problem, achieving reasonable success. However, they still perform poorly on difficult protein families. In this paper, we propose two approaches to solve the protein remote homology detection problem: one uses a convex combination of random-walk kernels to approximate the random-walk kernel with the optimal random steps, and the other constructs an empirical-map kernel using a profile kernel. Both resulting kernels make use of a large amount of pairwise sequence similarity information and unlabeled data, and have much better prediction performance than the best profile kernel directly derived from protein sequences. On a competitive Structural Classification Of Proteins (SCOP) benchmark dataset, the overall mean ROC(50) scores on 54 protein families we obtained using both approaches are above 0.90, significantly outperforming previously published results.
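The ROC(50) score reported in this abstract is commonly understood as the area under the ROC curve up to the first 50 false positives, scaled to lie between 0 and 1. The sketch below shows one way to compute such a truncated score from a ranked list; it is not taken from the paper. The function name `roc_n` is hypothetical, and normalizing by the number of false positives actually encountered when fewer than `n` negatives exist is an assumption, since conventions for that case differ.

```python
def roc_n(scores, labels, n=50):
    """Truncated ROC score: area under the ROC curve up to the n-th false positive.

    scores: higher means "more likely positive"; labels: truthy for positives.
    Ties keep their input order (Python's sort is stable).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(1 for _, label in ranked if label)
    if total_pos == 0:
        return 0.0
    true_pos = 0
    false_pos = 0
    area = 0
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
            # Each false positive contributes the positives ranked above it.
            area += true_pos
            if false_pos == n:
                break
    if false_pos == 0:
        return 1.0  # no negatives at all: every positive is ranked first
    # Assumption: normalize by the false positives actually counted.
    return area / (false_pos * total_pos)
```

A perfect ranking scores 1.0, and a ranking that places `n` negatives above every positive scores 0.0, so a family-wise mean above 0.90 indicates that most true homologs are retrieved before the 50th false hit.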
Affiliation(s)
- Renqiang Min
- Department of Computer Science, University of Toronto, Toronto, Canada
|
45
|
Kuksa P, Huang PH, Pavlovic V. Efficient use of unlabeled data for protein sequence classification: a comparative study. BMC Bioinformatics 2009; 10 Suppl 4:S2. [PMID: 19426450 PMCID: PMC2681072 DOI: 10.1186/1471-2105-10-s4-s2] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022]
Abstract
BACKGROUND Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags, the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As over-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. RESULTS Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. CONCLUSION The unlabeled sequences used under the semi-supervised setting resemble unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy, but once cut and polished, they improve the accuracy of the classifiers considerably.
Affiliation(s)
- Pavel Kuksa
- Department of Computer Science, Rutgers University, Piscataway, NJ, 08854, USA
- Pai-Hsi Huang
- Department of Computer Science, Rutgers University, Piscataway, NJ, 08854, USA
- Vladimir Pavlovic
- Department of Computer Science, Rutgers University, Piscataway, NJ, 08854, USA
|
46
|
|
47
|
Jung I, Kim D. SIMPRO: simple protein homology detection method by using indirect signals. Bioinformatics 2009; 25:729-35. [DOI: 10.1093/bioinformatics/btp048] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
|
48
|
Application of nonnegative matrix factorization to improve profile-profile alignment features for fold recognition and remote homolog detection. BMC Bioinformatics 2008; 9:298. [PMID: 18590572 PMCID: PMC2459191 DOI: 10.1186/1471-2105-9-298] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 01/08/2008] [Accepted: 07/01/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND Nonnegative matrix factorization (NMF) is a feature extraction method that has the property of intuitive part-based representation of the original features. This unique ability makes NMF a potentially promising method for biological sequence analysis. Here, we apply NMF to fold recognition and remote homolog detection problems. Recent studies have shown that combining support vector machines (SVM) with profile-profile alignments improves performance of fold recognition and remote homolog detection remarkably. However, it is not clear which parts of sequences are essential for the performance improvement. RESULTS The performance of fold recognition and remote homolog detection using NMF features is compared to that of the unmodified profile-profile alignment (PPA) features by estimating Receiver Operating Characteristic (ROC) scores. The overall performance is noticeably improved. For fold recognition at the fold level, SVM with NMF features recognizes 30% of homologous proteins at ROC scores > 0.99, while the original PPA features, HHsearch, and PSI-BLAST recognize almost none. For detecting remote homologs that are related at the superfamily level, NMF features also achieve higher performance than the original PPA features. At ROC50 scores > 0.90, 25% of proteins with NMF features correctly detect remotely related proteins, whereas with the original PPA features only 1% of proteins detect remote homologs. In addition, we investigate the effect of the number of positive training examples and the number of basis vectors on performance improvement. We also analyze the ability of NMF to extract essential features by comparing NMF basis vectors with functionally important sites and structurally conserved regions of proteins. The results show that NMF basis vectors have significant overlap with functional sites from PROSITE and with structurally conserved regions from the multiple structural alignments generated by MUSTANG. The correlation between NMF basis vectors and biologically essential parts of proteins supports our conjecture that NMF basis vectors can explicitly represent important sites of proteins. CONCLUSION The present work demonstrates that applying NMF to profile-profile alignments can reveal essential features of proteins and that these features significantly improve the performance of fold recognition and remote homolog detection.
|
49
|
Shah AR, Oehmen CS, Webb-Robertson BJ. SVM-HUSTLE--an iterative semi-supervised machine learning approach for pairwise protein remote homology detection. Bioinformatics 2008; 24:783-90. [DOI: 10.1093/bioinformatics/btn028] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.9] [Indexed: 11/14/2022]
|
50
|
Mitra J, Mundra P, Kulkarni BD, Jayaraman VK. Using Recurrence Quantification Analysis Descriptors for Protein Sequence Classification with Support Vector Machines. J Biomol Struct Dyn 2007; 25:289-98. [DOI: 10.1080/07391102.2007.10507177] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 10/28/2022]
|